Did GLM-5.2 Beat Mythos? The Specialized Harness Beat the General-Purpose One

A few days ago, the static-analysis company Semgrep published a blog post with an eye-catching title: "We have Mythos at home: GLM-5.2 beats Claude in our cyber benchmarks." In one sentence: in their cybersecurity benchmark, the open-source GLM-5.2 outscored Claude. Chinese-language follow-up coverage appeared soon after.

Last time we put everyone's harness on a scale; this time, we want to talk about what a harness can actually do. Because after reading this benchmark, the thing most worth remembering is not which model won — it's a different sentence entirely.

Under the Headline: What Was Actually Measured

What Semgrep tested is one very specific class of vulnerability: IDOR (Insecure Direct Object Reference, a form of broken access control), in which the application exposes internal object IDs directly and never checks whether the caller is allowed to access the object an ID points to. Scoring uses F1, the harmonic mean of precision and recall.
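To make both halves of that concrete, here is a minimal sketch. The invoice store, the handler names, and the counts passed to the scoring helper are hypothetical and are not taken from Semgrep's dataset; the F1 helper is just the standard harmonic mean of precision and recall.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    id: int
    owner_id: int
    total: float


# Hypothetical in-memory store standing in for a database.
INVOICES = {
    1: Invoice(id=1, owner_id=100, total=25.0),
    2: Invoice(id=2, owner_id=200, total=99.0),
}


def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Invoice:
    # IDOR: the caller-supplied ID is trusted and ownership is never checked.
    return INVOICES[invoice_id]


def get_invoice_checked(current_user_id: int, invoice_id: int) -> Invoice:
    invoice = INVOICES[invoice_id]
    # The missing access-control step: compare the owner with the caller.
    if invoice.owner_id != current_user_id:
        raise PermissionError("invoice belongs to another user")
    return invoice


def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall; 0.0 if nothing correct was found."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

With these toy numbers, a detector that reports 4 findings of which 3 are real, while missing 3 others, has precision 0.75, recall 0.5, and an F1 of 0.6. A tool that floods the report with false positives and a tool that stays quiet both score poorly on this metric.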

The headline claim, "GLM-5.2 won," is real: GLM-5.2 scored 39% F1, while Opus 4.8 / 4.7 running inside Claude Code scored 28% (Opus 4.6, oddly, got 37%). It is also open-weights and costs about $0.17 per vulnerability; for an open-source model outside export controls that anyone can download, that headline has plenty of bite. As for the "Mythos" in the title: that's Anthropic's much-anticipated high-end security system (reportedly credited with unearthing over ten thousand critical vulnerabilities in internal projects), and "Mythos at home" is the joke, casting GLM-5.2 as the budget, everyman edition.

But if the story ended there, you'd only have seen half of it.

The Gap Isn't Only Between Models — It's Between Harnesses

Spread the full scorecard out, and the largest gap isn't between model and model; it's between harness and harness:

Configuration (model × harness)	IDOR F1
Semgrep multimodal × GPT-5.5	61%
Semgrep multimodal × Opus 4.8	53%
GLM-5.2 (open weights)	39%
Claude Code × Opus 4.6	37%
Claude Code × Opus 4.8 / 4.7	28%
GPT-5.5 (Codex, bare prompt)	20%

Look at two controlled pairs — same model, only the harness swapped:

GPT-5.5: used bare (Codex), just 20%; strapped into Semgrep's own specialized security harness, it shoots to 61% — three times over.
Opus 4.8: inside the general-purpose coding harness Claude Code, 28%; in Semgrep's specialized harness, 53% — nearly double.
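The arithmetic behind "three times" and "nearly double" can be checked directly. This is a small sketch using only the four F1 figures from the table above; the dictionary layout and the function name are our own.

```python
# F1 scores (percent) copied from the table above, keyed by (model, harness).
SCORES = {
    ("GPT-5.5", "Codex, bare prompt"): 20,
    ("GPT-5.5", "Semgrep multimodal"): 61,
    ("Opus 4.8", "Claude Code"): 28,
    ("Opus 4.8", "Semgrep multimodal"): 53,
}


def harness_gain(model: str, general: str, specialized: str) -> float:
    """Ratio of specialized-harness F1 to general-harness F1 for one model."""
    return SCORES[(model, specialized)] / SCORES[(model, general)]


gpt_gain = harness_gain("GPT-5.5", "Codex, bare prompt", "Semgrep multimodal")
opus_gain = harness_gain("Opus 4.8", "Claude Code", "Semgrep multimodal")

print(f"GPT-5.5:  {gpt_gain:.2f}x")   # 3.05x
print(f"Opus 4.8: {opus_gain:.2f}x")  # 1.89x
```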

Same "brain," different harness, wildly different grades. Semgrep condensed the finding into one heavy sentence:

The largest performance gap in the table is not between models, but between configurations that get endpoint discovery and those that don't.

The harness still matters more than the model.

In our view, a more even-handed way to put it is this: once the model is fixed, the harness becomes the deciding factor; and when models differ, the choice of harness matters even more. The model matters, and the harness matters; if either piece is missing, the whole agent struggles to get anything done. What this benchmark really lifts the lid on isn't "the harness crushes the model." It's that the same model, in a different harness, can differ by a factor of three: a variable underrated for far too long, finally put on the table. In the end, a harness doesn't replace the model; it lets you actually bring the model's power to bear.

One aside, and not entirely a digression: harness, the word Semgrep uses, is the trade's own term for the gear wrapped around a model — tools, scaffolding, context orchestration, the agent framework. It's also, literally, the tack you strap on a draft animal — which is exactly the metaphor this series of ours runs on. The model is the animal that supplies the strength; the harness decides where that strength points, and whether it can be used at all. So "the harness still matters more than the model" and this whole "harness" series of ours are talking about the same word, and the same thing.

What did their harness actually do? It enumerated the application's endpoints for the model, one by one, then used code to filter the context down to only the parts that matter — freeing the model from fishing needles out of the haystack, and leaving it only the judgment calls it does best.
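The general shape of that idea can be sketched in a few lines. This is not Semgrep's implementation: the Flask-style route decorators, the line-based parsing, the "path parameter means object reference" filter, and the prompt wording are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Assumption: routes use Flask-style decorators placed directly above a
# top-level handler, e.g. @app.route("/invoices/<int:invoice_id>").
ROUTE_RE = re.compile(r'@\w+\.(?:route|get|post|put|delete)\(\s*["\']([^"\']+)["\']')


@dataclass
class Endpoint:
    path: str
    handler_source: str


def enumerate_endpoints(source: str) -> list[Endpoint]:
    """Step 1: list every route together with the source of its handler."""
    lines = source.splitlines()
    endpoints = []
    i = 0
    while i < len(lines):
        match = ROUTE_RE.match(lines[i])
        if match is None:
            i += 1
            continue
        # The handler is the def line plus the indented (or blank) lines after it.
        j = i + 1
        while j < len(lines) and (
            j == i + 1 or lines[j][:1] in (" ", "\t") or not lines[j].strip()
        ):
            j += 1
        handler = "\n".join(lines[i + 1 : j]).rstrip()
        endpoints.append(Endpoint(match.group(1), handler))
        i = j
    return endpoints


def references_object(endpoint: Endpoint) -> bool:
    """Step 2: keep only routes that take an ID-like path parameter."""
    return "<" in endpoint.path


def build_prompts(source: str) -> list[str]:
    """Step 3: hand the model one small, focused question per endpoint."""
    question = "Does this handler verify that the caller may access the referenced object?"
    return [
        f"Endpoint: {e.path}\n{e.handler_source}\n\n{question}"
        for e in enumerate_endpoints(source)
        if references_object(e)
    ]


# Hypothetical application source used only to exercise the sketch.
SAMPLE_SOURCE = '''
@app.route("/health")
def health():
    return "ok"

@app.route("/invoices/<int:invoice_id>")
def get_invoice(invoice_id):
    return db.invoices[invoice_id]
'''

prompts = build_prompts(SAMPLE_SOURCE)
```

In this toy run, two endpoints are discovered but only the invoice route reaches the model. The deterministic code does the searching, and the model is left with a single yes-or-no judgment about one handler.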

A Specialized Harness Beats a General-Purpose One

This is the point in the benchmark most worth remembering.

The top two entries (61%, 53%) are not "the strongest models" — they are models "fitted into the harness that understands the job best." Claude Code and Codex are excellent general-purpose coding harnesses — but they were never specifically optimized for hunting broken access control; Semgrep's multimodal harness is specialized — it knows endpoints, knows access control, knows which slice of context to feed the model. So the benchmark lays out a plain but forceful conclusion:

In this benchmark, a specialized harness paired with an ordinary model beats a general-purpose harness paired with a top-tier one.

Of course, the benchmark deserves its caveats — Semgrep itself reminds readers this is "one task, one dataset, one run"; IDOR detection is inherently non-deterministic and the dataset is limited, so don't rush to extrapolate to every vulnerability class. But the direction is clear enough: in specialized domains, the harness is one of the key factors deciding how a model performs — a variable long underrated, yet weighty enough to decide the outcome.

The More Specialized the Job, the More It Needs a Specialized Harness

By this point, you can probably guess why this made us smile.

We wrote earlier that the innate limits of the model meant we could never slug it out with the giants on raw coding ability; so from day one, AVL Code took a different road: building a specialized harness that knows security, with built-in executable-format parsing (PE / ELF / Mach-O), hashing and entropy, string and IOC extraction, disassembly, LLM-driven decompilation, and YARA matching, plus a hard read-only rule on samples/. In other words, what we've been doing follows the same line of thought as the Semgrep harness that lifted those scores: choose the model well, but above all, polish the harness until nothing understands the job better.
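For readers unfamiliar with these triage primitives, here is a minimal standard-library sketch of four of them: format detection by magic bytes, hashing, entropy, and string extraction. It is not AVL Code's implementation; every function name is hypothetical, and a real parser would validate far more than the first four bytes.

```python
import hashlib
import math
import re
from collections import Counter

# Leading magic bytes. "MZ" only marks a DOS header, so it is a PE candidate;
# the Mach-O entries cover the thin 32/64-bit magics in both byte orders.
MACHO_MAGICS = (
    b"\xfe\xed\xfa\xce", b"\xce\xfa\xed\xfe",
    b"\xfe\xed\xfa\xcf", b"\xcf\xfa\xed\xfe",
)


def detect_format(data: bytes) -> str:
    if data.startswith(b"\x7fELF"):
        return "ELF"
    if data.startswith(b"MZ"):
        return "PE (candidate)"
    if data[:4] in MACHO_MAGICS:
        return "Mach-O"
    return "unknown"


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def shannon_entropy(data: bytes) -> float:
    """Bits per byte: 0.0 for constant data, 8.0 for uniformly spread bytes."""
    if not data:
        return 0.0
    total = len(data)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(data).values())


def ascii_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Runs of printable ASCII at least min_len long, like the strings tool."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]
```

The point of building such steps into the harness is the same as with endpoint enumeration: deterministic code produces the facts (format, hash, entropy, strings), so the model spends its effort on interpreting them.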

Semgrep's benchmark amounts to an external validation run on our behalf: for security work, fitting a harness that knows security matters as much as picking a strong enough model — it's just that the former is chronically underrated. It also echoes what we've held all along — the contest of models is still at full boil, but beyond the model, the harness equally decides whether it can get the job done in your real-world scenario.

We're also clear that one external benchmark is far from the finish line. Testing "real-world capability" is by nature a long, ongoing process; Semgrep's entry is just a starting point — a sign the industry is beginning to take "harness capability" seriously. As AVL Code sees deeper and wider use, the Landi team will publish what we observe ourselves — promptly, and honestly.

For Security Work, Fit a Harness That Knows Security

To close in one sentence: the model matters, and the harness matters — for security work, fit a harness that knows security.

Headlines will change — GLM-5.2 leads today, and some other model will top the chart tomorrow; but "a specialized harness beats a general-purpose one" will, in all likelihood, keep holding true. If the work on your bench is security analysis, binary triage, or reverse-engineering attribution, then beyond choosing a good model, don't skip the step that matters just as much: fit it with a harness built for this exact job, and let the model's power truly land.

A good model deserves a good saddle — get your harness from Antiy. We're waiting for you at avlcode.cn — riding a donkey, in a harness that knows you.

References: Semgrep, "We have Mythos at home: GLM-5.2 beats Claude in our cyber benchmarks"; FreeBuf, "智谱 AI 新模型 GLM-5.2 在漏洞检测领域比肩 Claude Mythos" (in Chinese). Figures cited from the sources above; the benchmark is a single-task, single-dataset, non-deterministic test — for reference only.

AVL Code — the AVL security engine, with intelligence at your side. From the Antiy Landi team.