SWE-bench Is Broken — And That Matters for Every Developer Picking a Model

Here are two numbers from the same model family, measuring the same skill, published weeks apart.

Claude Mythos 5 on SWE-bench Verified: 95.5%. Claude Mythos Preview on SWE-bench Pro: 45.9%. The top models on the public SWE-bench Pro leaderboard score around 23%.

SWE-bench is the benchmark most developers cite when comparing AI coding tools. The numbers that appear in product announcements, blog posts, and Hacker News threads are almost always from SWE-bench Verified. And those numbers are increasingly unreliable.

What SWE-bench Actually Tests

SWE-bench was introduced by Princeton researchers in late 2023 and published at ICLR 2024. The idea was rigorous: take real GitHub issues from real open-source projects, ask AI models to fix them, and verify the fixes using the project's own test suite. No hand-crafted problems, no synthetic tasks, just actual software engineering work.

SWE-bench Verified, released by OpenAI in August 2024, is a human-validated subset of 500 tasks. Annotators checked that each issue is solvable and that its tests are reliable. It became the standard benchmark for AI coding capability almost immediately.

SWE-bench Pro is what Scale AI built in response to a growing problem: the models were getting too good at Verified, and not in a way that meant they were getting better at coding.

The Contamination Problem

When a benchmark becomes widely used, it gets incorporated into training data. Models see the benchmark problems — or problems very similar to them — during training, and learn to produce outputs that score well on that specific distribution of tasks. This is called benchmark contamination, and it's a known problem across every area of AI evaluation.

The 95%+ scores on SWE-bench Verified don't mean the models can solve 95% of real software engineering tasks. They mean the models are very good at the specific kinds of tasks that appear in SWE-bench Verified — a distribution that has been public and widely known for two years.
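One rough way to make contamination concrete is to ask how much of a benchmark task's text also appears, word for word, in a sample of training data. The sketch below is illustrative only: the function names `ngrams` and `overlap_ratio` and the 8-word window are assumptions, not anything SWE-bench or a model vendor is known to use, and real contamination audits are considerably more involved.

```python
from __future__ import annotations


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    # If the text has fewer than n words, the range is empty and so is the set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(task_text: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the task's n-grams that also occur in the corpus sample."""
    task_grams = ngrams(task_text, n)
    if not task_grams:
        return 0.0
    return len(task_grams & ngrams(corpus_text, n)) / len(task_grams)
```

A ratio near 1.0 suggests the task text was likely present in the sampled corpus. A low ratio proves little, because paraphrased or near-duplicate problems would not be caught by exact matching.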

The number that matters

Top models score around 23% on the SWE-bench Pro public set — a contamination-resistant benchmark built specifically because SWE-bench Verified had become unreliable. 23% on hard, unseen problems vs. 95% on a known distribution. That's the real gap.

SWE-bench Pro is built to be contamination-resistant: newer problems, drawn from a broader set of repositories, using tasks that weren't public at training time. It's harder partly by design and partly because it's a more honest test. The 23% figure is not a gotcha — it's a more accurate picture of where the models actually are.
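The "weren't public at training time" idea can be sketched as a simple date filter. This is not a description of how SWE-bench Pro is actually constructed; the `Task` type, the `first_public` field, and the `held_out` function are hypothetical names used only to illustrate the principle.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    task_id: str
    first_public: date  # assumed: the date the issue and its fix became publicly visible


def held_out(tasks: list[Task], training_cutoff: date) -> list[Task]:
    """Keep only tasks that became public strictly after the training cutoff."""
    return [t for t in tasks if t.first_public > training_cutoff]
```

A date filter like this is only as good as the cutoff you feed it. If a model's training cutoff is undisclosed or approximate, the filter gives weaker guarantees than a genuinely private test set.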

Why This Matters for Developers Choosing Tools

If you're choosing an AI coding tool based on benchmark comparisons, you're mostly comparing how well models have learned to perform on SWE-bench Verified. That's useful signal — a model that does well on contaminated benchmarks is still a capable model — but it's not the signal it appears to be.

The practical implication: benchmarks are a starting point, not a conclusion. A model that scores 95% on SWE-bench Verified and 46% on SWE-bench Pro is genuinely more capable than one that scores 70% and 20% — but both are much further from 'solved software engineering' than the Verified numbers suggest.
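To put rough numbers on that comparison, the snippet below computes how much of each model's Verified score survives on Pro. The two score pairs are the illustrative ones from the paragraph above, `model_a` and `model_b` are placeholder names, and the ratio is only a back-of-the-envelope indicator, since the two benchmarks contain different tasks.

```python
def retained_fraction(verified_pct: float, pro_pct: float) -> float:
    """Share of the Verified score that survives on the harder benchmark."""
    return pro_pct / verified_pct


# Illustrative (Verified %, Pro %) pairs taken from the text above.
models = {"model_a": (95.0, 46.0), "model_b": (70.0, 20.0)}

for name, (verified, pro) in models.items():
    kept = retained_fraction(verified, pro)
    print(f"{name}: Verified {verified:.0f}%, Pro {pro:.0f}%, retained {kept:.0%}")
# model_a: Verified 95%, Pro 46%, retained 48%
# model_b: Verified 70%, Pro 20%, retained 29%
```

Both models keep less than half of their headline score, and the stronger one holds up better, which matches the ordering the Verified numbers suggest while showing how much they overstate.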

Use benchmarks to filter options, not to make final decisions — they narrow the field, they don't pick the winner
Test on your actual codebase and your actual task types — a model that's great at Python web apps may be mediocre at your embedded C++ repo
Pay attention to which benchmark is being cited — Verified vs. Pro is a meaningful distinction now
Trust your own evaluation over marketing materials — run the same prompt across multiple models on a real problem from your backlog
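The last two points can be turned into a very small harness. The sketch below assumes you wrap each model behind a plain function from prompt to output and pair each backlog task with a check that decides whether the output solved it. The names `Model`, `Check`, and `pass_rates` are hypothetical, and no particular vendor API is implied.

```python
from __future__ import annotations

from typing import Callable

Model = Callable[[str], str]   # prompt -> model output (your own wrapper around a tool or API)
Check = Callable[[str], bool]  # model output -> True if the task counts as solved


def pass_rates(
    models: dict[str, Model],
    tasks: list[tuple[str, Check]],
) -> dict[str, float]:
    """Run every model on every task and return the fraction of tasks each one passes."""
    rates: dict[str, float] = {}
    for name, model in models.items():
        passed = sum(1 for prompt, check in tasks if check(model(prompt)))
        rates[name] = passed / len(tasks) if tasks else 0.0
    return rates
```

In practice the check would apply the model's patch to a clean checkout and run your project's own test suite, which mirrors how SWE-bench verifies fixes. Even a dozen tasks drawn from your real backlog will tell you more about fit than a leaderboard position.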

The Deeper Issue

Benchmark contamination is a symptom of a deeper dynamic: the evaluation and the training are in a feedback loop. As soon as a benchmark becomes the standard measure of capability, it starts shaping what models get optimised for. The benchmark stops measuring the thing it was built to measure and starts measuring benchmark performance.

This isn't unique to AI. It's Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The difference with AI is how fast the feedback loop runs — a benchmark can go from 'reliable signal' to 'compromised measure' within a single training cycle.

SWE-bench Pro is a genuine attempt to stay ahead of this. So is the ongoing work on private evaluation sets, held-out test data, and task distributions that don't get published until after models are trained. None of these are perfect solutions — but they're the right direction.

“95% on a contaminated benchmark and 23% on a clean one aren't contradictory. They're measuring different things. The problem is when the marketing only shows you one number.”

The honest answer is that we don't yet have great tools for evaluating AI coding capability on real-world tasks at scale. SWE-bench was a significant step forward when it was introduced. Its successor benchmarks are getting better. In the meantime, the most reliable evaluation you can run is on your own code and your own problems, using your own team's judgment.
