Theo argues that popular AI coding benchmarks like SWE-bench Pro and Arena are badly contaminated, gamed, and poorly aligned with real developer work, so they overstate some models and understate others. He presents DeepSWE as a more realistic, behavior-focused benchmark and uses it to argue that OpenAI’s latest models materially outperform models from Anthropic, Google (Gemini), and most open-weight providers on real coding tasks, though he notes the benchmark still has limits.
Theo’s core thesis is that existing coding benchmarks have become unreliable, and that a new benchmark, DeepSWE, better reflects how developers actually use coding agents. He says SWE-bench Pro in particular is contaminated, prone to cheating, and often scored with weak verification, which makes its results misleading. He frames DeepSWE as the benchmark he has wanted for years because it uses shorter, more natural prompts, handwritten behavioral verifiers, and novel tasks that avoid GitHub leakage and contamination from existing solutions.
He spends much of the video explaining why the old benchmarks break down. He claims the problems are unrealistic, the prompts are often nonsensical or overly prescriptive, the repos sometimes already contain solutions, and the verification process itself is weak enough that analyzer/verifier disagreements are common. …
For immediate model selection, the video says the newest OpenAI coding models look like the safest bet, while Gemini 3.5 Flash looks poor on both quality and efficiency. Treat current benchmark headlines cautiously, because differences in harness and contamination can quickly change how models appear to rank.
Over the next few weeks, the key test is whether other teams reproduce DeepSWE-style results and whether vendor-native harnesses narrow the gap. If the gap persists under those harnesses, the market will increasingly treat realistic agent benchmarks as more credible than older leaderboard scores.
The structural takeaway is that AI coding leadership will be judged by workflow realism, not synthetic score-maxing. Benchmark design is becoming part of the competitive moat, because evals that mirror real developer work will shape which models are perceived as actually deployable.
Current coding benchmarks like SWE-bench Pro are contaminated and no longer reliably measure real coding ability.
He repeatedly says the benchmark is contaminated, contains leaked solutions, and produces misleading results.
DeepSWE is a more realistic benchmark because it uses short natural prompts, handwritten behavioral verification, and novel tasks without leaked solutions.
This is the central methodological case for the new benchmark.
OpenAI’s newest coding models outperform models from Anthropic, Google (Gemini), and most open-weight providers on the new benchmark.
He explicitly says OpenAI slaughtered the benchmark and gives model rankings.