
Open Source AI in 2026: The Gap Moved

April 28, 2026 · 7 min read

The best frontier model in 2026 is a moving target. SWE-bench Verified leadership has changed hands three times in the past six months. Terminal-Bench 2.0 has turned over four times. Both closed and open labs ship roughly monthly, and the open frontier is now within a few points of the closed flagships at 3× to 26× lower cost.

That cost gap remains wide. A coding step on Claude Opus 4.7 runs $25 per million output tokens. Those same million tokens on DeepSeek V4 cost $3.48, and DeepSeek achieves accuracy on Terminal-Bench 2.0 within 1.5% of Opus. A static integration that pins to a single closed model pays the closed-frontier price on every step.
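
For a sense of scale, here is a minimal Python sketch of what that per-token gap means for a single agent step. The prices are the ones quoted above; the 40,000-output-token step is a made-up figure for illustration.

```python
# Illustrative per-step cost comparison. The $/M-output-token prices are the
# ones quoted above; the 40,000-token step size is a made-up example.
PRICE_PER_M_OUTPUT = {
    "claude-opus-4.7": 25.00,
    "deepseek-v4": 3.48,
}

def step_cost(model: str, output_tokens: int) -> float:
    """Dollar cost of one agent step that emits `output_tokens` output tokens."""
    return PRICE_PER_M_OUTPUT[model] / 1_000_000 * output_tokens

for model in PRICE_PER_M_OUTPUT:
    print(f"{model}: ${step_cost(model, 40_000):.2f} per step")
# claude-opus-4.7: $1.00 per step
# deepseek-v4: $0.14 per step
```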

The plots below track every major checkpoint on SWE-bench Verified since October 2024 and Terminal-Bench 2.0 since its release.

[Figure: SWE-bench Verified accuracy (40–90%) by model release date, Oct '24–Apr '26. Points include GPT-5, Opus 4.7, DeepSeek V3, Qwen3-Max, GLM-5, MiniMax-M2.5, and DeepSeek V4, split by closed-weight vs. open-weight.]

Accuracy vs release date on SWE-bench Verified.

[Figure: Terminal-Bench 2.0 accuracy (40–90%) by model release date, Aug '25–Apr '26. Points include GPT-5.4, Opus 4.7, DeepSeek V3.2, MiniMax-M2.7, Qwen3.6-Max, GLM-5.1, and DeepSeek V4, split by closed-weight vs. open-weight.]

Accuracy vs release date on Terminal-Bench 2.0.

Two things stand out from these plots. Density: 2025–2026 looks nothing like 2024; this is what a monthly release cadence looks like in aggregate. Turnover: no lab has held the top of either benchmark for more than a few months.

The cost gap is just as wide as the accuracy gap is narrow. Closed flagships sit at $10 to $25 per million output tokens; the open frontier sits at $0.42 to $3.48, 3× to 60× lower.

[Figure: SWE-bench Verified accuracy (40–90%) vs. output price ($0–$25/M tokens). Points include Sonnet 3.7, GPT-5, Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, Opus 4.7, DeepSeek R1, Qwen3-Max, Kimi K2.5, MiniMax-M2.5, GLM-5, and DeepSeek V4, split by closed-weight vs. open-weight.]

SWE-bench Verified accuracy vs output cost ($/M tokens).

[Figure: Terminal-Bench 2.0 accuracy (40–90%) vs. output price ($0–$25/M tokens). Points include Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, GPT-5.4, Opus 4.7, GLM-4.7, MiniMax-M2.7, Qwen3.6-Max, GLM-5.1, and DeepSeek V4, split by closed-weight vs. open-weight.]

Terminal-Bench 2.0 accuracy vs output cost ($/M tokens).

The cheapest point on SWE-bench is DeepSeek V3.2 in December 2025 at less than a penny per accuracy point ($0.42 / 73.1%); the most expensive is Claude Opus 4 in May 2025 at over a dollar ($75 / 72.5%). At the latest snapshot, Claude Opus 4.7 at $0.29 per point is roughly 7× more expensive than DeepSeek V4 ($0.04) and 24× more expensive than MiniMax-M2.5 ($0.012). The Pareto frontier on SWE-bench Verified is three points: MiniMax-M2.5 at $0.95, DeepSeek V4 at $3.48, and Claude Opus 4.7 at $25. DeepSeek V4 alone undercuts Gemini 3.1 Pro by 3.5× at the same 80.6% accuracy.
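
To make that arithmetic reproducible, here is a short Python sketch of the cost-per-point and Pareto calculation. The prices are the ones quoted above; the MiniMax-M2.5 and Opus 4.7 accuracies and the Gemini 3.1 Pro price are back-derived from the ratios in this paragraph, so treat them as approximate.

```python
# Cost per accuracy point and the Pareto frontier, using the SWE-bench
# Verified numbers quoted in this section. MiniMax-M2.5 and Opus 4.7
# accuracies and the Gemini 3.1 Pro price are back-derived from the
# quoted ratios, so they are approximate.
models = {
    # name: (accuracy %, $/M output tokens)
    "MiniMax-M2.5":    (79.2, 0.95),   # accuracy ~= 0.95 / 0.012
    "DeepSeek V4":     (80.6, 3.48),
    "Gemini 3.1 Pro":  (80.6, 12.18),  # price ~= 3.48 * 3.5
    "Claude Opus 4.7": (86.2, 25.00),  # accuracy ~= 25 / 0.29
}

def dollars_per_point(acc: float, price: float) -> float:
    return price / acc

def pareto_frontier(points: dict) -> list:
    """Keep models that no other model beats on both accuracy and price."""
    frontier = []
    for name, (acc, price) in points.items():
        dominated = any(
            a >= acc and p <= price and (a > acc or p < price)
            for other, (a, p) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

for name, (acc, price) in models.items():
    print(f"{name}: ${dollars_per_point(acc, price):.3f} per accuracy point")
print("Pareto frontier:", pareto_frontier(models))
# Pareto frontier: ['MiniMax-M2.5', 'DeepSeek V4', 'Claude Opus 4.7']
```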

The open-weight side is also the fastest-moving cohort in the industry: DeepSeek shipped nine major checkpoints in sixteen months, and Qwen, GLM, and MiniMax each went from absent to frontier-adjacent inside a year.

No single model dominates the snapshot. GPT-5.4 leads Terminal-Bench. Claude Opus 4.7 leads SWE-bench. The cheap open frontiers (DeepSeek V4, MiniMax-M2.5, GLM-5, Qwen) lead on cost-per-point and sit within striking distance on accuracy. A routing step wants a tiny, cheap model. A simple coding step wants the cheapest open model that clears the bar. One hard reasoning step might justify the closed flagship at 26× the price. Meanwhile, the optimal mix moves every time a new release lands.
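
In code, that step-level split can be as simple as a routing table. This is a hand-rolled, hypothetical sketch, not any particular framework's API, and the tier assignments are illustrative:

```python
# Hypothetical per-step routing table: send each step type to the cheapest
# model that clears its quality bar. Tier assignments are illustrative,
# not a recommendation derived from the benchmarks above.
ROUTES = {
    "route":  "glm-5",            # classify / dispatch: small, cheap open model
    "code":   "deepseek-v4",      # ordinary coding step: cheapest model that clears the bar
    "reason": "claude-opus-4.7",  # rare hard reasoning step: flagship at ~26x the price
}

def pick_model(step_kind: str) -> str:
    """Fall back to the default coding tier for unknown step types."""
    return ROUTES.get(step_kind, ROUTES["code"])

assert pick_model("route") == "glm-5"
assert pick_model("summarize") == "deepseek-v4"
```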

Static integrations age in weeks

A production agent wired to a single model is fragile twice over. Fragile in time, because the cadence makes today’s choice obsolete in weeks. Fragile in cost, because pinning to one model usually means paying the closed-frontier price on steps where an open model would suffice at a tenth of the cost. The fix is the same in both directions: serving infrastructure that switches automatically across models, providers, and configurations as the landscape moves.

What survives the next eighteen months is serving infrastructure that absorbs the cadence and orchestrates across cost and capability automatically. It has to pick the right model for each step, rewrite the harness as traffic teaches it what works, and switch models the moment a new release lands, before the team even hears about it.
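
One way to picture that kind of infrastructure, as a hypothetical sketch rather than any real system's API: every time a new model ships, replay a slice of recent production steps against it and repoint each step type where it wins on cost-adjusted accuracy.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical auto-switching loop (not Motus's actual API): when a new model
# ships, replay recent production steps against it and repoint any step type
# where it beats the incumbent on cost-adjusted accuracy.

@dataclass
class Assignment:
    model: str
    accuracy: float            # pass rate on the replayed production slice
    price_per_m_tokens: float

def beats(challenger: Assignment, incumbent: Assignment, min_gain: float = 0.01) -> bool:
    """Cheaper at no loss in accuracy, or meaningfully more accurate at no extra cost."""
    cheaper = challenger.price_per_m_tokens < incumbent.price_per_m_tokens
    no_pricier = challenger.price_per_m_tokens <= incumbent.price_per_m_tokens
    no_worse = challenger.accuracy >= incumbent.accuracy
    better = challenger.accuracy >= incumbent.accuracy + min_gain
    return (no_worse and cheaper) or (better and no_pricier)

def on_new_release(
    routes: Dict[str, Assignment],
    model: str,
    price_per_m_tokens: float,
    replay: Callable[[str, str], float],   # (model, step_kind) -> accuracy on replayed traffic
) -> Dict[str, Assignment]:
    """Re-score the new model for every step type and switch wherever it wins."""
    for step_kind, incumbent in routes.items():
        challenger = Assignment(model, replay(model, step_kind), price_per_m_tokens)
        if beats(challenger, incumbent):
            routes[step_kind] = challenger
    return routes
```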

Motus: Agents that Learn in Prod

Motus helps agents learn to move with the frontier. Every time a new model ships, from any provider, open or closed, Motus benchmarks it against your production traffic and autonomously routes each step to whichever model suits it best. The same traces continuously sharpen the agent harness, model orchestration, and memory strategy, and drive latency down.

Two results worth highlighting. On SWE-bench Verified with the mini-swe-agent-v2 harness, Claude Opus 4.6 alone reaches 75.8%; Motus reaches 79% at 2.3× lower cost. On Terminal-Bench 2.0 with the Terminus 2 harness, Claude Opus 4.6 alone reaches 64%; Motus harness optimization brings that to 77.5%, and model orchestration further to 80.1%, at 2.4× lower cost. Both come from agents that autonomously learn with Motus.

The flywheel matters more than any single number. Motus turns the same forces that break static, single-model agents into momentum: production traffic feeds learning, and learning keeps the agent at the frontier at lower cost.

Ship your agents today. The tokens are on us during early preview. Check out our repo and join us on Slack.


Sources: vendor release posts and model cards from OpenAI, Anthropic, Google, DeepSeek, Alibaba (Qwen), Z.AI (GLM), and MiniMax, April 2024 through April 2026. SWE-bench Verified scores are as reported by model providers and evaluation scaffolds vary; treat single-digit gaps qualitatively. Cost is output price per million tokens at the standard tier.