Models
Frontier & open-weight models compared by capability — columns are flagship models, rows are the headline benchmarks that measure each one. Every number is the published figure, restated with a link to the source it came from · as of 2026-06-14. Curated matrix refreshed 20d ago; live leaderboards auto-track on the 6-hour cron.
We cite the benchmark authorities rather than re-rank them — Epoch AI, LMArena and Artificial Analysis. ChangeRadar's job is the changes: what gets released, repriced, deprecated, or quietly shifts behind a stable model id.
Vendor champions · best flagship per vendor · GPQA Diamond
Each vendor's most recent publicly available flagship, ordered by score (highest first), with Google · OpenAI · Anthropic · xAI highlighted. The weekly market-watch surfaces new releases automatically; a model tagged "reported" is the latest release, shown with vendor-reported scores (linked to source) until we independently cite it. Score = GPQA Diamond; every number links to its source.
Compare two models
Agentic coding
| Benchmark | Claude Fable 5 | Gemini 3.1 Pro | Δ |
|---|---|---|---|
| SWE-bench Verified (% resolved, pass@1) | 95 | 80.6 | +14.4 |
| SWE-bench Pro (% resolved, pass@1) | 80.3 | 54.2 | +26.1 |
3 coverage gaps — only one model reports these
- Terminal-Bench — Gemini 3.1 Pro 68.5
- LiveCodeBench — Gemini 3.1 Pro 2887
- FrontierCode — Claude Fable 5 29.3
Tool use & agents
5 coverage gaps — only one model reports these
- TAU-bench — Gemini 3.1 Pro 99.3
- OSWorld — Claude Fable 5 85
- BrowseComp — Gemini 3.1 Pro 85.9
- GDPval-AA — Claude Fable 5 1932
- MCP Atlas — Gemini 3.1 Pro 69.2
Science & reasoning
| Benchmark | Claude Fable 5 | Gemini 3.1 Pro | Δ |
|---|---|---|---|
| GPQA Diamond (% accuracy) | 92.6 | 94.3 | -1.7 |
2 coverage gaps — only one model reports these
- Humanity's Last Exam — Gemini 3.1 Pro 44.4
- ARC-AGI-2 — Gemini 3.1 Pro 77.1
General knowledge
1 coverage gap — only one model reports these
- MMMLU — Gemini 3.1 Pro 92.6
Multimodal
2 coverage gaps — only one model reports these
- MMMU — Gemini 3.1 Pro 80.5
- GDP.pdf — Claude Fable 5 29.8
Long context
1 coverage gap — only one model reports these
- MRCR — Gemini 3.1 Pro 84.9
Shared benchmarks first (with Δ when both report the same scale); one-sided coverage collapses below. Pick any two models — or a champion above — and the URL becomes shareable.
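As a rough illustration of the layout rule described here, the sketch below splits two models' scores into shared rows (with Δ) and one-sided coverage gaps. The function name `compare` and the dictionary layout are hypothetical, not the site's actual implementation; the numbers are copied from the Agentic coding comparison above, and the sketch assumes shared benchmarks are already on the same scale.

```python
from typing import Dict, List, Tuple

# Scores copied from the "Agentic coding" comparison on this page.
fable_5: Dict[str, float] = {
    "SWE-bench Verified": 95.0,
    "SWE-bench Pro": 80.3,
    "FrontierCode": 29.3,
}
gemini_31_pro: Dict[str, float] = {
    "SWE-bench Verified": 80.6,
    "SWE-bench Pro": 54.2,
    "Terminal-Bench": 68.5,
    "LiveCodeBench": 2887.0,
}


def compare(
    a: Dict[str, float], b: Dict[str, float]
) -> Tuple[List[Tuple[str, float, float, float]], List[Tuple[str, str, float]]]:
    """Return (shared rows with delta = a - b, one-sided coverage gaps).

    Assumption: any benchmark present in both dicts uses the same scale.
    """
    shared = [
        (name, a[name], b[name], round(a[name] - b[name], 1))
        for name in a
        if name in b
    ]
    # A gap is a benchmark that only one of the two models reports.
    gaps = [(name, "a", a[name]) for name in a if name not in b]
    gaps += [(name, "b", b[name]) for name in b if name not in a]
    return shared, gaps


shared, gaps = compare(fable_5, gemini_31_pro)
for name, left, right, delta in shared:
    print(f"{name}: {left} vs {right} (delta {delta:+})")
print(f"{len(gaps)} coverage gaps")
```

A real comparison would also need a per-benchmark unit check before subtracting, since some cells on this page are Elo ratings rather than percentages.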
All models · benchmark matrix
Columns tagged "reported" (Claude Opus 4.8, Mistral Large 3) are auto-discovered by our weekly market-watch from each vendor's own reported numbers. They are not independently verified, and are shown when a vendor ships a model newer than the hand-cited column beside it. Full claim sets are in Vendor-reported benchmarks below.
Agentic coding
Can the model fix real bugs, ship features, and operate a dev environment end-to-end as a coding agent — the single most-watched capability in 2026 vendor launches.
| Benchmark | Fable 5 | Opus 4.8 (reported) | Opus 4.8 | Mythos 5 | GPT-5.5 | Gemini 3.1 Pro | Grok 4.3 | DeepSeek-V4-Pro (open) | Qwen3.7-Max | Kimi K2.6 (open) | GLM-5.2 (open) | Llama 4 Maverick (open) | Mistral Medium 3.5 (open) | Mistral Large 3 (reported) | Gemma 3 27B (open) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SWE-bench Verified (% resolved, pass@1) | 95 | 88.6 | 88.6 | — | — | 80.6 | — | 80.6 | 80.4 | 80.2 | — | — | 77.6 | — | — |
| SWE-bench Pro (% resolved, pass@1) | 80.3 | 69.2 | 69.2 | 77.8 | 58.6 | 54.2 | — | — | 60.6 | 58.6 | 62.1 | — | — | — | — |
| Terminal-Bench (% solved, pass@1) | — | 74.6 | 74.6 | 88 | 82.7 | 68.5 | — | 67.9 | 69.7 | 66.7 | 81 | — | — | — | — |
| LiveCodeBench (% pass@1) | — | — | — | — | — | 2887 (Elo) | — | 93.5 | — | 89.6 | — | 43.4 | — | — | 29.7 |
Tool use & agents
Beyond writing code: can the model select and chain tools, follow policy, drive a computer/browser, and complete long-horizon multi-step tasks.
| Benchmark | Fable 5 | Opus 4.8 (reported) | Opus 4.8 | Mythos 5 | GPT-5.5 | Gemini 3.1 Pro | Grok 4.3 | DeepSeek-V4-Pro (open) | Qwen3.7-Max | Kimi K2.6 (open) | GLM-5.2 (open) | Llama 4 Maverick (open) | Mistral Medium 3.5 (open) | Mistral Large 3 (reported) | Gemma 3 27B (open) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TAU-bench (% pass / pass^k) | — | — | — | — | — | 99.3 | 98 | — | — | — | — | — | — | — | — |
| OSWorld (% success) | 85 | 83.4 | 83.4 | — | 78.7 | — | — | — | — | 73.1 | — | — | — | — | — |
| BrowseComp (% accuracy) | — | — | 84.3 | — | 90.1 | 85.9 | — | — | — | 83.2 | — | — | — | — | — |
Math
Competition and research-level mathematical reasoning, increasingly reported on uncontaminated/post-cutoff problem sets.
| Benchmark | Fable 5 | Opus 4.8 (reported) | Opus 4.8 | Mythos 5 | GPT-5.5 | Gemini 3.1 Pro | Grok 4.3 | DeepSeek-V4-Pro (open) | Qwen3.7-Max | Kimi K2.6 (open) | GLM-5.2 (open) | Llama 4 Maverick (open) | Mistral Medium 3.5 (open) | Mistral Large 3 (reported) | Gemma 3 27B (open) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HMMT (% accuracy, pass@1) | — | — | — | — | — | — | — | — | 97.1 | 92.7 | — | — | — | — | — |
| MATH (% accuracy) | — | — | — | — | — | — | — | — | — | — | — | 61.2 | — | 93.6 | 89 |
Science & reasoning
Expert-level, Google-proof reasoning across the sciences and broad academia — the benchmarks vendors point to when claiming 'PhD-level' or 'frontier' reasoning.
| Benchmark | Fable 5 | Opus 4.8 (reported) | Opus 4.8 | Mythos 5 | GPT-5.5 | Gemini 3.1 Pro | Grok 4.3 | DeepSeek-V4-Pro (open) | Qwen3.7-Max | Kimi K2.6 (open) | GLM-5.2 (open) | Llama 4 Maverick (open) | Mistral Medium 3.5 (open) | Mistral Large 3 (reported) | Gemma 3 27B (open) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPQA Diamond (% accuracy) | 92.6 | 93.6 | 93.6 | — | 93.6 | 94.3 | 90.1 | 90.1 | 92.4 | 90.5 | 91.2 | 69.8 | — | — | 42.4 |
| Humanity's Last Exam (% accuracy) | — | 49.8 | 57.9 | 64.5 | 57.2 | 44.4 | — | 37.7 | 41.4 | 54 | 40.5 | — | — | — | — |
| ARC-AGI-2 (% solved) | — | — | — | — | 85 | 77.1 | — | — | — | — | — | — | — | — | — |
General knowledge
Broad multi-subject factual and reasoning coverage; the classic 'how much does it know' bucket, now reported via the harder Pro variant since base MMLU is saturated.
Multimodal
Vision + language understanding and visual reasoning — how well the model interprets images, diagrams, charts and figures.
Long context
Retrieval and reasoning quality as context length grows into the hundreds-of-thousands / millions of tokens — beyond simple needle-in-a-haystack.
Human preference
Aggregate real-user preference from blind head-to-head comparisons — the closest thing to a 'do people actually like the answers' metric, and the one number vendors love to top.
Showing 13 flagship models across the headline benchmarks; 53 models and 84 benchmarks tracked in total from 60 primary & aggregator sources.
Claude Mythos 5 numbers are limited (access-restricted preview).
Numbers are published facts, restated with a per-cell source link; vendor benchmark charts are linked to their source, not rehosted.
Sources include: 9to5Google (Gemini 3 Flash launch coverage) · AIFire (citing OpenAI GPT-5.2 release) · Artificial Analysis · BenchLM.ai (citing LMArena) · BinaryVerse AI (xAI official figures) · BuildFastWithAI (citing OpenAI GPT-5.5 launch) · Caylent · Codersera (reporting Moonshot's figures) · DataCamp · DataCamp (reproducing Meta's Llama 4 launch chart) · DeepSeek-AI (arXiv 2512.02556) · DeepSeek-AI (Hugging Face model card) · Google (official Gemini 2.5 launch blog) · Google (official Gemini 3 Flash launch blog) · Google (official Gemini 3 launch blog) · Google DeepMind (Gemini 3.1 Pro model card) · +36 more.
Vendor-reported benchmarks
Numbers as claimed by the vendor on their own model/system card — not independently verified and often measured with a favourable harness. We track each vendor's claims over time and link to the source; cross-check against the cited matrix above.
| SWE-bench Verified | 88.6% |
| SWE-bench Pro | 69.2% |
| Terminal-Bench 2.1 · Terminus-2 public harness | 74.6% |
| GPQA Diamond | 93.6% |
| Humanity's Last Exam · without tools | 49.8% |
| Humanity's Last Exam · with tools | 57.9% |
| OSWorld-Verified | 83.4% |
| Online-Mind2Web · browser agent benchmark | 84% |
| USAMO 2026 · Olympic-level mathematical proofs | 96.7% |
| MCP-Atlas · multi-step tool-calling | 82.2% |
| GDPval-AA · economically valuable knowledge work | 1890 Elo |
| Terminal-Bench 2.0 | 82.7% |
| SWE-Bench Pro | 58.6% |
| SWE-Bench Verified | 88.7% |
| GDPval · 44 occupations | 84.9% |
| OSWorld-Verified | 78.7% |
| ARC-AGI-2 · Verified | 85.0% |
| FrontierMath Tier 4 | 35.4% |
| Expert-SWE · internal frontier eval | 73.1% |
| MRCR v2 (1M tokens) · long-context retrieval at 512K-1M tokens | 74.0% |
| CyberGym | 81.8% |
| Tau2-Bench Telecom · without prompt tuning | 98.0% |
| Humanity's Last Exam (no tools) | 41.4% |
| GPQA Diamond · Graduate-level science reasoning; from Artificial Analysis and multiple sources | 90.1% |
| Tau-Bench (τ²-Bench) · Tool-use and agentic benchmark | 97.7% |
| GDPval-AA · Agentic task performance; xAI-reported improvement of 321 points from Grok 4.20 | 1500 Elo |
| Artificial Analysis Intelligence Index · High reasoning mode on v4.1; composite of 9 evaluations | 38 index |
| SciCode · Code generation and problem-solving | 47.3% |
| GPQA Diamond | 94.3% |
| ARC-AGI-2 | 77.1% |
| MATH | 95.1% |
| SWE-Bench Verified | 80.6% |
| Humanity's Last Exam | 44.4% |
| Terminal-Bench 2.0 | 68.5% |
| LiveCodeBench Pro | 2887 Elo |
| BrowseComp | 85.9% |
| MCP Atlas | 69.2% |
| τ-Bench Telecom | 99.3% |
| APEX-Agents | 33.5% |
| MMMLU | 92.6% |
| GPQA Diamond | 92.4% |
| SWE-Bench Pro | 60.6 pass@1 |
| Terminal-Bench 2.0-Terminus | 69.7 % |
| Humanity's Last Exam (HLE) | 41.4% |
| HMMT 2026 Feb | 97.1 % |
| IMOAnswerBench | 90 % |
| Apex | 44.5 % |
| MCP-Atlas | 76.4 % |
| MCP-Mark | 60.8 % |
| SWE-Bench Verified | 80.4 pass@1 |
| LiveCodeBench | 91.6 % |
| SpreadSheetBench-v1 | 87 % |
| GPQA Diamond · V4-Pro-Max | 90.1% |
| SWE-bench Verified · V4-Pro-Max | 80.6% |
| MMLU-Pro · V4-Pro-Max | 87.5% |
| LiveCodeBench · V4-Pro-Max, Pass@1 | 93.5% |
| Humanity's Last Exam · V4-Pro-Max, Pass@1 | 37.7% |
| AIME 2025 · V4-Pro-Max, Pass@1 on HMMT 2026 Feb | 95.2% |
| Codeforces · V4-Pro-Max | 3206 Rating |
| Terminal-Bench 2.0 · V4-Pro-Max | 67.9% |
| SimpleQA-Verified · V4-Pro-Max | 57.9% |
| SWE-Bench Multilingual · V4-Pro-Max | 76.2% |
| IMOAnswerBench · V4-Pro-Max | 89.8% |
| MMLU Pro | 80.5% |
| GPQA Diamond | 69.8% |
| LiveCodeBench · averaged over multiple generations | 43.4 pass@1 |
| HumanEval | 86.4% |
| Multilingual MMLU | 84.6% |
| GSM8K | 95.2% |
| MATH-500 | 85.3% |
| SWE-bench Verified | 74.2% |
| MMLU-Pro · Independent evaluation via LayerLens/Atlas | 73.11% |
| MATH-500 · Independent evaluation via LayerLens/Atlas | 93.60% |
| HumanEval · Python code generation | 90.24% |
| AGIEval English · Academic multiple-choice knowledge | 74.00% |
| SWE-Bench Pro · Thinking mode enabled | 58.6% |
| SWE-Bench Verified · Thinking mode enabled | 80.2% |
| Humanity's Last Exam (with tools) · Thinking mode enabled | 54.0% |
| AIME 2026 · Thinking mode enabled, avg@32 | 96.4% |
| GPQA-Diamond · Thinking mode enabled, avg@8 | 90.5% |
| LiveCodeBench v6 · Thinking mode enabled | 89.6% |
| Terminal-Bench 2.0 · Terminus-2 framework, thinking mode | 66.7% |
| BrowseComp · Standard mode | 83.2% |
| BrowseComp (Agent Swarm) · Agent Swarm mode | 86.3% |
| DeepSearchQA (F1) · F1 score metric | 92.5% |
| Toolathlon · Thinking mode enabled | 50.0% |
| SWE-bench Pro · 400K context, temp=1, top_p=1, max_tokens=32k | 62.1 % |
| Terminal-Bench 2.1 · Claude Code 2.1.167, temp=1.0, top_p=0.95, 5 runs averaged | 81.0 % |
| FrontierSWE (Dominance) · 1M context, max effort level, 128K max output tokens | 74.4 % |
| PostTrainBench · 1M context, max effort level, 128K max output tokens | 34.3 % |
| SWE-Marathon · 1M context, max effort level, 128K max output tokens | 13.0 % |
| MCP-Atlas · Public set, think mode, 500-task subset, 10-min timeout | 76.8 % |
| Tool-Decathlon · Official evaluation service, max_token=128K | 48.2 % |
| AIME 2026 · Math competition benchmark | 99.2 % |
| GPQA-Diamond · Graduate-level science Q&A | 91.2 % |
| Humanity's Last Exam (with tools) · With external tools enabled | 54.7 % |
| NL2Repo · Repository generation task, 400K context | 58.2 % |
| DeepSWE · Official evaluation framework, 400K context | 46.2 % |
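The vendor-reported cells above mix units ("88.6%", "69.7 %", "1890 Elo", "60.6 pass@1"), so any cross-check against the cited matrix has to separate the number from its unit first. The sketch below is one possible way to do that; `Claim`, `parse_claim` and `mismatches` are hypothetical names, the tolerance is an arbitrary assumption, and the sample values are taken from the "Opus 4.8 (reported)" and "Opus 4.8" columns of the matrix.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Tuple


class Claim(NamedTuple):
    value: float
    unit: str  # e.g. "%", "Elo", "Rating", "index", "pass@1", or "" if absent


# A number, optional whitespace, then an optional unit token.
_CELL = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(%|[A-Za-z][A-Za-z0-9@]*)?\s*$")


def parse_claim(cell: str) -> Optional[Claim]:
    """Parse a table cell such as '69.7 %' or '1890 Elo'; None if unparseable."""
    m = _CELL.match(cell)
    if m is None:
        return None
    return Claim(float(m.group(1)), m.group(2) or "")


def mismatches(
    reported: Dict[str, str], cited: Dict[str, float], tol: float = 0.05
) -> List[Tuple[str, float, float]]:
    """List benchmarks where the vendor-reported value differs from the cited one.

    Assumption: cited values are percentages, so only '%' claims are compared.
    """
    out = []
    for name, cell in reported.items():
        claim = parse_claim(cell)
        if claim is None or claim.unit != "%" or name not in cited:
            continue
        if abs(claim.value - cited[name]) > tol:
            out.append((name, claim.value, cited[name]))
    return out


reported = {
    "SWE-bench Verified": "88.6%",
    "SWE-bench Pro": "69.2%",
    "Humanity's Last Exam": "49.8%",
}
cited = {
    "SWE-bench Verified": 88.6,
    "SWE-bench Pro": 69.2,
    "Humanity's Last Exam": 57.9,
}
flagged = mismatches(reported, cited)
print(flagged)
```

A flagged row is not necessarily an error: the two figures may come from different harness settings (for example with and without tools), which is why each cell links to its source.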
Model changes
New releases, deprecations, and benchmark score moves we’ve recorded — newest first.
- xAI reported benchmarks updated
  Grok 4.3: 5 benchmark claims (via web search)
- LMArena (text) scores changed
  Arena Elo (text, overall) updated.
- NVIDIA Nemotron models changed
  Added nvidia/MiniMax-M2.7-DFlash model with other license type, available until 2026-06-16.
- Meta reported benchmarks updated
  Llama 4 Maverick: 8 benchmark claims (via web search)
- deepseek-coder-1.3b-base: Epoch Capabilities Index (ECI) ↑ 63.596 → 63.813
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- Cerebras-GPT-13B: Epoch Capabilities Index (ECI) ↑ 82.645 → 82.79
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- starcoder2-3b: Epoch Capabilities Index (ECI) ↑ 88.205 → 88.329
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- deepseek-coder-6.7b-base: Epoch Capabilities Index (ECI) ↑ 89.062 → 89.184
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- dolly-v2-12b: Epoch Capabilities Index (ECI) ↑ 89.113 → 89.235
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- Baichuan-7B: Epoch Capabilities Index (ECI) ↑ 89.891 → 90.006
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- phi-1_5: Epoch Capabilities Index (ECI) ↑ 91.015 → 91.132
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- xgen-7b-8k-base: Epoch Capabilities Index (ECI) ↑ 92.881 → 92.989
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- starcoder2-7b: Epoch Capabilities Index (ECI) ↑ 93.025 → 93.132
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- gemma-2b: Epoch Capabilities Index (ECI) ↑ 93.684 → 93.789
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- mpt-7b: Epoch Capabilities Index (ECI) ↑ 94.078 → 94.181
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- falcon-7b: Epoch Capabilities Index (ECI) ↑ 94.588 → 94.689
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- deepseek-coder-33b-base: Epoch Capabilities Index (ECI) ↑ 95.816 → 95.912
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- Baichuan-2-7B-Base: Epoch Capabilities Index (ECI) ↑ 95.829 → 95.924
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- LLaMA-7B: Epoch Capabilities Index (ECI) ↑ 96.2 → 96.296
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- Llama-2-7b: Epoch Capabilities Index (ECI) ↑ 98.591 → 98.678
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- LLaMA-13B: Epoch Capabilities Index (ECI) ↑ 100.124 → 100.207
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- mpt-30b: Epoch Capabilities Index (ECI) ↑ 100.202 → 100.284
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- mpt-30b-instruct: Epoch Capabilities Index (ECI) ↑ 100.202 → 100.284
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- INTELLECT-1-Instruct: Epoch Capabilities Index (ECI) ↑ 100.403 → 100.484
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
- Qwen2.5-Coder-1.5B: Epoch Capabilities Index (ECI) ↑ 102.557 → 102.629
  Same model id, score moved on Epoch Capabilities Index: a silent re-evaluation or model swap.
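Entries like the ECI moves in this list can be derived by diffing two leaderboard snapshots keyed by model id. The sketch below shows one way that might work; `ScoreMove` and `score_moves` are hypothetical names, the snapshot format is an assumption, and the sample values are the first three ECI moves recorded in this list.

```python
from typing import Dict, List, NamedTuple


class ScoreMove(NamedTuple):
    model_id: str
    old: float
    new: float
    delta: float


def score_moves(
    previous: Dict[str, float], current: Dict[str, float], min_delta: float = 0.0
) -> List[ScoreMove]:
    """Report models whose score changed between two snapshots of one benchmark.

    Models missing from the previous snapshot are new releases, not moves,
    so they are skipped here.
    """
    moves = []
    for model_id, new in current.items():
        old = previous.get(model_id)
        if old is None:
            continue
        delta = round(new - old, 3)  # leaderboard values carry 3 decimals
        if abs(delta) > min_delta:
            moves.append(ScoreMove(model_id, old, new, delta))
    return moves


previous = {
    "deepseek-coder-1.3b-base": 63.596,
    "Cerebras-GPT-13B": 82.645,
    "starcoder2-3b": 88.205,
}
current = {
    "deepseek-coder-1.3b-base": 63.813,
    "Cerebras-GPT-13B": 82.79,
    "starcoder2-3b": 88.329,
}
for move in score_moves(previous, current):
    print(f"{move.model_id}: {move.old} -> {move.new} ({move.delta:+})")
```

Raising `min_delta` filters out small shifts; when every model moves by a similar small amount, as in the entries here, a benchmark-wide recalibration is one possible explanation alongside a per-model re-evaluation.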