Models

Frontier and open-weight models compared by capability: columns are flagship models, rows are the headline benchmarks for each capability. Every number is the published figure, restated with a link to its source, as of 2026-06-14. The curated matrix was refreshed 20d ago; live leaderboards auto-track on a 6-hour cron.

We cite the benchmark authorities (Epoch AI, LMArena and Artificial Analysis) rather than re-rank them. ChangeRadar's job is the changes: what gets released, repriced, deprecated, or quietly shifts behind a stable model id.

Vendor champions · best flagship per vendor · GPQA Diamond

Google · Gemini 3.1 Pro · 94.3 GPQA Diamond
Anthropic [reported] · Claude Opus 4.8 · 93.6 GPQA Diamond · Source ↗
OpenAI · GPT-5.5 · 93.6 GPQA Diamond
Mistral AI [reported] · Mistral Large 3 · 93.6 MATH · Source ↗
Alibaba · Qwen3.7-Max · 92.4 GPQA Diamond
Zhipu AI (Z.ai) [open] · GLM-5.2 · 91.2 GPQA Diamond
Moonshot AI [open] · Kimi K2.6 · 90.5 GPQA Diamond
xAI · Grok 4.3 · 90.1 GPQA Diamond
DeepSeek [open] · DeepSeek-V4-Pro · 90.1 GPQA Diamond
Meta [open] · Llama 4 Maverick · 69.8 GPQA Diamond

Each vendor's most recent publicly available flagship, ordered by score (highest first), with Google, OpenAI, Anthropic and xAI highlighted. The weekly market-watch surfaces new releases automatically; a card tagged reported is the latest release, shown with vendor-reported scores (linked to source) until we independently cite it. Score is GPQA Diamond unless the card names another benchmark (Mistral Large 3 shows MATH); every number links to its source.
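The ordering described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ChangeRadar's actual code: the `Champion` record and the `order_champions` helper are hypothetical names, and the rows are a subset of the cards above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Champion:
    """One vendor card (hypothetical record, not the site's schema)."""
    vendor: str
    model: str
    score: float
    benchmark: str = "GPQA Diamond"
    reported: bool = False  # True when the score is vendor-reported only


def order_champions(cards):
    """Sort cards by score, highest first; equal scores keep their input order."""
    return sorted(cards, key=lambda card: card.score, reverse=True)


# A subset of the cards above, deliberately shuffled.
cards = [
    Champion("xAI", "Grok 4.3", 90.1),
    Champion("Meta", "Llama 4 Maverick", 69.8),
    Champion("Anthropic", "Claude Opus 4.8", 93.6, reported=True),
    Champion("Google", "Gemini 3.1 Pro", 94.3),
    Champion("Alibaba", "Qwen3.7-Max", 92.4),
]

ordered = order_champions(cards)
for card in ordered:
    tag = " (reported)" if card.reported else ""
    print(f"{card.vendor}{tag}: {card.model} {card.score} {card.benchmark}")
```

Sorting on the raw score only makes sense when every card uses the same benchmark; a card scored on a different benchmark, such as the MATH card, would need separate handling.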

Compare two models

Claude Fable 5: 2 wins · Gemini 3.1 Pro: 1 win · 0 ties

Agentic coding

Benchmark · Claude Fable 5 · Gemini 3.1 Pro · Δ
SWE-bench Verified (% resolved, pass@1) · 95 · 80.6 · +14.4
SWE-bench Pro (% resolved, pass@1) · 80.3 · 54.2 · +26.1
3 coverage gaps — only one model reports these
  • Terminal-Bench — Gemini 3.1 Pro 68.5
  • LiveCodeBench — Gemini 3.1 Pro 2887
  • FrontierCode — Claude Fable 5 29.3

Tool use & agents

5 coverage gaps — only one model reports these
  • TAU-bench — Gemini 3.1 Pro 99.3
  • OSWorld — Claude Fable 5 85
  • BrowseComp — Gemini 3.1 Pro 85.9
  • GDPval-AA — Claude Fable 5 1932
  • MCP Atlas — Gemini 3.1 Pro 69.2

Science & reasoning

Benchmark · Claude Fable 5 · Gemini 3.1 Pro · Δ
GPQA Diamond (% accuracy) · 92.6 · 94.3 · -1.7
2 coverage gaps — only one model reports these
  • Humanity's Last Exam — Gemini 3.1 Pro 44.4
  • ARC-AGI-2 — Gemini 3.1 Pro 77.1

General knowledge

1 coverage gap — only one model reports these
  • MMMLU — Gemini 3.1 Pro 92.6

Multimodal

2 coverage gaps — only one model reports these
  • MMMU — Gemini 3.1 Pro 80.5
  • GDP.pdf — Claude Fable 5 29.8

Long context

1 coverage gap — only one model reports these
  • MRCR — Gemini 3.1 Pro 84.9

Shared benchmarks first (with Δ when both report the same scale); one-sided coverage collapses below. Pick any two models — or a champion above — and the URL becomes shareable.
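The comparison logic described above (shared benchmarks with a Δ, one-sided coverage listed separately, and a win/tie tally) can be sketched as follows. This is an illustrative sketch, not the site's implementation: the `compare` function is a hypothetical name, it assumes both scores for a shared benchmark are on the same scale, and the inputs are a subset of the figures in the panel above.

```python
def compare(a, b):
    """Split two {benchmark: score} dicts into shared deltas and coverage gaps.

    Assumes scores for a shared benchmark are on the same scale.
    """
    # Delta is first model minus second, rounded to one decimal place.
    shared = {name: round(a[name] - b[name], 1) for name in a if name in b}
    # Benchmarks reported by exactly one of the two models.
    gaps = sorted(set(a) ^ set(b))
    wins_a = sum(1 for delta in shared.values() if delta > 0)
    wins_b = sum(1 for delta in shared.values() if delta < 0)
    ties = sum(1 for delta in shared.values() if delta == 0)
    return shared, gaps, (wins_a, wins_b, ties)


fable = {
    "SWE-bench Verified": 95,
    "SWE-bench Pro": 80.3,
    "GPQA Diamond": 92.6,
    "OSWorld": 85,
}
gemini = {
    "SWE-bench Verified": 80.6,
    "SWE-bench Pro": 54.2,
    "GPQA Diamond": 94.3,
    "MRCR": 84.9,
}

shared, gaps, tally = compare(fable, gemini)
print(shared)  # deltas for the three shared benchmarks
print(gaps)    # benchmarks only one model reports
print(tally)   # (wins for first model, wins for second model, ties)
```

Rounding the difference to one decimal place avoids floating-point noise in the displayed Δ.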

All models · benchmark matrix

15 of 15 columns shown

Columns tagged reported (Claude Opus 4.8, Mistral Large 3) are auto-discovered by our weekly market-watch from each vendor's own reported numbers. They are not independently verified, and they appear when a vendor ships a model newer than the hand-cited column beside it. Full claim sets are in Vendor-reported benchmarks below.

Agentic coding

Can the model fix real bugs, ship features, and operate a dev environment end-to-end as a coding agent — the single most-watched capability in 2026 vendor launches.

Benchmark · Fable 5 · Opus 4.8 [reported] · Opus 4.8 · Mythos 5 · GPT-5.5 · Gemini 3.1 Pro · Grok 4.3 · DeepSeek-V4-Pro [open] · Qwen3.7-Max · Kimi K2.6 [open] · GLM-5.2 [open] · Llama 4 Maverick [open] · Mistral Medium 3.5 [open] · Mistral Large 3 [reported] · Gemma 3 27B [open]
SWE-bench Verified (% resolved, pass@1): 95 88.6 88.6 80.6 80.6 80.4 80.2 77.6
SWE-bench Pro (% resolved, pass@1): 80.3 69.2 69.2 77.8 58.6 54.2 60.6 58.6 62.1
Terminal-Bench (% solved, pass@1): 74.6 74.6 88 82.7 68.5 67.9 69.7 66.7 81
LiveCodeBench (% pass@1; the 2887 entry is Gemini 3.1 Pro's LiveCodeBench Pro Elo, not a percentage): 2887 93.5 89.6 43.4 29.7

Tool use & agents

Beyond writing code: can the model select and chain tools, follow policy, drive a computer/browser, and complete long-horizon multi-step tasks.

Benchmark · Fable 5 · Opus 4.8 [reported] · Opus 4.8 · Mythos 5 · GPT-5.5 · Gemini 3.1 Pro · Grok 4.3 · DeepSeek-V4-Pro [open] · Qwen3.7-Max · Kimi K2.6 [open] · GLM-5.2 [open] · Llama 4 Maverick [open] · Mistral Medium 3.5 [open] · Mistral Large 3 [reported] · Gemma 3 27B [open]
TAU-bench (% pass / pass^k): 99.3 98
OSWorld (% success): 85 83.4 83.4 78.7 73.1
BrowseComp (% accuracy): 84.3 90.1 85.9 83.2

Math

Competition and research-level mathematical reasoning, increasingly reported on uncontaminated/post-cutoff problem sets.

Benchmark · Fable 5 · Opus 4.8 [reported] · Opus 4.8 · Mythos 5 · GPT-5.5 · Gemini 3.1 Pro · Grok 4.3 · DeepSeek-V4-Pro [open] · Qwen3.7-Max · Kimi K2.6 [open] · GLM-5.2 [open] · Llama 4 Maverick [open] · Mistral Medium 3.5 [open] · Mistral Large 3 [reported] · Gemma 3 27B [open]
HMMT (% accuracy, pass@1): 97.1 92.7
MATH (% accuracy): 61.2 93.6 89

Science & reasoning

Expert-level, Google-proof reasoning across the sciences and broad academia — the benchmarks vendors point to when claiming 'PhD-level' or 'frontier' reasoning.

Benchmark · Fable 5 · Opus 4.8 [reported] · Opus 4.8 · Mythos 5 · GPT-5.5 · Gemini 3.1 Pro · Grok 4.3 · DeepSeek-V4-Pro [open] · Qwen3.7-Max · Kimi K2.6 [open] · GLM-5.2 [open] · Llama 4 Maverick [open] · Mistral Medium 3.5 [open] · Mistral Large 3 [reported] · Gemma 3 27B [open]
GPQA Diamond (% accuracy): 92.6 93.6 93.6 93.6 94.3 90.1 90.1 92.4 90.5 91.2 69.8 42.4
Humanity's Last Exam (% accuracy): 49.8 57.9 64.5 57.2 44.4 37.7 41.4 54 40.5
ARC-AGI-2 (% solved): 85 77.1

General knowledge

Broad multi-subject factual and reasoning coverage; the classic 'how much does it know' bucket, now reported via the harder Pro variant since base MMLU is saturated.

Benchmark · Fable 5 · Opus 4.8 [reported] · Opus 4.8 · Mythos 5 · GPT-5.5 · Gemini 3.1 Pro · Grok 4.3 · DeepSeek-V4-Pro [open] · Qwen3.7-Max · Kimi K2.6 [open] · GLM-5.2 [open] · Llama 4 Maverick [open] · Mistral Medium 3.5 [open] · Mistral Large 3 [reported] · Gemma 3 27B [open]
MMLU-Pro (% accuracy): 87.5 80.5 73.11 67.5

Multimodal

Vision + language understanding and visual reasoning — how well the model interprets images, diagrams, charts and figures.

Benchmark · Fable 5 · Opus 4.8 [reported] · Opus 4.8 · Mythos 5 · GPT-5.5 · Gemini 3.1 Pro · Grok 4.3 · DeepSeek-V4-Pro [open] · Qwen3.7-Max · Kimi K2.6 [open] · GLM-5.2 [open] · Llama 4 Maverick [open] · Mistral Medium 3.5 [open] · Mistral Large 3 [reported] · Gemma 3 27B [open]
MMMU (% accuracy): 83.2 80.5 79.4 73.4 64.9

Long context

Retrieval and reasoning quality as context length grows into the hundreds-of-thousands / millions of tokens — beyond simple needle-in-a-haystack.

Benchmark · Fable 5 · Opus 4.8 [reported] · Opus 4.8 · Mythos 5 · GPT-5.5 · Gemini 3.1 Pro · Grok 4.3 · DeepSeek-V4-Pro [open] · Qwen3.7-Max · Kimi K2.6 [open] · GLM-5.2 [open] · Llama 4 Maverick [open] · Mistral Medium 3.5 [open] · Mistral Large 3 [reported] · Gemma 3 27B [open]
MRCR (% accuracy): 74 84.9

Human preference

Aggregate real-user preference from blind head-to-head comparisons — the closest thing to a 'do people actually like the answers' metric, and the one number vendors love to top.

Benchmark · Fable 5 · Opus 4.8 [reported] · Opus 4.8 · Mythos 5 · GPT-5.5 · Gemini 3.1 Pro · Grok 4.3 · DeepSeek-V4-Pro [open] · Qwen3.7-Max · Kimi K2.6 [open] · GLM-5.2 [open] · Llama 4 Maverick [open] · Mistral Medium 3.5 [open] · Mistral Large 3 [reported] · Gemma 3 27B [open]
LMArena Elo (Elo): 1458 1474 1466 1417 1338

Showing 13 flagship models across the headline benchmarks; 53 models and 84 benchmarks tracked in total from 60 primary & aggregator sources. Claude Mythos 5 numbers are limited (access-restricted preview). Numbers are published facts, restated with a per-cell source link; vendor benchmark charts are linked to their source, not rehosted.
Sources include: 9to5Google (Gemini 3 Flash launch coverage) · AIFire (citing OpenAI GPT-5.2 release) · Artificial Analysis · BenchLM.ai (citing LMArena) · BinaryVerse AI (xAI official figures) · BuildFastWithAI (citing OpenAI GPT-5.5 launch) · Caylent · Codersera (reporting Moonshot's figures) · DataCamp · DataCamp (reproducing Meta's Llama 4 launch chart) · DeepSeek-AI (arXiv 2512.02556) · DeepSeek-AI (Hugging Face model card) · Google (official Gemini 2.5 launch blog) · Google (official Gemini 3 Flash launch blog) · Google (official Gemini 3 launch blog) · Google DeepMind (Gemini 3.1 Pro model card) · +36 more.

Vendor-reported benchmarks

Numbers as claimed by the vendor on their own model/system card — not independently verified and often measured with a favourable harness. We track each vendor's claims over time and link to the source; cross-check against the cited matrix above.

Claude Opus 4.8 Anthropic
SWE-bench Verified 88.6%
SWE-bench Pro 69.2%
Terminal-Bench 2.1 · Terminus-2 public harness 74.6%
GPQA Diamond 93.6%
Humanity's Last Exam · without tools 49.8%
Humanity's Last Exam · with tools 57.9%
OSWorld-Verified 83.4%
Online-Mind2Web · browser agent benchmark 84%
USAMO 2026 · Olympic-level mathematical proofs 96.7%
MCP-Atlas · multi-step tool-calling 82.2%
GDPval-AA · economically valuable knowledge work 1890 Elo
vendor card ↗
GPT-5.5 OpenAI
Terminal-Bench 2.0 82.7%
SWE-Bench Pro 58.6%
SWE-Bench Verified 88.7%
GDPval · 44 occupations 84.9%
OSWorld-Verified 78.7%
ARC-AGI-2 · Verified 85.0%
FrontierMath Tier 4 35.4%
Expert-SWE · internal frontier eval 73.1%
MRCR v2 (1M tokens) · long-context retrieval at 512K-1M tokens 74.0%
CyberGym 81.8%
Tau2-Bench Telecom · without prompt tuning 98.0%
Humanity's Last Exam (no tools) 41.4%
vendor card ↗
Grok 4.3 xAI
GPQA Diamond · Graduate-level science reasoning; from Artificial Analysis and multiple sources 90.1%
Tau-Bench (τ²-Bench) · Tool-use and agentic benchmark 97.7%
GDPval-AA · Agentic task performance; xAI-reported improvement of 321 points from Grok 4.20 1500 Elo
Artificial Analysis Intelligence Index · High reasoning mode on v4.1; composite of 9 evaluations 38 index
SciCode · Code generation and problem-solving 47.3%
vendor card ↗
Gemini 3.1 Pro Google
GPQA Diamond 94.3%
ARC-AGI-2 77.1%
MATH 95.1%
SWE-Bench Verified 80.6%
Humanity's Last Exam 44.4%
Terminal-Bench 2.0 68.5%
LiveCodeBench Pro 2887 Elo
BrowseComp 85.9%
MCP Atlas 69.2%
τ-Bench Telecom 99.3%
APEX-Agents 33.5%
MMMLU 92.6%
vendor card ↗
Qwen3.7-Max Alibaba
GPQA Diamond 92.4%
SWE-Bench Pro 60.6 pass@1
Terminal-Bench 2.0-Terminus 69.7%
Humanity's Last Exam (HLE) 41.4%
HMMT 2026 Feb 97.1%
IMOAnswerBench 90%
Apex 44.5%
MCP-Atlas 76.4%
MCP-Mark 60.8%
SWE-Bench Verified 80.4 pass@1
LiveCodeBench 91.6%
SpreadSheetBench-v1 87%
vendor card ↗
DeepSeek-V4-Pro DeepSeek
GPQA Diamond · V4-Pro-Max 90.1%
SWE-bench Verified · V4-Pro-Max 80.6%
MMLU-Pro · V4-Pro-Max 87.5%
LiveCodeBench · V4-Pro-Max, Pass@1 93.5%
Humanity's Last Exam · V4-Pro-Max, Pass@1 37.7%
AIME 2025 · V4-Pro-Max, Pass@1 on HMMT 2026 Feb 95.2%
Codeforces · V4-Pro-Max 3206 Rating
Terminal-Bench 2.0 · V4-Pro-Max 67.9%
SimpleQA-Verified · V4-Pro-Max 57.9%
SWE-Bench Multilingual · V4-Pro-Max 76.2%
IMOAnswerBench · V4-Pro-Max 89.8%
vendor card ↗
Llama 4 Maverick Meta
MMLU Pro 80.5%
GPQA Diamond 69.8%
LiveCodeBench · averaged over multiple generations 43.4 pass@1
HumanEval 86.4%
Multilingual MMLU 84.6%
GSM8K 95.2%
MATH-500 85.3%
SWE-bench Verified 74.2%
vendor card ↗
Mistral Large 3 Mistral AI
MMLU-Pro · Independent evaluation via LayerLens/Atlas 73.11%
MATH-500 · Independent evaluation via LayerLens/Atlas 93.60%
HumanEval · Python code generation 90.24%
AGIEval English · Academic multiple-choice knowledge 74.00%
vendor card ↗
Kimi K2.6 Moonshot AI
SWE-Bench Pro · Thinking mode enabled 58.6%
SWE-Bench Verified · Thinking mode enabled 80.2%
Humanity's Last Exam (with tools) · Thinking mode enabled 54.0%
AIME 2026 · Thinking mode enabled, avg@32 96.4%
GPQA-Diamond · Thinking mode enabled, avg@8 90.5%
LiveCodeBench v6 · Thinking mode enabled 89.6%
Terminal-Bench 2.0 · Terminus-2 framework, thinking mode 66.7%
BrowseComp · Standard mode 83.2%
BrowseComp (Agent Swarm) · Agent Swarm mode 86.3%
DeepSearchQA (F1) · F1 score metric 92.5%
Toolathlon · Thinking mode enabled 50.0%
vendor card ↗
GLM-5.2 Zhipu AI (Z.ai)
SWE-bench Pro · 400K context, temp=1, top_p=1, max_tokens=32k 62.1%
Terminal-Bench 2.1 · Claude Code 2.1.167, temp=1.0, top_p=0.95, 5 runs averaged 81.0%
FrontierSWE (Dominance) · 1M context, max effort level, 128K max output tokens 74.4%
PostTrainBench · 1M context, max effort level, 128K max output tokens 34.3%
SWE-Marathon · 1M context, max effort level, 128K max output tokens 13.0%
MCP-Atlas · Public set, think mode, 500-task subset, 10-min timeout 76.8%
Tool-Decathlon · Official evaluation service, max_token=128K 48.2%
AIME 2026 · Math competition benchmark 99.2%
GPQA-Diamond · Graduate-level science Q&A 91.2%
Humanity's Last Exam (with tools) · With external tools enabled 54.7%
NL2Repo · Repository generation task, 400K context 58.2%
DeepSWE · Official evaluation framework, 400K context 46.2%
vendor card ↗
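
The cross-check against the cited matrix suggested at the top of this section can be sketched as below. It is a rough illustration only: `normalize`, `cross_check` and the 0.05 tolerance are hypothetical choices, not part of ChangeRadar. Name normalization is included because the cards above spell the same benchmark differently (for example "SWE-bench Verified" and "SWE-Bench Verified"). The first two dicts use Gemini 3.1 Pro figures from this page; the drifted value is invented purely to show a mismatch.

```python
import re


def normalize(name):
    """Fold case and treat hyphens as spaces so 'SWE-Bench' matches 'SWE-bench'."""
    return re.sub(r"[\s-]+", " ", name).strip().casefold()


def cross_check(cited, claimed, tolerance=0.05):
    """Return (benchmark, cited, claimed) for shared benchmarks that disagree."""
    cited_n = {normalize(k): v for k, v in cited.items()}
    claimed_n = {normalize(k): v for k, v in claimed.items()}
    shared = sorted(cited_n.keys() & claimed_n.keys())
    return [
        (name, cited_n[name], claimed_n[name])
        for name in shared
        if abs(cited_n[name] - claimed_n[name]) > tolerance
    ]


# Gemini 3.1 Pro: figures from the compare panel vs. the vendor card.
cited = {"SWE-bench Verified": 80.6, "GPQA Diamond": 94.3, "MCP Atlas": 69.2}
claimed = {"SWE-Bench Verified": 80.6, "GPQA Diamond": 94.3, "MCP Atlas": 69.2}
matches = cross_check(cited, claimed)

# Hypothetical drifted claim, invented for illustration only.
drifted = dict(claimed, **{"GPQA Diamond": 94.0})
mismatches = cross_check(cited, drifted)

print(matches)
print(mismatches)
```

A real cross-check would also need to compare harness and variant notes (for example "with tools" versus "without tools"), which this sketch ignores.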

Model changes

New releases, deprecations, and benchmark score moves we’ve recorded — newest first.