Models

Frontier & open-weight models compared by capability — columns are flagship models, rows are the headline benchmarks that measure each one. Every number is the published figure, restated with a link to the source it came from · as of 2026-06-14. Curated matrix refreshed 20d ago; live leaderboards auto-track on the 6-hour cron.

We cite the benchmark authorities rather than re-rank them — Epoch AI, LMArena and Artificial Analysis. ChangeRadar's job is the changes: what gets released, repriced, deprecated, or quietly shifts behind a stable model id.

Vendor champions · best flagship per vendor · GPQA Diamond

Google · Gemini 3.1 Pro · 94.3 GPQA Diamond
Anthropic reported · Claude Opus 4.8 · 93.6 GPQA Diamond · Source ↗
OpenAI · GPT-5.5 · 93.6 GPQA Diamond
Mistral AI reported · Mistral Large 3 · 93.6 MATH · Source ↗
Alibaba · Qwen3.7-Max · 92.4 GPQA Diamond
Zhipu AI (Z.ai) open · GLM-5.2 · 91.2 GPQA Diamond
Moonshot AI open · Kimi K2.6 · 90.5 GPQA Diamond
xAI · Grok 4.3 · 90.1 GPQA Diamond
DeepSeek open · DeepSeek-V4-Pro · 90.1 GPQA Diamond
Meta open · Llama 4 Maverick · 69.8 GPQA Diamond

Each vendor's most recent publicly available flagship, ordered by score (highest first); Google, OpenAI, Anthropic, and xAI are highlighted. The weekly market-watch surfaces new releases automatically. A card tagged reported is the vendor's latest release, shown with vendor-reported scores (linked to source) until we independently cite it. Score = GPQA Diamond, except Mistral Large 3, which shows MATH; every number links to its source.
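The selection and ordering described above can be sketched in a few lines. This is a minimal illustration, not the site's implementation: it assumes "champion" means the highest GPQA Diamond score per vendor (the page also calls it the most recent flagship, which this sketch does not model). `Entry` and `vendor_champions` are hypothetical names; the scores are copied from the matrix on this page.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    vendor: str
    model: str
    score: float  # GPQA Diamond, % accuracy


# Subset of the GPQA Diamond scores shown on this page.
ENTRIES = [
    Entry("Anthropic", "Claude Fable 5", 92.6),
    Entry("Anthropic", "Claude Opus 4.8", 93.6),
    Entry("Google", "Gemini 3.1 Pro", 94.3),
    Entry("Google", "Gemma 3 27B", 42.4),
    Entry("Alibaba", "Qwen3.7-Max", 92.4),
    Entry("Meta", "Llama 4 Maverick", 69.8),
]


def vendor_champions(entries):
    """Keep the best-scoring entry per vendor, then sort highest first."""
    best = {}
    for entry in entries:
        current = best.get(entry.vendor)
        if current is None or entry.score > current.score:
            best[entry.vendor] = entry
    return sorted(best.values(), key=lambda e: e.score, reverse=True)


champions = vendor_champions(ENTRIES)
for c in champions:
    print(f"{c.vendor}: {c.model} ({c.score})")
```

Ties between vendors keep their first-seen order because Python's sort is stable; a real ranking would need an explicit tie-break rule.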

Compare two models

Claude Fable 5: 2 wins · Gemini 3.1 Pro: 1 win · 0 ties

Agentic coding

Benchmark	Claude Fable 5	Gemini 3.1 Pro	Δ
SWE-bench Verified (% resolved, pass@1)	95	80.6	+14.4
SWE-bench Pro (% resolved, pass@1)	80.3	54.2	+26.1

3 coverage gaps — only one model reports these

Terminal-Bench — Gemini 3.1 Pro 68.5
LiveCodeBench — Gemini 3.1 Pro 2887
FrontierCode — Claude Fable 5 29.3

Tool use & agents

5 coverage gaps — only one model reports these

TAU-bench — Gemini 3.1 Pro 99.3
OSWorld — Claude Fable 5 85
BrowseComp — Gemini 3.1 Pro 85.9
GDPval-AA — Claude Fable 5 1932
MCP Atlas — Gemini 3.1 Pro 69.2

Science & reasoning

Benchmark	Claude Fable 5	Gemini 3.1 Pro	Δ
GPQA Diamond (% accuracy)	92.6	94.3	-1.7

2 coverage gaps — only one model reports these

Humanity's Last Exam — Gemini 3.1 Pro 44.4
ARC-AGI-2 — Gemini 3.1 Pro 77.1

General knowledge

1 coverage gap — only one model reports these

MMMLU — Gemini 3.1 Pro 92.6

Multimodal

2 coverage gaps — only one model reports these

MMMU — Gemini 3.1 Pro 80.5
GDP.pdf — Claude Fable 5 29.8

Long context

1 coverage gap — only one model reports these

MRCR — Gemini 3.1 Pro 84.9

Shared benchmarks first (with Δ when both report the same scale); one-sided coverage collapses below. Pick any two models — or a champion above — and the URL becomes shareable.
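The comparison logic described above (shared benchmarks with a Δ, a win/tie tally, and one-sided coverage gaps) can be sketched as follows. This is an illustrative sketch under stated assumptions: every shared benchmark is on the same scale, higher is better, and the `compare` function and dictionary layout are hypothetical. The scores are a subset of those shown on this page.

```python
# Scores for two models, keyed by benchmark name (subset of this page).
fable_5 = {
    "SWE-bench Verified": 95.0,
    "SWE-bench Pro": 80.3,
    "GPQA Diamond": 92.6,
    "FrontierCode": 29.3,
    "OSWorld": 85.0,
}
gemini_31_pro = {
    "SWE-bench Verified": 80.6,
    "SWE-bench Pro": 54.2,
    "GPQA Diamond": 94.3,
    "Terminal-Bench": 68.5,
    "TAU-bench": 99.3,
}


def compare(a, b):
    """Return (rows, (wins_a, wins_b, ties), gaps) for two score dicts."""
    shared = sorted(a.keys() & b.keys())
    # Delta is a minus b, rounded to one decimal to hide float noise.
    rows = [(name, a[name], b[name], round(a[name] - b[name], 1)) for name in shared]
    wins_a = sum(1 for *_, delta in rows if delta > 0)
    wins_b = sum(1 for *_, delta in rows if delta < 0)
    ties = sum(1 for *_, delta in rows if delta == 0)
    # Benchmarks reported by exactly one of the two models.
    gaps = sorted(a.keys() ^ b.keys())
    return rows, (wins_a, wins_b, ties), gaps


rows, tally, gaps = compare(fable_5, gemini_31_pro)
for name, score_a, score_b, delta in rows:
    print(f"{name}\t{score_a}\t{score_b}\t{delta:+}")
print("wins/ties:", tally)
print("coverage gaps:", gaps)
```

A production version would also need to check that both values use the same unit before computing a Δ, since some cells on this page are Elo ratings rather than percentages.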

All models · benchmark matrix

15 of 15 columns shown

Columns tagged reported (Claude Opus 4.8, Mistral Large 3) are auto-discovered by our weekly market-watch from each vendor's own reported numbers. They are not independently verified, and they appear when a vendor ships a model newer than the hand-cited column beside it. Full claim sets are in Vendor-reported benchmarks below.

Agentic coding

Can the model fix real bugs, ship features, and operate a dev environment end-to-end as a coding agent — the single most-watched capability in 2026 vendor launches.

Benchmark	Fable 5	Opus 4.8 reported	Opus 4.8	Mythos 5	GPT-5.5	Gemini 3.1 Pro	Grok 4.3	DeepSeek-V4-Pro open	Qwen3.7-Max	Kimi K2.6 open	GLM-5.2 open	Llama 4 Maverick open	Mistral Medium 3.5 open	Mistral Large 3 reported	Gemma 3 27B open
SWE-bench Verified (% resolved, pass@1)	95	88.6	88.6	—	—	80.6	—	80.6	80.4	80.2	—	—	77.6	—	—
SWE-bench Pro (% resolved, pass@1)	80.3	69.2	69.2	77.8	58.6	54.2	—	—	60.6	58.6	62.1	—	—	—	—
Terminal-Bench (% solved, pass@1)	—	74.6	74.6	88	82.7	68.5	—	67.9	69.7	66.7	81	—	—	—	—
LiveCodeBench (% pass@1)	—	—	—	—	—	2887 (Elo, LiveCodeBench Pro)	—	93.5	—	89.6	—	43.4	—	—	29.7

Tool use & agents

Beyond writing code: can the model select and chain tools, follow policy, drive a computer/browser, and complete long-horizon multi-step tasks.

Benchmark	Fable 5	Opus 4.8 reported	Opus 4.8	Mythos 5	GPT-5.5	Gemini 3.1 Pro	Grok 4.3	DeepSeek-V4-Pro open	Qwen3.7-Max	Kimi K2.6 open	GLM-5.2 open	Llama 4 Maverick open	Mistral Medium 3.5 open	Mistral Large 3 reported	Gemma 3 27B open
TAU-bench (% pass / pass^k)	—	—	—	—	—	99.3	98	—	—	—	—	—	—	—	—
OSWorld (% success)	85	83.4	83.4	—	78.7	—	—	—	—	73.1	—	—	—	—	—
BrowseComp (% accuracy)	—	—	84.3	—	90.1	85.9	—	—	—	83.2	—	—	—	—	—

Math

Competition and research-level mathematical reasoning, increasingly reported on uncontaminated/post-cutoff problem sets.

Benchmark	Fable 5	Opus 4.8 reported	Opus 4.8	Mythos 5	GPT-5.5	Gemini 3.1 Pro	Grok 4.3	DeepSeek-V4-Pro open	Qwen3.7-Max	Kimi K2.6 open	GLM-5.2 open	Llama 4 Maverick open	Mistral Medium 3.5 open	Mistral Large 3 reported	Gemma 3 27B open
HMMT (% accuracy, pass@1)	—	—	—	—	—	—	—	—	97.1	92.7	—	—	—	—	—
MATH (% accuracy)	—	—	—	—	—	—	—	—	—	—	—	61.2	—	93.6	89

Science & reasoning

Expert-level, Google-proof reasoning across the sciences and broad academia — the benchmarks vendors point to when claiming 'PhD-level' or 'frontier' reasoning.

Benchmark	Fable 5	Opus 4.8 reported	Opus 4.8	Mythos 5	GPT-5.5	Gemini 3.1 Pro	Grok 4.3	DeepSeek-V4-Pro open	Qwen3.7-Max	Kimi K2.6 open	GLM-5.2 open	Llama 4 Maverick open	Mistral Medium 3.5 open	Mistral Large 3 reported	Gemma 3 27B open
GPQA Diamond (% accuracy)	92.6	93.6	93.6	—	93.6	94.3	90.1	90.1	92.4	90.5	91.2	69.8	—	—	42.4
Humanity's Last Exam (% accuracy)	—	49.8	57.9	64.5	57.2	44.4	—	37.7	41.4	54	40.5	—	—	—	—
ARC-AGI-2 (% solved)	—	—	—	—	85	77.1	—	—	—	—	—	—	—	—	—

General knowledge

Broad multi-subject factual and reasoning coverage; the classic 'how much does it know' bucket, now reported via the harder Pro variant since base MMLU is saturated.

Benchmark	Fable 5	Opus 4.8 reported	Opus 4.8	Mythos 5	GPT-5.5	Gemini 3.1 Pro	Grok 4.3	DeepSeek-V4-Pro open	Qwen3.7-Max	Kimi K2.6 open	GLM-5.2 open	Llama 4 Maverick open	Mistral Medium 3.5 open	Mistral Large 3 reported	Gemma 3 27B open
MMLU-Pro (% accuracy)	—	—	—	—	—	—	—	87.5	—	—	—	80.5	—	73.11	67.5

Multimodal

Vision + language understanding and visual reasoning — how well the model interprets images, diagrams, charts and figures.

Benchmark	Fable 5	Opus 4.8 reported	Opus 4.8	Mythos 5	GPT-5.5	Gemini 3.1 Pro	Grok 4.3	DeepSeek-V4-Pro open	Qwen3.7-Max	Kimi K2.6 open	GLM-5.2 open	Llama 4 Maverick open	Mistral Medium 3.5 open	Mistral Large 3 reported	Gemma 3 27B open
MMMU (% accuracy)	—	—	—	—	83.2	80.5	—	—	—	79.4	—	73.4	—	—	64.9

Long context

Retrieval and reasoning quality as context length grows into the hundreds-of-thousands / millions of tokens — beyond simple needle-in-a-haystack.

Benchmark	Fable 5	Opus 4.8 reported	Opus 4.8	Mythos 5	GPT-5.5	Gemini 3.1 Pro	Grok 4.3	DeepSeek-V4-Pro open	Qwen3.7-Max	Kimi K2.6 open	GLM-5.2 open	Llama 4 Maverick open	Mistral Medium 3.5 open	Mistral Large 3 reported	Gemma 3 27B open
MRCR (% accuracy)	—	—	—	—	74	84.9	—	—	—	—	—	—	—	—	—

Human preference

Aggregate real-user preference from blind head-to-head comparisons — the closest thing to a 'do people actually like the answers' metric, and the one number vendors love to top.

Benchmark	Fable 5	Opus 4.8 reported	Opus 4.8	Mythos 5	GPT-5.5	Gemini 3.1 Pro	Grok 4.3	DeepSeek-V4-Pro open	Qwen3.7-Max	Kimi K2.6 open	GLM-5.2 open	Llama 4 Maverick open	Mistral Medium 3.5 open	Mistral Large 3 reported	Gemma 3 27B open
LMArena Elo (Elo rating)	—	—	—	1458	1474	—	—	—	—	1466	—	1417	—	—	1338
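To give a feel for what a gap between two ratings in the row above means, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the conventional Elo logistic curve with a 400-point scale; this page does not state how LMArena fits or scales its ratings, so treat the result as a rough reading. `expected_win_rate` is a hypothetical helper name.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected share of head-to-head votes for A over B.

    Uses the conventional Elo logistic formula with a 400-point scale
    (an assumption; the leaderboard's own fitting method may differ).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings copied from the table row above.
gpt_55 = 1474
mythos_5 = 1458

p = expected_win_rate(gpt_55, mythos_5)
print(f"GPT-5.5 preferred over Mythos 5 in about {p:.1%} of comparisons")
```

Under that assumption, a 16-point gap corresponds to only a slight edge, which is why small Elo differences between flagships are best read as near-ties.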

Showing 13 flagship models across the headline benchmarks; 53 models and 84 benchmarks tracked in total from 60 primary & aggregator sources. Claude Mythos 5 numbers are limited (access-restricted preview). Numbers are published facts, restated with a per-cell source link; vendor benchmark charts are linked to their source, not rehosted.
Sources include: 9to5Google (Gemini 3 Flash launch coverage) · AIFire (citing OpenAI GPT-5.2 release) · Artificial Analysis · BenchLM.ai (citing LMArena) · BinaryVerse AI (xAI official figures) · BuildFastWithAI (citing OpenAI GPT-5.5 launch) · Caylent · Codersera (reporting Moonshot's figures) · DataCamp · DataCamp (reproducing Meta's Llama 4 launch chart) · DeepSeek-AI (arXiv 2512.02556) · DeepSeek-AI (Hugging Face model card) · Google (official Gemini 2.5 launch blog) · Google (official Gemini 3 Flash launch blog) · Google (official Gemini 3 launch blog) · Google DeepMind (Gemini 3.1 Pro model card) · +36 more.

Vendor-reported benchmarks

Numbers as claimed by the vendor on their own model/system card — not independently verified and often measured with a favourable harness. We track each vendor's claims over time and link to the source; cross-check against the cited matrix above.
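A cross-check of vendor claims against the cited matrix could look like the sketch below. It is illustrative only: the `cross_check` function, the tolerance value, and the hand-normalised benchmark names are assumptions (the vendor card labels the benchmark "Tau-Bench (τ²-Bench)" while the matrix row is "TAU-bench", and treating them as the same benchmark is itself an assumption). The numbers are the Grok 4.3 values shown on this page.

```python
def cross_check(vendor_claims, matrix_cells, tolerance=0.05):
    """Flag vendor claims that differ from, or are missing in, the matrix."""
    findings = []
    for bench, claimed in vendor_claims.items():
        cited = matrix_cells.get(bench)
        if cited is None:
            findings.append((bench, "not in matrix", claimed, None))
        elif abs(claimed - cited) > tolerance:
            findings.append((bench, "mismatch", claimed, cited))
    return findings


# Grok 4.3 values from this page; benchmark names normalised by hand.
grok_vendor_claims = {"GPQA Diamond": 90.1, "TAU-bench": 97.7, "SciCode": 47.3}
grok_matrix_cells = {"GPQA Diamond": 90.1, "TAU-bench": 98.0}

findings = cross_check(grok_vendor_claims, grok_matrix_cells)
for bench, status, claimed, cited in findings:
    print(f"{bench}: {status} (vendor {claimed}, matrix {cited})")
```

A mismatch is not necessarily an error: it can reflect rounding, a different benchmark version, or a different harness, so each flagged pair still needs a look at the linked sources.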

Claude Opus 4.8 Anthropic

SWE-bench Verified	88.6%
SWE-bench Pro	69.2%
Terminal-Bench 2.1 · Terminus-2 public harness	74.6%
GPQA Diamond	93.6%
Humanity's Last Exam · without tools	49.8%
Humanity's Last Exam · with tools	57.9%
OSWorld-Verified	83.4%
Online-Mind2Web · browser agent benchmark	84%
USAMO 2026 · Olympic-level mathematical proofs	96.7%
MCP-Atlas · multi-step tool-calling	82.2%
GDPval-AA · economically valuable knowledge work	1890 Elo

vendor card ↗

GPT-5.5 OpenAI

Terminal-Bench 2.0	82.7%
SWE-Bench Pro	58.6%
SWE-Bench Verified	88.7%
GDPval · 44 occupations	84.9%
OSWorld-Verified	78.7%
ARC-AGI-2 · Verified	85.0%
FrontierMath Tier 4	35.4%
Expert-SWE · internal frontier eval	73.1%
MRCR v2 (1M tokens) · long-context retrieval at 512K-1M tokens	74.0%
CyberGym	81.8%
Tau2-Bench Telecom · without prompt tuning	98.0%
Humanity's Last Exam (no tools)	41.4%

vendor card ↗

Grok 4.3 xAI

GPQA Diamond · Graduate-level science reasoning; from Artificial Analysis and multiple sources	90.1%
Tau-Bench (τ²-Bench) · Tool-use and agentic benchmark	97.7%
GDPval-AA · Agentic task performance; xAI-reported improvement of 321 points from Grok 4.20	1500 Elo
Artificial Analysis Intelligence Index · High reasoning mode on v4.1; composite of 9 evaluations	38 index
SciCode · Code generation and problem-solving	47.3%

vendor card ↗

Gemini 3.1 Pro Google

GPQA Diamond	94.3%
ARC-AGI-2	77.1%
MATH	95.1%
SWE-Bench Verified	80.6%
Humanity's Last Exam	44.4%
Terminal-Bench 2.0	68.5%
LiveCodeBench Pro	2887 Elo
BrowseComp	85.9%
MCP Atlas	69.2%
τ-Bench Telecom	99.3%
APEX-Agents	33.5%
MMMLU	92.6%

vendor card ↗

Qwen3.7-Max Alibaba

GPQA Diamond	92.4%
SWE-Bench Pro	60.6 pass@1
Terminal-Bench 2.0-Terminus	69.7%
Humanity's Last Exam (HLE)	41.4%
HMMT 2026 Feb	97.1%
IMOAnswerBench	90%
Apex	44.5%
MCP-Atlas	76.4%
MCP-Mark	60.8%
SWE-Bench Verified	80.4 pass@1
LiveCodeBench	91.6%
SpreadSheetBench-v1	87%

vendor card ↗

DeepSeek-V4-Pro DeepSeek

GPQA Diamond · V4-Pro-Max	90.1%
SWE-bench Verified · V4-Pro-Max	80.6%
MMLU-Pro · V4-Pro-Max	87.5%
LiveCodeBench · V4-Pro-Max, Pass@1	93.5%
Humanity's Last Exam · V4-Pro-Max, Pass@1	37.7%
AIME 2025 · V4-Pro-Max, Pass@1 on HMMT 2026 Feb	95.2%
Codeforces · V4-Pro-Max	3206 Rating
Terminal-Bench 2.0 · V4-Pro-Max	67.9%
SimpleQA-Verified · V4-Pro-Max	57.9%
SWE-Bench Multilingual · V4-Pro-Max	76.2%
IMOAnswerBench · V4-Pro-Max	89.8%

vendor card ↗

Llama 4 Maverick Meta

MMLU Pro	80.5%
GPQA Diamond	69.8%
LiveCodeBench · averaged over multiple generations	43.4 pass@1
HumanEval	86.4%
Multilingual MMLU	84.6%
GSM8K	95.2%
MATH-500	85.3%
SWE-bench Verified	74.2%

vendor card ↗

Mistral Large 3 Mistral AI

MMLU-Pro · Independent evaluation via LayerLens/Atlas	73.11%
MATH-500 · Independent evaluation via LayerLens/Atlas	93.60%
HumanEval · Python code generation	90.24%
AGIEval English · Academic multiple-choice knowledge	74.00%

vendor card ↗

Kimi K2.6 Moonshot AI

SWE-Bench Pro · Thinking mode enabled	58.6%
SWE-Bench Verified · Thinking mode enabled	80.2%
Humanity's Last Exam (with tools) · Thinking mode enabled	54.0%
AIME 2026 · Thinking mode enabled, avg@32	96.4%
GPQA-Diamond · Thinking mode enabled, avg@8	90.5%
LiveCodeBench v6 · Thinking mode enabled	89.6%
Terminal-Bench 2.0 · Terminus-2 framework, thinking mode	66.7%
BrowseComp · Standard mode	83.2%
BrowseComp (Agent Swarm) · Agent Swarm mode	86.3%
DeepSearchQA (F1) · F1 score metric	92.5%
Toolathlon · Thinking mode enabled	50.0%

vendor card ↗

GLM-5.2 Zhipu AI (Z.ai)

SWE-bench Pro · 400K context, temp=1, top_p=1, max_tokens=32k	62.1%
Terminal-Bench 2.1 · Claude Code 2.1.167, temp=1.0, top_p=0.95, 5 runs averaged	81.0%
FrontierSWE (Dominance) · 1M context, max effort level, 128K max output tokens	74.4%
PostTrainBench · 1M context, max effort level, 128K max output tokens	34.3%
SWE-Marathon · 1M context, max effort level, 128K max output tokens	13.0%
MCP-Atlas · Public set, think mode, 500-task subset, 10-min timeout	76.8%
Tool-Decathlon · Official evaluation service, max_token=128K	48.2%
AIME 2026 · Math competition benchmark	99.2%
GPQA-Diamond · Graduate-level science Q&A	91.2%
Humanity's Last Exam (with tools) · With external tools enabled	54.7%
NL2Repo · Repository generation task, 400K context	58.2%
DeepSWE · Official evaluation framework, 400K context	46.2%

vendor card ↗

Model changes

New releases, deprecations, and benchmark score moves we’ve recorded — newest first.

Vendor claim · xAI, reported benchmarks · Jul 3, 2026
xAI reported benchmarks updated
Grok 4.3: 5 benchmark claims (via web search)

Benchmark · LMArena (text) · Jul 3, 2026
LMArena (text) scores changed
Arena Elo (text, overall) updated.

Model · NVIDIA Nemotron · Jul 3, 2026
NVIDIA Nemotron models changed
Added nvidia/MiniMax-M2.7-DFlash model with other license type, available until 2026-06-16.

Vendor claim · Meta Llama, reported benchmarks · Jul 3, 2026
Meta reported benchmarks updated
Llama 4 Maverick: 8 benchmark claims (via web search)

Silent update · Epoch Capabilities Index · Jul 3, 2026
Same model id, score moved on Epoch Capabilities Index (ECI): a silent re-evaluation or model swap. 21 entries:

deepseek-coder-1.3b-base: ECI ↑ 63.596 → 63.813
Cerebras-GPT-13B: ECI ↑ 82.645 → 82.79
starcoder2-3b: ECI ↑ 88.205 → 88.329
deepseek-coder-6.7b-base: ECI ↑ 89.062 → 89.184
dolly-v2-12b: ECI ↑ 89.113 → 89.235
Baichuan-7B: ECI ↑ 89.891 → 90.006
phi-1_5: ECI ↑ 91.015 → 91.132
xgen-7b-8k-base: ECI ↑ 92.881 → 92.989
starcoder2-7b: ECI ↑ 93.025 → 93.132
gemma-2b: ECI ↑ 93.684 → 93.789
mpt-7b: ECI ↑ 94.078 → 94.181
falcon-7b: ECI ↑ 94.588 → 94.689
deepseek-coder-33b-base: ECI ↑ 95.816 → 95.912
Baichuan-2-7B-Base: ECI ↑ 95.829 → 95.924
LLaMA-7B: ECI ↑ 96.2 → 96.296
Llama-2-7b: ECI ↑ 98.591 → 98.678
LLaMA-13B: ECI ↑ 100.124 → 100.207
mpt-30b: ECI ↑ 100.202 → 100.284
mpt-30b-instruct: ECI ↑ 100.202 → 100.284
INTELLECT-1-Instruct: ECI ↑ 100.403 → 100.484
Qwen2.5-Coder-1.5B: ECI ↑ 102.557 → 102.629
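The "silent update" entries above flag a score that changed while the model id stayed the same. One way to detect that is to diff two leaderboard snapshots, as in the minimal sketch below. The snapshot layout and the `silent_updates` function are assumptions for illustration, not the tracker's actual pipeline; the first two scores come from the entries above, and `example/unchanged-model` and `example/new-model` are made-up ids.

```python
def silent_updates(previous, current):
    """Return (model_id, old, new) for ids present in both snapshots
    whose score changed, sorted by new score ascending."""
    changes = []
    for model_id, new_score in current.items():
        old_score = previous.get(model_id)
        # New ids are releases, not silent updates, so they are skipped.
        if old_score is not None and old_score != new_score:
            changes.append((model_id, old_score, new_score))
    return sorted(changes, key=lambda change: change[2])


previous_snapshot = {
    "deepseek-coder-1.3b-base": 63.596,
    "Cerebras-GPT-13B": 82.645,
    "example/unchanged-model": 70.0,  # hypothetical id
}
current_snapshot = {
    "deepseek-coder-1.3b-base": 63.813,
    "Cerebras-GPT-13B": 82.79,
    "example/unchanged-model": 70.0,  # hypothetical id, same score
    "example/new-model": 50.0,  # hypothetical id, not in previous snapshot
}

updates = silent_updates(previous_snapshot, current_snapshot)
for model_id, old, new in updates:
    arrow = "↑" if new > old else "↓"
    print(f"{model_id}: ECI {arrow} {old} → {new}")
```

The diff alone cannot tell a re-evaluation from a model swap; it only shows that the published number behind a stable id moved.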