AI Model Benchmarks Jun 2026

18 benchmarks - the world's most-followed benchmarks, curated by AI Explained, author of SimpleBench

These benchmarks are run independently by Epoch, Scale and others, so scores may not match those self-reported by AI organizations.

Humanity's Last Exam

2,500 of the toughest, subject-diverse, multi-modal questions designed to test for both depth of reasoning and breadth of knowledge. Created in partnership with the Center for AI Safety, HLE includes questions across mathematics, humanities, and natural sciences from nearly 1,000 expert contributors.
Rank | Models (no tools) | Score
1 | Gemini 3.1 Pro Preview (high thinking) | 46.4% ±2.0
2 | GPT-5.4 Pro | 44.3% ±2.0
3 | Muse Spark | 40.6% ±1.9
4 | Gemini 3 Pro Preview | 37.5% ±1.9
5 | GPT-5.4 (xhigh) | 36.2% ±1.9

SimpleBench

Asks "trick" questions that require common-sense reasoning rather than memorized facts. Models must avoid being misled by common traps.
Rank | Model | Score
1 | Claude Fable 5 | 81.9%
2 | Gemini 3.1 Pro Preview | 79.6%
3 | GPT-5.5 Pro | 76.9%
4 | Gemini 3.5 Flash | 76.7%
5 | Gemini 3 Pro Preview | 76.4%

METR Time Horizons

METR's time horizon is the human task duration at which an AI model reaches 50% success. Tasks are drawn from RE-Bench, HCAST and SWAA, which cover machine-learning research engineering, general software engineering and software operations respectively.
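
One way to picture the 50% horizon is to fit a logistic curve of success against the logarithm of human task duration and read off where it crosses 50%. The sketch below does that on made-up data; the function names and the Newton-style fit are illustrative assumptions, not METR's actual code.

```python
import math

def fit_logistic(xs, ys, iters=25):
    """Fit P(success) = sigmoid(a + b*x) with Newton's method (hypothetical helper)."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. a
            g1 += (y - p) * x      # gradient w.r.t. b
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def time_horizon_minutes(durations_min, successes, p=0.5):
    """Human task duration (minutes) at which the fitted success rate equals p."""
    xs = [math.log2(d) for d in durations_min]
    a, b = fit_logistic(xs, successes)
    logit = math.log(p / (1.0 - p))
    # sigmoid(a + b*x) = p  =>  x = (logit(p) - a) / b
    return 2 ** ((logit - a) / b)

# Synthetic example: 4 attempts at each duration 1, 2, 4, ..., 1024 minutes.
successes_per_duration = [4, 4, 4, 3, 3, 2, 1, 1, 0, 0, 0]
durations, outcomes = [], []
for k, wins in enumerate(successes_per_duration):
    for attempt in range(4):
        durations.append(2 ** k)
        outcomes.append(1 if attempt < wins else 0)

horizon = time_horizon_minutes(durations, outcomes)
```

The fit needs a mix of successes and failures at overlapping durations; if the outcomes are perfectly separated by duration, the logistic fit has no finite solution.
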
Rank | Model | Minutes
1 | Claude Mythos Preview | 1044.8
2 | Claude Opus 4.6 (unknown thinking) | 718.8
3 | Gemini 3.1 Pro Preview | 384.1
4 | GPT-5.2 (high) | 352.2
5 | GPT-5.3 Codex | 349.5

SWE-bench Verified

A human-curated subset of 500 GitHub issues from the SWE-bench dataset that tests whether models can implement valid code fixes. The model interacts with a Python repository and must modify the correct files to fix the issue. The solution is judged by running unit tests.
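
As a rough sketch of what "judged by running unit tests" can mean: an issue counts as resolved when the tests that failed before the fix now pass and the tests that passed before still pass. SWE-bench labels these two groups FAIL_TO_PASS and PASS_TO_PASS; the function names and data layout below are hypothetical.

```python
def is_resolved(results, fail_to_pass, pass_to_pass):
    """results maps test name -> True if the test passed after applying the model's patch.

    A test missing from results is treated as a failure.
    """
    fixed = all(results.get(test, False) for test in fail_to_pass)
    unbroken = all(results.get(test, False) for test in pass_to_pass)
    return fixed and unbroken

def resolve_rate(instances):
    """Percentage of instances resolved; each instance is (results, fail_to_pass, pass_to_pass)."""
    resolved = sum(is_resolved(*instance) for instance in instances)
    return 100.0 * resolved / len(instances)

# Made-up example: the first patch fixes the bug, the second breaks an existing test.
instances = [
    ({"test_bug": True, "test_old": True}, ["test_bug"], ["test_old"]),
    ({"test_bug": True, "test_old": False}, ["test_bug"], ["test_old"]),
]
rate = resolve_rate(instances)
```
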
Rank | Model | Score
1 | Claude Opus 4.7 (max) | 83.5% ±1.7
2 | GPT-5.5 (xhigh) | 80.6% ±1.8
3 | Gemini 3.5 Flash (high) | 79.3% ±1.8
4 | Claude Opus 4.6 (no thinking) | 78.7% ±1.9
5 | GPT-5.4 (high) | 76.9% ±1.9

GPQA Diamond

A multiple-choice set of 198 PhD-level science questions in biology, chemistry and physics. It focuses on "Diamond" items that domain experts answered correctly while non-experts often failed; with four answer options per question, random guessing yields ~25%.
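
With only 198 questions, sampling noise matters when comparing close scores. A minimal sketch of the chance baseline and of a binomial standard error for an accuracy score follows; whether the leaderboard derives its ± values this way is an assumption, and the function names are hypothetical.

```python
import math

def chance_accuracy(num_options):
    """Expected accuracy from uniform random guessing."""
    return 1.0 / num_options

def binomial_standard_error(accuracy, n_questions):
    """Standard error of an accuracy estimate, treating questions as independent trials."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

baseline = chance_accuracy(4)                     # four options per question
stderr = binomial_standard_error(0.946, 198)      # 94.6% on 198 questions
stderr_points = round(100 * stderr, 1)            # in percentage points
```

For 94.6% on 198 questions this gives about 1.6 percentage points, so gaps of a few tenths of a point between top models sit well inside the noise.
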
Rank | Model | Score
1 | GPT-5.4 Pro (xhigh) | 94.6% ±1.6
2 | Gemini 3.1 Pro Preview | 94.1% ±1.7
3 | GPT-5.5 (xhigh) | 94.0% ±1.5
4 | GPT-5.5 Pro (xhigh) | 93.9% ±1.6
5 | GPT-5.4 (xhigh) | 93.3% ±1.8

GDPval

GDPval is a new OpenAI-led benchmark spanning 44 knowledge work occupations, selected from the top 9 industries contributing to U.S. GDP, from software developers and lawyers to registered nurses and mechanical engineers. These occupations represent the types of day-to-day work where AI can meaningfully assist professionals.
Rank | Model | Score
1 | GPT-5.2 | 49.7%
2 | Claude Opus 4.5 | 45.5%
3 | Claude Opus 4.1 | 43.6%
4 | Claude Sonnet 4.5 | 42.5%
5 | Gemini 3 Pro Preview | 40.3%

Text Arena (Coding)

Previously known as WebDev Arena, this benchmark pits models against each other to build websites or web apps from prompts. Voters choose the better result, and scores are calculated with a Bradley-Terry model.
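
To illustrate how pairwise votes can become a score, here is a minimal Bradley-Terry fit using the standard iterative (minorization-maximization) update. It is a sketch on invented votes, not the arena's pipeline; the function names and the 400-point logarithmic display scale are assumptions.

```python
import math
from collections import defaultdict

def fit_bradley_terry(battles, iters=200):
    """battles: list of (winner, loser) pairs. Returns model -> strength.

    Under Bradley-Terry, P(i beats j) = s_i / (s_i + s_j).
    Assumes every model has at least one win, otherwise its strength collapses to 0.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(n / (strength[i] + strength[j])
                        for pair, n in pair_counts.items() if i in pair
                        for j in pair - {i})
            updated[i] = wins[i] / denom
        # Strengths are only defined up to a constant factor, so fix the geometric mean at 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength

def to_rating(strength, scale=400.0, base=1000.0):
    """Map a strength to an Elo-like number (hypothetical display convention)."""
    return base + scale * math.log10(strength)

# Invented votes: model A beats model B three times out of four.
votes = [("A", "B"), ("A", "B"), ("A", "B"), ("B", "A")]
strengths = fit_bradley_terry(votes)
win_prob_a = strengths["A"] / (strengths["A"] + strengths["B"])
rating_gap = to_rating(strengths["A"]) - to_rating(strengths["B"])
```

A rating gap only encodes a head-to-head win probability, so small differences between neighbouring models correspond to near coin-flip matchups.
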
Rank | Model | Score
1 | Claude Opus 4.7 | 1566.9
2 | Claude Opus 4.6 | 1556.3
3 | Claude Opus 4.8 | 1552.2
4 | Qwen3.7-Max | 1540.8
5 | GLM-5.1 | 1534.0

GSO (General Speedup Optimization)

Assesses models' ability to optimize software performance. Each task requires making code changes within a limited number of attempts. Performance is measured using OPT@K - the percentage of tasks where the model achieves at least 95% of the human-achieved speedup.
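
The OPT@K definition above can be sketched as follows. Taking the best of the first K attempts is one reading of "within a limited number of attempts" and is an assumption, as are the function name and data layout.

```python
def opt_at_k(tasks, k, threshold=0.95):
    """Percentage of tasks solved within k attempts.

    tasks: list of (human_speedup, [model_speedup for each attempt]).
    A task counts as solved when the best of the first k attempts reaches
    at least `threshold` times the human-achieved speedup.
    """
    solved = 0
    for human_speedup, attempts in tasks:
        best = max(attempts[:k], default=0.0)
        if best >= threshold * human_speedup:
            solved += 1
    return 100.0 * solved / len(tasks)

# Invented speedups (e.g. 2.0 means the optimized code runs twice as fast).
tasks = [
    (2.0, [1.0, 1.95]),   # reaches 95% of the human speedup on the second attempt
    (4.0, [3.0, 3.5]),    # never reaches 95% of 4.0
    (1.5, [1.5]),         # matches the human speedup on the first attempt
]
opt_at_1 = opt_at_k(tasks, k=1)
opt_at_2 = opt_at_k(tasks, k=2)
```
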
Rank | Model | Score
1 | Claude Opus 4.7 | 44.1%
2 | Claude Opus 4.6 (high) | 41.2%
3 | GPT-5.5 (xhigh) | 40.2%
4 | GPT-5.4 (xhigh) | 31.4%
5 | GPT-5.2 (high) | 27.4%

Fiction.liveBench

Models must answer questions about long serialized stories hosted on Fiction.live. Questions probe recall of events, characters and chronological order. The benchmark emphasizes long-context comprehension.
Rank | Model | Score
1 | o3 (medium) | 100.0%
2 | GPT-5 (medium) | 96.9%
3 | Grok 4 | 96.9%
4 | Gemini 2.5 Pro Exp (Mar '25) | 90.6%
5 | o3-pro | 88.9%

BALROG

Evaluates models on text-based games of varying difficulty (e.g., "NetHack", "Crafter"). Scores reflect the percentage of games completed successfully; error bars come from repeated runs.
Rank | Model | Score
1 | Gemini 3 Pro Preview | 58.1% ±2.1
2 | Gemini 3.1 Pro Preview | 57.0% ±2.0
3 | Gemini 3 Flash | 48.1% ±2.4
4 | Grok 4 | 43.6% ±2.2
5 | Claude Opus 4.5 | 43.5% ±2.3

OTIS Mock AIME 2024-25

This benchmark uses 45 integer-answer problems from unofficial Mock AIME exams (2024-2025). Problems are harder than MATH Level 5 but easier than FrontierMath and have answers between 0 and 999.
Rank | Model | Score
1 | GPT-5.5 Pro (xhigh) | 100.0% ±0.0
2 | GPT-5.5 (xhigh) | 100.0% ±0.0
3 | Claude Fable 5 (max) | 99.7% ±0.3
4 | Claude Opus 4.8 | 98.3% ±1.4
5 | Claude Opus 4.7 (xhigh) | 97.8% ±2.2

MATH Level 5

The Level 5 subset of the MATH dataset contains the hardest competition-style problems from AMC 10, AMC 12 and AIME. Answers are scored using a combination of normalized string match, symbolic equivalence and model-graded equivalence.
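
The three-stage scoring can be pictured as a cascade that tries the cheapest check first. In this sketch exact rational comparison stands in for full symbolic equivalence, and the model grader is an optional callback; all names and normalization rules are hypothetical simplifications, not the benchmark's grader.

```python
from fractions import Fraction

def normalize(answer):
    """Strip whitespace, surrounding dollar signs and a trailing period."""
    text = answer.strip().lower().strip("$")
    return text.replace(" ", "").rstrip(".")

def as_fraction(text):
    """Parse '3', '0.5' or '1/2' into an exact Fraction, or return None."""
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return None

def is_equivalent(predicted, reference, model_grader=None):
    p, r = normalize(predicted), normalize(reference)
    if p == r:                                   # stage 1: normalized string match
        return True
    fp, fr = as_fraction(p), as_fraction(r)
    if fp is not None and fr is not None:        # stage 2: stand-in for symbolic equivalence
        return fp == fr
    if model_grader is not None:                 # stage 3: defer to a model-based judge
        return bool(model_grader(predicted, reference))
    return False

same_value = is_equivalent(" $0.5$ ", "1/2")
needs_judge = is_equivalent("x+1", "1+x")
judged = is_equivalent("x+1", "1+x", model_grader=lambda pred, ref: True)
```

Ordering the checks this way keeps the model-graded step for the answers that the cheaper rules cannot settle.
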
Rank | Model | Score
1 | GPT-5 (high) | 98.1% ±0.3
2 | GPT-5 (medium) | 97.9% ±0.3
3 | GPT-5 mini (high) | 97.8% ±0.3
4 | o4-mini (high) | 97.8% ±0.3
5 | o3 (high) | 97.8% ±0.3

FrontierMath Tiers 1-3 (v2)

Expert-written unpublished mathematics problems covering advanced undergraduate through early-career research difficulty. Epoch released v2 on June 12, 2026 after correcting and removing problematic items.
Rank | Model | Score
1 | GPT-5.5 Pro (xhigh) | 87.7% ±1.9
2 | Claude Fable 5 (max) | 87.0% ±2.0
3 | GPT-5.5 (xhigh) | 85.3% ±2.1
4 | Claude Opus 4.8 | 80.0% ±2.4
5 | GPT-5.4 (xhigh) | 78.6% ±2.4

FrontierMath Tier 4 (v2)

Research-level FrontierMath problems from Epoch AI. Tier 4 contains the hardest private problems in the v2 FrontierMath release.
Rank | Model | Score
1 | Claude Fable 5 (max) | 87.8% ±5.2
2 | GPT-5.5 Pro (xhigh) | 78.0% ±6.5
3 | AI co-mathematician | 75.6% ±6.7
4 | GPT-5.5 (xhigh) | 72.5% ±7.1
5 | Claude Opus 4.8 | 56.1% ±7.8

WeirdML v2

Asks models to write code that trains machine-learning models to solve non-standard tasks (e.g., recognizing shapes, classifying digits, predicting chess outcomes). Models iterate on code, training and evaluating within a constrained environment.
Rank | Model | Score
1 | Claude Fable 5 (high) | 87.9%
2 | GPT-5.5 (xhigh) | 84.9%
3 | Claude Opus 4.8 (xhigh) | 82.9%
4 | GPT-5.3 Codex | 79.3%
5 | Claude Opus 4.6 (high) | 78.0%

Terminal-Bench 2.0

Models use a terminal to complete assignments such as editing files, running commands and debugging code. The tasks come from various agent frameworks; performance is the success rate.
Rank | Model | Score
1 | Claude Opus 4.7 | 90.2% ±2.1
2 | GPT-5.5 | 84.7% ±2.1
3 | GPT-5.4 | 81.8% ±2.0
4 | Gemini 3.1 Pro Preview | 80.2% ±2.6
5 | Claude Opus 4.6 | 79.8% ±1.6

VPCT (Visual Physics Comprehension Test)

Each problem shows an image of a ramp with buckets; the model must predict in which bucket a ball will land. Tasks test basic understanding of gravity and motion.
Rank | Model | Score
1 | Gemini 3 Pro Preview | 91.0%
2 | GPT-5.2 (xhigh) | 84.0%
3 | Gemini 3 Flash | 72.6%
4 | GPT-5.2 (high) | 67.0%
5 | GPT-5 (high) | 66.0%

GeoBench

Inspired by GeoGuessr. For each of 100 street-level photos (sampled from five community maps), models must guess the country and the precise latitude/longitude. Performance is measured using geographic distance and country accuracy; tasks require visual recognition, text extraction and geographic reasoning.
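
The two ingredients named above, geographic distance and country accuracy, can be sketched with the haversine great-circle formula. How GeoBench combines them into its final score is not stated here, so this example stops at the raw measurements; the function names and tuple layout are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # min() guards against rounding pushing the argument slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def summarize(guesses):
    """guesses: list of (guessed_country, true_country, (lat, lon) guess, (lat, lon) truth).

    Returns (country accuracy as a fraction, mean distance error in km).
    """
    correct = sum(1 for g_country, t_country, _, _ in guesses if g_country == t_country)
    distances = [haversine_km(*guess, *truth) for _, _, guess, truth in guesses]
    return correct / len(guesses), sum(distances) / len(distances)

# Invented guesses: one exact hit, one a quarter of the way around the equator.
guesses = [
    ("FR", "FR", (48.85, 2.35), (48.85, 2.35)),
    ("DE", "FR", (0.0, 90.0), (0.0, 0.0)),
]
country_accuracy, mean_error_km = summarize(guesses)
```
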
Rank | Model | Score
1 | Gemini 3 Pro Preview | 3893
2 | Gemini 2.5 Pro Preview (May '25) | 3836
3 | o3 (high) | 3789
4 | Gemini 2.0 Flash (Feb '25) | 3659
5 | GPT-5 (medium) | 3498