AI Model Benchmarks Mar 2026

18 of the world's most-followed benchmarks, curated by AI Explained, author of SimpleBench

These benchmarks are run independently by Epoch, Scale and others, so scores may not match the figures self-reported by AI orgs.


Humanity's Last Exam

2,500 of the toughest, subject-diverse, multi-modal questions designed to test for both depth of reasoning and breadth of knowledge. Created in partnership with the Center for AI Safety, HLE includes questions across mathematics, humanities, and natural sciences from nearly 1,000 expert contributors.
Models (no tools), ranked by score:
1. Gemini 3 Pro Preview: 37.52% ±1.90
2. Claude Opus 4.6 (max): 34.44% ±1.86
3. GPT-5 Pro: 31.64% ±1.82
4. GPT-5.2: 27.80% ±1.76
5. GPT-5 (August '25): 25.32% ±1.70
Full Results
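
One plausible reading of the ± value is a 95% binomial confidence interval over the 2,500 questions. A minimal arithmetic check under that assumption (it is an assumption, not stated on the page), using the top score above:

```python
import math

# Assumption: the ± value is a ~95% binomial confidence interval.
p = 0.3752   # Gemini 3 Pro Preview's HLE score above
n = 2500     # number of HLE questions

se = math.sqrt(p * (1 - p) / n)   # standard error of a proportion
ci95 = 1.96 * se                  # ~95% interval half-width

print(f"±{ci95 * 100:.2f} percentage points")   # ≈ ±1.90, matching the table
```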

SimpleBench

Asks "trick" questions that require common-sense reasoning rather than memorized facts. Models must avoid being misled by common traps.
Models, ranked by score:
1. Gemini 3.1 Pro Preview: 79.6%
2. Gemini 3 Pro Preview: 76.4%
3. GPT-5.4 Pro: 74.1%
4. Claude Opus 4.6: 67.6%
5. Gemini 2.5 Pro (06-05): 62.4%
Full Results

METR Time Horizons

METR's time horizon is the human task duration at which an AI model reaches 50% success. Tasks are drawn from RE-Bench, HCAST and SWAA, which cover machine-learning research engineering, general software engineering and software operations respectively.
Models, ranked by time horizon (minutes):
1. Claude Opus 4.5 (16k thinking): 288.9 ±558.2
2. GPT-5 (medium): 137.3 ±102.1
3. Claude Sonnet 4.5: 113.3 ±91.4
4. Grok 4: 110.1 ±91.8
5. Claude Opus 4.1: 105.5 ±69.2
Full Results
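
A minimal sketch of how such a time horizon can be estimated: fit a logistic curve of success probability against log task duration, then read off the 50% crossing. The task data below is made up for illustration; it is not METR's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data only: human task durations (minutes) and whether the model solved each task.
durations = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
solved = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

# Fit success probability as a logistic function of log task duration
# (large C effectively disables regularization).
X = np.log(durations).reshape(-1, 1)
clf = LogisticRegression(C=1e6).fit(X, solved)

# The time horizon is the duration where predicted success crosses 50%,
# i.e. where intercept + coef * log(t) = 0.
horizon = float(np.exp(-clf.intercept_[0] / clf.coef_[0, 0]))
print(f"Estimated 50% time horizon: ~{horizon:.0f} minutes")
```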

SWE-bench Verified

A human-curated subset of 500 GitHub issues from the SWE-bench dataset tests whether models can implement valid code fixes. The model interacts with a Python repository and must modify the correct files to fix the issue. The solution is judged by running unit tests.
Models, ranked by score:
1. Claude Opus 4.6: 78.7% ±1.9
2. GPT-5.4 (high): 76.9% ±1.9
3. Claude Opus 4.5: 76.7% ±1.9
4. Gemini 3.1 Pro Preview: 75.6% ±2.0
5. Gemini 3 Flash: 75.4% ±2.0
Full Results
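
A minimal sketch of the pass/fail check described above: apply the model's patch to a checkout of the repository, then run the issue's unit tests. The function and its arguments are placeholders, not the official SWE-bench harness, which pins per-instance environments and test commands.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_ids: list[str]) -> bool:
    """Apply a model-generated patch and check that the issue's tests now pass.

    repo_dir, patch_file and test_ids are illustrative placeholders.
    """
    # Apply the model's proposed fix to a clean checkout of the repository.
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)

    # Run only the tests associated with the GitHub issue.
    result = subprocess.run(["python", "-m", "pytest", *test_ids],
                            cwd=repo_dir, capture_output=True)

    # The fix counts as resolved only if every targeted test passes.
    return result.returncode == 0
```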

GPQA Diamond

A multiple-choice set of 198 PhD-level science questions in biology, chemistry and physics. It focuses on "Diamond" items for which domain experts answered correctly while non-experts often failed; random guessing yields ~25%.
Models, ranked by score:
1. Gemini 3.1 Pro Preview: 94.1% ±1.7
2. Gemini 3 Pro Preview: 92.6% ±1.7
3. GPT-5.2 (xhigh): 91.4% ±1.8
4. Claude Opus 4.6 (32k thinking): 90.5% ±1.7
5. Claude Opus 4.6 (64k thinking): 88.8% ±1.9
Full Results

GDPval

GDPval is a new OpenAI-led benchmark spanning 44 knowledge work occupations, selected from the top 9 industries contributing to U.S. GDP, from software developers and lawyers to registered nurses and mechanical engineers. These occupations represent the types of day-to-day work where AI can meaningfully assist professionals.
Models, ranked by score:
1. GPT-5.4: 83.0%
2. GPT-5.3 Codex: 70.9%
3. GPT-5.2: 70.9%
4. Claude Opus 4.5: 59.6%
5. Gemini 3 Pro Preview: 53.5%
Full Results

GSO (General Speedup Optimization)

Assesses models' ability to optimize software performance. Each task requires making code changes within a limited number of attempts (K). Performance is measured with OPT@K: the percentage of tasks where, within K attempts, the model achieves at least 95% of the human-achieved speedup.
Models, ranked by score:
1. GPT-5.2 (high): 27.4%
2. Claude Opus 4.5 (no-thinking): 26.5%
3. Gemini 3 Pro Preview: 18.6%
4. Gemini 3 Flash: 9.8%
5. o3 (high): 8.8%
Full Results
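
A minimal sketch of the OPT@K calculation described above, over a made-up set of per-task results (the speedup numbers are illustrative only):

```python
# Illustrative per-task data: best speedup achieved by the model within K attempts,
# alongside the human-achieved speedup for the same task.
tasks = [
    {"model_speedup": 3.1, "human_speedup": 3.0},
    {"model_speedup": 1.2, "human_speedup": 2.5},
    {"model_speedup": 5.0, "human_speedup": 4.8},
]

def opt_at_k(tasks, threshold=0.95):
    # A task counts as solved if the model reaches at least 95% of the human speedup.
    solved = sum(t["model_speedup"] >= threshold * t["human_speedup"] for t in tasks)
    return solved / len(tasks)

print(f"OPT@K = {opt_at_k(tasks):.1%}")  # 2 of 3 tasks -> 66.7%
```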

Fiction.liveBench

Models must answer questions about long serialized stories hosted on Fiction.live. Questions probe recall of events, characters and chronological order. The benchmark emphasizes long-context comprehension.
Models, ranked by score:
1. o3 (medium): 100.0%
2. Grok 4: 96.9%
3. GPT-5 (medium): 96.9%
4. Gemini 2.5 Pro Exp (Mar '25): 90.6%
5. o3-pro: 88.9%
Full Results

WebDev Arena

Pits two models against each other to build the best website for a given prompt. Voters rate which site looks nicer and functions better. The results are combined using the Bradley-Terry model, producing a score for each model.
Models, ranked by score:
1. Claude Opus 4.5 (32k thinking): 1512
2. GPT-5.2 (high): 1480
3. Claude Opus 4.5 (no-thinking): 1479
4. GPT-5 (high): 1477.5
5. Claude Opus 4.1: 1472.4
Full Results
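
A minimal sketch of Bradley-Terry fitting from pairwise votes, using the classic minorization-maximization update. The vote counts are made up, and the conversion to an Elo-like scale is an arbitrary choice for readability, not the leaderboard's exact procedure.

```python
import numpy as np

# Illustrative vote table: wins[i][j] = votes preferring model i's site over model j's.
models = ["A", "B", "C"]
wins = np.array([
    [0, 8, 6],
    [4, 0, 7],
    [2, 3, 0],
], dtype=float)

# Bradley-Terry: P(i beats j) = s_i / (s_i + s_j). Fit strengths s with the
# classic minorization-maximization (Zermelo) iteration.
s = np.ones(len(models))
for _ in range(200):
    new_s = np.empty_like(s)
    for i in range(len(models)):
        denom = sum((wins[i, j] + wins[j, i]) / (s[i] + s[j])
                    for j in range(len(models)) if j != i)
        new_s[i] = wins[i].sum() / denom
    s = new_s / new_s.sum()   # fix the overall scale

# Map strengths to an Elo-like scale purely for readability (1500/400 chosen arbitrarily).
ratings = 1500 + 400 * np.log10(s / s.mean())
for name, rating in zip(models, ratings):
    print(f"{name}: {rating:.0f}")
```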

BALROG

Evaluates models on text-based games of varying difficulty (e.g., NetHack, Crafter, Baba Is AI). Scores reflect average progress toward completing each game; error bars come from repeated runs.
Models, ranked by score:
1. Gemini 3 Flash: 48.1% ±2.4
2. Grok 4: 43.6% ±2.2
3. Gemini 2.5 Pro Exp (Mar '25): 40.4% ±2.3
4. DeepSeek-R1: 34.9% ±2.2
5. Gemini 2.5 Flash: 33.5% ±2.1
Full Results

OTIS Mock AIME 2024-25

This benchmark uses 45 integer-answer problems from unofficial Mock AIME exams (2024-2025). Problems are harder than MATH Level 5 but easier than FrontierMath and have answers between 0 and 999.
Models, ranked by score:
1. GPT-5.2 (xhigh): 96.1% ±2.7
2. GPT-5.2 (high): 96.1% ±2.6
3. Gemini 3.1 Pro Preview: 95.6% ±3.1
4. Claude Opus 4.6 (64k thinking): 94.4% ±2.8
5. GPT-5.2 (medium): 93.9% ±3.1
Full Results

MATH Level 5

The Level 5 subset of the MATH dataset contains the hardest competition-style problems from AMC 10, AMC 12 and AIME. Answers are scored using a combination of normalized string match, symbolic equivalence and model-graded equivalence.
Models, ranked by score:
1. GPT-5 (high): 98.1% ±0.3
2. GPT-5 (medium): 97.9% ±0.3
3. o4-mini (high): 97.8% ±0.3
4. o3 (high): 97.8% ±0.3
5. Claude Sonnet 4.5: 97.7% ±0.4
Full Results
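
A minimal sketch of the equivalence cascade described above: a simplified string normalization, then a SymPy symbolic check; the final model-graded step is omitted and the normalization is far cruder than a real grader's.

```python
import sympy

def normalize(ans: str) -> str:
    # Very simplified normalization: strip whitespace, surrounding $ signs and a \boxed{...} wrapper.
    ans = ans.strip().strip("$").strip()
    if ans.startswith(r"\boxed{") and ans.endswith("}"):
        ans = ans[len(r"\boxed{"):-1]
    return ans.replace(" ", "")

def answers_match(predicted: str, reference: str) -> bool:
    pred, ref = normalize(predicted), normalize(reference)
    if pred == ref:                      # 1) normalized string match
        return True
    try:                                 # 2) symbolic equivalence, e.g. "1/2" vs "0.5"
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(ref)) == 0
    except (sympy.SympifyError, TypeError):
        return False                     # 3) a model-graded check would go here

print(answers_match("$\\boxed{1/2}$", "0.5"))  # True
```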

FrontierMath

Contains several hundred unpublished, expert-level mathematics problems spanning undergraduate through research-level mathematics. Problems are not multiple choice; each has a definite final answer that is verified automatically.
Models, ranked by score:
1. GPT-5.4 Pro (xhigh): 50.0% ±2.9
2. GPT-5.4 (xhigh): 47.6% ±2.9
3. Claude Opus 4.6 (max): 40.7% ±2.9
4. GPT-5.2 (xhigh): 40.7% ±2.9
5. GPT-5.2 (high): 40.3% ±2.9
Full Results

WeirdML v2

Asks models to write code that trains machine-learning models to solve non-standard tasks (e.g., recognizing shapes, classifying digits, predicting chess outcomes). Models iterate on code, training and evaluating within a constrained environment.
Models, ranked by score:
1. GPT-5.3 Codex: 79.3%
2. GPT-5.2 (xhigh): 72.2%
3. Gemini 3.1 Pro Preview: 72.1%
4. Gemini 3 Pro Preview: 69.9%
5. Claude Opus 4.6 (no thinking): 65.9%
Full Results

Terminal-Bench 2.0

Models use a terminal to complete assignments such as editing files, running commands and debugging code. The tasks come from various agent frameworks; performance is the success rate.
Models, ranked by score:
1. Gemini 3.1 Pro Preview: 78.4%
2. GPT-5.3 Codex: 77.3%
3. GPT-5.3 Codex: 75.1%
4. Claude Opus 4.6 (no thinking): 69.9%
5. GPT-5.2 (medium): 64.9%
Full Results

VPCT (Visual Physics Comprehension Test)

Each problem shows an image of a ramp with buckets; the model must predict in which bucket a ball will land. Tasks test basic understanding of gravity and motion.
Models, ranked by score:
1. Gemini 3 Pro Preview: 91.0%
2. GPT-5.2 (xhigh): 84.0%
3. Gemini 3 Flash: 72.6%
4. GPT-5.2 (high): 67.0%
5. GPT-5 (high): 66.0%
Full Results

Factorio Learning Environment

Agents must control Factorio to build factories. Tasks include a lab-play mode (three subtasks, e.g., crafting 100 transport belts) and an open-play mode requiring construction of a functioning factory.
Models, ranked by score:
1. Claude 3.7 Sonnet: 29.1
2. Claude 3.5 Sonnet (Jun '24): 28.1
3. Gemini 2.5 Pro Exp (Mar '25): 18.4
4. GPT-4o (Nov '24): 16.6
5. DeepSeek-V3: 15.1
Full Results

GeoBench

Inspired by GeoGuessr. For each of 100 street-level photos (sampled from five community maps), models must guess the country and the precise latitude/longitude. Performance is measured using geographic distance and country accuracy; tasks require visual recognition, text extraction and geographic reasoning.
Models, ranked by score:
1. Gemini 3 Pro Preview: 3893
2. Gemini 2.0 Flash Thinking Exp: 3873
3. Gemini 2.5 Pro Exp (Mar '25): 3871
4. Gemini 2.5 Pro Preview (Jun '25): 3836
5. o3 (high): 3789
Full Results
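
A minimal sketch of the distance component: the great-circle (haversine) distance between the guessed and true coordinates. The points formula below is an assumed GeoGuessr-style exponential decay, not GeoBench's published scoring.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two latitude/longitude points, in kilometres.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def points(distance_km, scale_km=1500.0, max_points=5000):
    # Assumed GeoGuessr-style decay: full points at zero distance, decaying exponentially.
    return max_points * math.exp(-distance_km / scale_km)

d = haversine_km(48.8566, 2.3522, 41.9028, 12.4964)   # Paris -> Rome
print(f"{d:.0f} km, {points(d):.0f} points")
```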