AI Model Benchmarks Nov 2025

20 of the world's most-followed AI benchmarks, curated by AI Explained, author of SimpleBench

These benchmarks are run independently by Epoch, Scale and others, so scores may not match the figures self-reported by AI orgs.

Humanity's Last Exam

Knowledge / Reasoning
2,500 of the toughest, subject-diverse, multi-modal questions designed to test for both depth of reasoning and breadth of knowledge. Created in partnership with the Center for AI Safety, HLE includes questions across mathematics, humanities, and natural sciences from nearly 1,000 expert contributors.
Models (no tools) | Score
1. GPT-5 (August '25) | 25.32% ±1.70
2. Kimi K2 Thinking | 23.9% ±1.61
3. Gemini 2.5 Pro Preview (June '25) | 21.64% ±1.61
4. o3 (high) (April '25) | 20.32% ±1.58
5. GPT-5 Mini (August '25) | 19.44% ±1.55
Full Results
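
The page does not state how the ± figures are computed. For the GPT-5 row above, ±1.70 on 25.32% over 2,500 questions matches a 95% normal-approximation binomial interval, so the sketch below assumes that; the benchmark runners may instead use bootstrapping or a different interval.

```python
import math

def accuracy_ci(num_correct: int, num_questions: int, z: float = 1.96) -> tuple[float, float]:
    """Accuracy (%) and half-width (%) of a normal-approximation interval.

    Assumes independently scored questions, so the standard error of the
    accuracy is sqrt(p * (1 - p) / n); z = 1.96 gives a ~95% interval.
    """
    p = num_correct / num_questions
    half_width = z * math.sqrt(p * (1 - p) / num_questions)
    return 100 * p, 100 * half_width

# 633 of 2,500 correct reproduces the GPT-5 row above: 25.32% ±1.70.
acc, ci = accuracy_ci(633, 2500)
print(f"{acc:.2f}% ±{ci:.2f}")
```

The same calculation is a quick sanity check on any leaderboard row where the question count is known.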

SimpleBench

Reasoning / Common-sense
Asks "trick" questions that require common-sense reasoning rather than memorized facts. Models must avoid being misled by common traps.
Model | Score
1. Gemini 2.5 Pro Preview (Jun '25) | 62.4%
2. GPT-5 Pro | 61.6%
3. Grok 4 | 60.5%
4. Claude Opus 4.1 | 60.0%
5. Claude Opus 4 | 58.8%
Full Results

METR Time Horizons

Software Engineering
METR's time horizon is the human task duration at which an AI model reaches a 50% success rate. Tasks are drawn from RE-Bench, HCAST and SWAA, which cover machine-learning research engineering, general software engineering and software operations respectively. A toy sketch of how the 50% horizon is read off a fitted success curve follows the table.
Model | Minutes
1. GPT-5 (medium) | 137.3 ±102.1
2. Claude Sonnet 4.5 | 113.3 ±91.4
3. Grok 4 | 110.1 ±91.8
4. Claude Opus 4.1 | 105.5 ±69.2
5. o3 (medium) | 91.3 ±58.8
Full Results
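
The 50% horizon is read off a fitted curve of success probability against task length. A toy sketch of that idea with invented data, assuming a plain logistic fit on log2 of the human task duration; METR's actual analysis is more involved (many tasks and attempts per model, weighting, bootstrapped error bars), so treat this only as an illustration of the definition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented per-attempt records: human task length (minutes) and whether the
# model's attempt succeeded. Real runs use RE-Bench, HCAST and SWAA tasks.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960])
succeeded    = np.array([1, 1, 1, 1, 1,  1,  0,  1,   0,   0,   0])

# Fit P(success) as a logistic function of log2(task length).
X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression(C=1e6).fit(X, succeeded)  # effectively unregularized

# The 50% time horizon is where the fitted curve crosses 0.5, i.e. where
# intercept + slope * log2(minutes) = 0.
slope, intercept = clf.coef_[0, 0], clf.intercept_[0]
print(f"50% time horizon ≈ {2 ** (-intercept / slope):.0f} minutes")
```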

SWE-bench Verified

Software Engineering
A human-curated subset of 500 GitHub issues from the SWE-bench dataset tests whether models can implement valid code fixes. The model interacts with a Python repository and must modify the correct files to fix the issue. The solution is judged by running unit tests; a sketch of that check follows the table.
Model | Score
1. Claude Sonnet 4.5 | 64.8% ±2.1
2. Claude Opus 4.1 | 63.2% ±2.2
3. Claude Opus 4 | 62.2% ±2.2
4. Claude Sonnet 4 | 60.6% ±2.2
5. Claude Haiku 4.5 | 60.6% ±2.2
Full Results
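
A rough sketch of the judging step: apply the model's patch, rerun the issue's tests, and count the instance as resolved only if they all pass. The paths, test IDs and helper name below are hypothetical; the official harness runs each instance in its own Docker image and distinguishes FAIL_TO_PASS from PASS_TO_PASS tests.

```python
import subprocess

def issue_resolved(repo_dir: str, patch_file: str, test_ids: list[str]) -> bool:
    """Apply a candidate patch and rerun the issue's unit tests (illustrative)."""
    # The patch must apply cleanly to the checked-out repository.
    if subprocess.run(["git", "apply", patch_file], cwd=repo_dir).returncode != 0:
        return False
    # Resolved only if every selected test passes after the patch.
    tests = subprocess.run(["python", "-m", "pytest", *test_ids], cwd=repo_dir)
    return tests.returncode == 0

# Hypothetical example:
# issue_resolved("./astropy", "prediction.patch",
#                ["astropy/io/fits/tests/test_header.py::test_invalid_keyword"])
```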

GPQA Diamond

Science and Reasoning
A multiple-choice set of 198 PhD-level science questions in biology, chemistry and physics. It focuses on "Diamond" items for which domain experts answered correctly while non-experts often failed; random guessing yields ~25%.
Model | Score
1. Grok 4 | 87.0% ±2.0
2. GPT-5 (high) | 86.2% ±2.1
3. GPT-5 (medium) | 85.4% ±2.1
4. Gemini 2.5 Pro Preview (Jun '25) | 84.8% ±2.6
5. Gemini 2.5 Pro Exp (Mar '25) | 83.8% ±2.6
Full Results

GDPval

Knowledge / Reasoning
GDPval is a new OpenAI-led benchmark spanning 44 knowledge work occupations, selected from the top 9 industries contributing to U.S. GDP, from software developers and lawyers to registered nurses and mechanical engineers. These occupations represent the types of day-to-day work where AI can meaningfully assist professionals.
Model | Score
1. Claude Opus 4.1 | 43.6%
2. GPT-5 (high) | 34.8%
3. GPT-5 (medium) | 33.9%
4. GPT-5 (low) | 31.9%
5. o3 (high) | 30.8%
Full Results

GSO (General Speedup Optimization)

Software Engineering
Assesses models' ability to optimize software performance. Each task requires making code changes within a limited number of attempts. Performance is measured with OPT@K: the percentage of tasks where the model achieves at least 95% of the human-achieved speedup. A sketch of the computation follows the table.
Model | Score
1. o3 (high) | 8.8%
2. Claude Opus 4 | 6.9%
3. GPT-5 (high) | 6.9%
4. Claude Sonnet 4 | 4.9%
5. Kimi K2 | 4.9%
Full Results
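
A minimal sketch of the OPT@K computation on invented numbers, assuming the benchmark takes the best of a model's K attempts per task and compares it against the human reference speedup:

```python
def opt_at_k(model_speedups: dict[str, list[float]],
             human_speedups: dict[str, float],
             threshold: float = 0.95) -> float:
    """Percentage of tasks where the best of K attempts reaches `threshold`
    times the human-achieved speedup (illustrative definition only)."""
    solved = sum(
        max(attempts) >= threshold * human_speedups[task]
        for task, attempts in model_speedups.items()
    )
    return 100 * solved / len(model_speedups)

# Invented data: two tasks, two attempts each, one human baseline per task.
print(opt_at_k({"taskA": [1.4, 2.1], "taskB": [1.0, 1.1]},
               {"taskA": 2.0, "taskB": 3.0}))  # -> 50.0
```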

DeepResearchBench

Search
Tasks require gathering and synthesizing information from the web. Models must plan search queries and extract data from static web snapshots, covering categories like "Finding numbers", "Compiling data" and "Causal inference".
Model | Score
1. Claude Sonnet 4.5 | 57.7%
2. GPT-5 (low) | 57.4%
3. Claude Opus 4.1 | 56.4%
4. Claude Opus 4 | 56.3%
5. GPT-5 (medium) | 56.0%
Full Results

WebDev Arena

Multimodal
Pits two models against each other to build the best website for a given prompt. Voters rate which site looks nicer and functions better, and the votes are combined using the Bradley-Terry model to produce a score for each model. A sketch of a Bradley-Terry fit follows the table.
Model | Score
1. GPT-5 (high) | 1477.5
2. Claude Opus 4.1 | 1472.4
3. Claude Opus 4.1 | 1462.3
4. Claude Sonnet 4.5 (32k thinking) | 1420.8
5. Gemini 2.5 Pro | 1401.0
Full Results
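
Bradley-Terry gives each model a positive strength p_i such that the probability that model i beats model j is p_i / (p_i + p_j); the arena fits these strengths to the recorded votes and reports them on an Elo-like scale. A minimal sketch on invented vote counts using the classic minorization-maximization update; the 1,000-point offset and 400/ln 10 scale are assumptions about the display convention, and the production pipeline also handles ties and reports confidence intervals.

```python
import numpy as np

def bradley_terry_ratings(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i, j] = number of votes preferring model i over model j.
    Uses the standard minorization-maximization update, then maps the
    strengths onto an Elo-like scale (assumed, not the arena's exact one).
    """
    n = wins.shape[0]
    games = wins + wins.T                 # total comparisons between each pair
    strength = np.ones(n)
    for _ in range(iters):
        denom = np.array([
            sum(games[i, j] / (strength[i] + strength[j]) for j in range(n) if j != i)
            for i in range(n)
        ])
        strength = wins.sum(axis=1) / denom
        strength /= np.exp(np.log(strength).mean())   # pin the overall scale
    return 1000 + 400 / np.log(10) * np.log(strength)

# Invented vote counts among three models (row model beat column model).
votes = np.array([[0, 30, 40],
                  [20, 0, 35],
                  [10, 15, 0]])
print(np.round(bradley_terry_ratings(votes)))
```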

BALROG

Gaming
Evaluates models on text-based games of varying difficulty (e.g., NetHack, Crafter). Scores reflect average progress across the games; error bars come from repeated runs.
Model | Score
1. Grok 4 | 43.6% ±2.2
2. Gemini 2.5 Pro Exp (Mar '25) | 40.4% ±2.3
3. DeepSeek-R1 | 34.9% ±2.2
4. Gemini 2.5 Flash | 33.5% ±2.1
5. GPT-5 (minimal) | 32.8% ±2.2
Full Results

OTIS Mock AIME 2024-25

Mathematics
This benchmark uses 45 integer-answer problems from unofficial Mock AIME exams (2024-2025). Problems are harder than MATH Level 5 but easier than FrontierMath and have answers between 0 and 999.
Model | Score
1. GPT-5 (high) | 91.4% ±3.8
2. GPT-5 (medium) | 87.2% ±3.9
3. Grok 4 | 84.0% ±5.0
4. o3 (high) | 83.9% ±4.4
5. o4-mini (high) | 81.7% ±4.7
Full Results

MATH Level 5

Mathematics
The Level 5 subset of the MATH dataset contains the hardest competition-style problems from AMC 10, AMC 12 and AIME. Answers are scored using a combination of normalized string match, symbolic equivalence and model-graded equivalence; a sketch of the symbolic-equivalence step follows the table.
Model | Score
1. GPT-5 (high) | 98.1% ±0.3
2. GPT-5 (medium) | 97.9% ±0.3
3. o4-mini (high) | 97.8% ±0.3
4. o3 (high) | 97.8% ±0.3
5. Claude Sonnet 4.5 | 97.7% ±0.4
Full Results
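
The grading described above can be approximated in a few lines: compare normalized strings first, then fall back to symbolic equivalence; the model-graded step (for answers SymPy cannot handle) is omitted here. A rough sketch assuming answers arrive as plain expressions rather than raw LaTeX:

```python
from sympy import simplify, sympify
from sympy.core.sympify import SympifyError

def answers_match(predicted: str, reference: str) -> bool:
    """Illustrative two-stage check: normalized string match, then SymPy."""
    norm = lambda s: s.strip().lower().replace(" ", "")
    if norm(predicted) == norm(reference):
        return True
    try:
        # sympify parses plain math strings; a real grader also strips LaTeX.
        return simplify(sympify(predicted) - sympify(reference)) == 0
    except (SympifyError, TypeError):
        return False  # unparsable answers would fall through to a model grader

print(answers_match("1/2", "0.5"))            # True via symbolic equivalence
print(answers_match("2*sqrt(2)", "sqrt(8)"))  # True
```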

FrontierMath

Mathematics
Contains several hundred unpublished, expert-level mathematics problems spanning undergraduate through research-level mathematics. Problems are open-ended rather than multiple-choice, and each has a definite final answer that can be verified automatically.
Model | Score
1. Gemini 2.5 Deep Think | 29.0% ±2.7
2. GPT-5 (high) | 26.6% ±2.6
3. GPT-5 (medium) | 24.8% ±2.5
4. GPT-5 mini (high) | 19.7% ±2.3
5. GPT-5 mini (medium) | 19.3% ±2.3
Full Results

WeirdML v2

Machine Learning
Asks models to write code that trains machine-learning models to solve non-standard tasks (e.g., recognizing shapes, classifying digits, predicting chess outcomes). Models iterate on code, training and evaluating within a constrained environment.
Model | Score
1. Gemini 2.5 Pro Exp (Mar '25) | 61.1%
2. GPT-4.5 Preview (Feb '25) | 60.3%
3. o4-mini (high) | 59.7%
4. o1 (high) | 59.5%
5. Claude 3.7 Sonnet (8k thinking) | 55.7%
Full Results

Fiction.liveBench

Long Context
Models must answer questions about long serialized stories hosted on Fiction.live. Questions probe recall of events, characters and chronological order. The benchmark emphasizes long-context comprehension.
Model | Score
1. o3 (medium) | 100.0%
2. Grok 4 | 96.9%
3. GPT-5 (medium) | 96.9%
4. Gemini 2.5 Pro Exp (Mar '25) | 90.6%
5. o3-pro | 88.9%
Full Results

Terminal-Bench

Software Engineering
Models use a terminal to complete assignments such as editing files, running commands and debugging code. The tasks come from various agent frameworks; performance is the success rate.
Model | Score
1. Claude Sonnet 4.5 (no thinking) | 60.3%
2. Claude Opus 4.1 | 58.8%
3. Claude Sonnet 4 | 54.8%
4. GPT-5 (medium) | 52.5%
5. Claude Opus 4 | 45.3%
Full Results

VPCT (Visual Physics Comprehension Test)

Science and Reasoning
Each problem shows an image of a ramp with buckets; the model must predict in which bucket a ball will land. Tasks test basic understanding of gravity and motion.
Model | Score
1. GPT-5 (high) | 66.0%
2. GPT-5 (medium) | 63.2%
3. o4-mini (medium) | 57.5%
4. o3 (medium) | 52.0%
5. Gemini 2.5 Pro Preview (Mar '25) | 48.0%
Full Results

Factorio Learning Environment

Gaming
Agents must control Factorio to build factories. Tasks include a lab-play mode (three subtasks, e.g., crafting 100 transport belts) and an open-play mode requiring construction of a functioning factory.
Model | Score
1. Claude 3.7 Sonnet | 29.1
2. Claude 3.5 Sonnet (Jun '24) | 28.1
3. Gemini 2.5 Pro Exp (Mar '25) | 18.4
4. GPT-4o (Nov '24) | 16.6
5. DeepSeek-V3 | 15.1
Full Results

GeoBench

Multimodal
Inspired by GeoGuessr. For each of 100 street-level photos (sampled from five community maps), models must guess the country and the precise latitude/longitude. Performance is measured using geographic distance and country accuracy; tasks require visual recognition, text extraction and geographic reasoning. A sketch of distance-based scoring follows the table.
Model | Score
1. Gemini 2.0 Flash Thinking Exp | 3873
2. Gemini 2.5 Pro Exp (Mar '25) | 3871
3. Gemini 2.5 Pro Preview (Jun '25) | 3836
4. o3 (high) | 3789
5. Gemini 2.0 Flash (Feb '25) | 3659
Full Results
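
Distance-based scoring starts from the great-circle distance between the guessed and true coordinates. The sketch below uses the haversine formula plus a GeoGuessr-style exponential point decay; the 5,000-point maximum and 1,492.7 km scale are a commonly cited approximation of GeoGuessr's world-map scoring, and are only an assumption about how GeoBench converts distance into points.

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between the guess and the true location."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlam = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))   # mean Earth radius ≈ 6371 km

def round_points(distance_km: float, scale_km: float = 1492.7) -> float:
    """GeoGuessr-style decay: 5000 points at 0 km (assumed constants)."""
    return 5000 * math.exp(-distance_km / scale_km)

# Guessing Paris (48.86, 2.35) when the photo was taken in Berlin (52.52, 13.40):
d = haversine_km(48.86, 2.35, 52.52, 13.40)
print(f"{d:.0f} km -> {round_points(d):.0f} points")
```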

Aider Polyglot

Coding
Problems are drawn from Exercism's challenging exercises in C++, Go, Java, JavaScript, Python and Rust. Each model gets two attempts per problem: if the first attempt fails, the model sees the test errors and may edit its code once more. Pass rate after the second attempt is the primary metric; a sketch of the two-attempt loop follows the table.
Model | Score
1. GPT-5 (high) | 88.0%
2. GPT-5 (medium) | 86.7%
3. o3-pro | 84.9%
4. Gemini 2.5 Pro Preview (Jun '25) | 83.1%
5. GPT-5 (low) | 81.3%
Full Results
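
The two-attempt protocol reduces to a short loop: run the tests after the first edit, and if they fail, hand the test output back for one final edit. A schematic sketch in which `ask_model` and `run_tests` are hypothetical stand-ins for Aider's editing session and the Exercism test harness:

```python
from typing import Callable, Tuple

def solve_with_retry(problem: str,
                     ask_model: Callable[[str], None],
                     run_tests: Callable[[], Tuple[bool, str]]) -> bool:
    """Aider-polyglot-style loop: two attempts, test errors fed back once."""
    ask_model(problem)                      # first attempt edits the solution files
    passed, test_output = run_tests()
    if passed:
        return True                         # solved on the first try

    # Second and final attempt: the failing test output is the only new input.
    ask_model(f"The tests failed:\n{test_output}\nFix the code.")
    passed, _ = run_tests()
    return passed                           # pass rate after this attempt is the metric
```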