AI Model Benchmarks Nov 2025

20 of the world's most-followed AI benchmarks, curated by AI Explained, author of SimpleBench

These benchmarks are run independently by Epoch, Scale and others, so scores may not match the figures self-reported by AI orgs.

Humanity's Last Exam

Knowledge / Reasoning
2,500 of the toughest, subject-diverse, multi-modal questions designed to test for both depth of reasoning and breadth of knowledge. Created in partnership with the Center for AI Safety, HLE includes questions across mathematics, humanities, and natural sciences from nearly 1,000 expert contributors.
Models (no tools) | Score
1. GPT-5 (August '25) | 25.32% ±1.70
2. Kimi K2 Thinking | 23.9% ±1.61
3. Gemini 2.5 Pro Preview (June '25) | 21.64% ±1.61
4. o3 (high) (April '25) | 20.32% ±1.58
5. GPT-5 Mini (August '25) | 19.44% ±1.55
Full Results
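
The page does not state how the ± figures are computed. For the GPT-5 row above, ±1.70 on 25.32% over 2,500 questions matches a 95% normal-approximation binomial interval, so the sketch below assumes that; the benchmark runners may instead use bootstrapping or a different interval.

```python
import math

def accuracy_ci(num_correct: int, num_questions: int, z: float = 1.96) -> tuple[float, float]:
    """Accuracy (%) and half-width (%) of a normal-approximation interval.

    Assumes independently scored questions, so the standard error of the
    accuracy is sqrt(p * (1 - p) / n); z = 1.96 gives a ~95% interval.
    """
    p = num_correct / num_questions
    half_width = z * math.sqrt(p * (1 - p) / num_questions)
    return 100 * p, 100 * half_width

# 633 of 2,500 correct reproduces the GPT-5 row above: 25.32% ±1.70.
acc, ci = accuracy_ci(633, 2500)
print(f"{acc:.2f}% ±{ci:.2f}")
```

The same calculation is a quick sanity check on any leaderboard row where the question count is known.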

SimpleBench

Reasoning / Common-sense
Asks "trick" questions that require common-sense reasoning rather than memorized facts. Models must avoid being misled by common traps.
Model | Score
1. Gemini 2.5 Pro Preview (Jun '25) | 62.4%
2. GPT-5 Pro | 61.6%
3. Grok 4 | 60.5%
4. Claude Opus 4.1 | 60.0%
5. Claude Opus 4 | 58.8%
Full Results

METR Time Horizons

Software Engineering
METR's time horizon is the human task duration at which an AI model reaches a 50% success rate. Tasks are drawn from RE-Bench, HCAST and SWAA, which cover machine-learning research engineering, general software engineering and software operations respectively. A toy sketch of how the 50% horizon is read off a fitted success curve follows the table.
Model | Minutes
1. GPT-5 (medium) | 137.3 ±102.1
2. Claude Sonnet 4.5 | 113.3 ±91.4
3. Grok 4 | 110.1 ±91.8
4. Claude Opus 4.1 | 105.5 ±69.2
5. o3 (medium) | 91.3 ±58.8
Full Results
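
The 50% horizon is read off a fitted curve of success probability against task length. A toy sketch of that idea with invented data, assuming a plain logistic fit on log2 of the human task duration; METR's actual analysis is more involved (many tasks and attempts per model, weighting, bootstrapped error bars), so treat this only as an illustration of the definition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented per-attempt records: human task length (minutes) and whether the
# model's attempt succeeded. Real runs use RE-Bench, HCAST and SWAA tasks.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960])
succeeded    = np.array([1, 1, 1, 1, 1,  1,  0,  1,   0,   0,   0])

# Fit P(success) as a logistic function of log2(task length).
X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression(C=1e6).fit(X, succeeded)  # effectively unregularized

# The 50% time horizon is where the fitted curve crosses 0.5, i.e. where
# intercept + slope * log2(minutes) = 0.
slope, intercept = clf.coef_[0, 0], clf.intercept_[0]
print(f"50% time horizon ≈ {2 ** (-intercept / slope):.0f} minutes")
```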

SWE-bench Verified

Software Engineering
A human-curated subset of 500 GitHub issues from the SWE-bench dataset tests whether models can implement valid code fixes. The model interacts with a Python repository and must modify the correct files to fix the issue. The solution is judged by running unit tests; a sketch of that check follows the table.
Model | Score
1. Claude Sonnet 4.5 | 64.8% ±2.1
2. Claude Opus 4.1 | 63.2% ±2.2
3. Claude Opus 4 | 62.2% ±2.2
4. Claude Sonnet 4 | 60.6% ±2.2
5. Claude Haiku 4.5 | 60.6% ±2.2
Full Results
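
A rough sketch of the judging step: apply the model's patch, rerun the issue's tests, and count the instance as resolved only if they all pass. The paths, test IDs and helper name below are hypothetical; the official harness runs each instance in its own Docker image and distinguishes FAIL_TO_PASS from PASS_TO_PASS tests.

```python
import subprocess

def issue_resolved(repo_dir: str, patch_file: str, test_ids: list[str]) -> bool:
    """Apply a candidate patch and rerun the issue's unit tests (illustrative)."""
    # The patch must apply cleanly to the checked-out repository.
    if subprocess.run(["git", "apply", patch_file], cwd=repo_dir).returncode != 0:
        return False
    # Resolved only if every selected test passes after the patch.
    tests = subprocess.run(["python", "-m", "pytest", *test_ids], cwd=repo_dir)
    return tests.returncode == 0

# Hypothetical example:
# issue_resolved("./astropy", "prediction.patch",
#                ["astropy/io/fits/tests/test_header.py::test_invalid_keyword"])
```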

GPQA Diamond

Science and Reasoning
A multiple-choice set of 198 PhD-level science questions in biology, chemistry and physics. It focuses on "Diamond" items for which domain experts answered correctly while non-experts often failed; random guessing yields ~25%.
Model | Score
1. Grok 4 | 87.0% ±2.0
2. GPT-5 (high) | 86.2% ±2.1
3. GPT-5 (medium) | 85.4% ±2.1
4. Gemini 2.5 Pro Preview (Jun '25) | 84.8% ±2.6
5. Gemini 2.5 Pro Exp (Mar '25) | 83.8% ±2.6
Full Results

GDPval

Knowledge / Reasoning
GDPval is a new OpenAI-led benchmark spanning 44 knowledge work occupations, selected from the top 9 industries contributing to U.S. GDP, from software developers and lawyers to registered nurses and mechanical engineers. These occupations represent the types of day-to-day work where AI can meaningfully assist professionals.
Model | Score
1. Claude Opus 4.1 | 43.6%
2. GPT-5 (high) | 34.8%
3. GPT-5 (medium) | 33.9%
4. GPT-5 (low) | 31.9%
5. o3 (high) | 30.8%
Full Results

GSO (General Speedup Optimization)

Software Engineering
Assesses models' ability to optimize software performance. Each task requires making code changes within a limited number of attempts. Performance is measured with OPT@K: the percentage of tasks where the model achieves at least 95% of the human-achieved speedup. A sketch of the computation follows the table.
Model | Score
1. o3 (high) | 8.8%
2. Claude Opus 4 | 6.9%
3. GPT-5 (high) | 6.9%
4. Claude Sonnet 4 | 4.9%
5. Kimi K2 | 4.9%
Full Results
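
A minimal sketch of the OPT@K computation on invented numbers, assuming the benchmark takes the best of a model's K attempts per task and compares it against the human reference speedup:

```python
def opt_at_k(model_speedups: dict[str, list[float]],
             human_speedups: dict[str, float],
             threshold: float = 0.95) -> float:
    """Percentage of tasks where the best of K attempts reaches `threshold`
    times the human-achieved speedup (illustrative definition only)."""
    solved = sum(
        max(attempts) >= threshold * human_speedups[task]
        for task, attempts in model_speedups.items()
    )
    return 100 * solved / len(model_speedups)

# Invented data: two tasks, two attempts each, one human baseline per task.
print(opt_at_k({"taskA": [1.4, 2.1], "taskB": [1.0, 1.1]},
               {"taskA": 2.0, "taskB": 3.0}))  # -> 50.0
```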

DeepResearchBench

Search
Tasks require gathering and synthesizing information from the web. Models must plan search queries and extract data from static web snapshots, covering categories like "Finding numbers", "Compiling data" and "Causal inference".
Model | Score
1. Claude Sonnet 4.5 | 57.7%
2. GPT-5 (low) | 57.4%
3. Claude Opus 4.1 | 56.4%
4. Claude Opus 4 | 56.3%
5. GPT-5 (medium) | 56.0%
Full Results

WebDev Arena

Multimodal
Pits two models against each other to build the best website for a given prompt. Voters rate which site looks nicer and functions better, and the votes are combined using the Bradley-Terry model to produce a score for each model. A sketch of a Bradley-Terry fit follows the table.
Model | Score
1. GPT-5 (high) | 1477.5
2. Claude Opus 4.1 | 1472.4
3. Claude Opus 4.1 | 1462.3
4. Claude Sonnet 4.5 (32k thinking) | 1420.8
5. Gemini 2.5 Pro | 1401.0
Full Results
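
Bradley-Terry gives each model a positive strength p_i such that the probability that model i beats model j is p_i / (p_i + p_j); the arena fits these strengths to the recorded votes and reports them on an Elo-like scale. A minimal sketch on invented vote counts using the classic minorization-maximization update; the 1,000-point offset and 400/ln 10 scale are assumptions about the display convention, and the production pipeline also handles ties and reports confidence intervals.

```python
import numpy as np

def bradley_terry_ratings(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i, j] = number of votes preferring model i over model j.
    Uses the standard minorization-maximization update, then maps the
    strengths onto an Elo-like scale (assumed, not the arena's exact one).
    """
    n = wins.shape[0]
    games = wins + wins.T                 # total comparisons between each pair
    strength = np.ones(n)
    for _ in range(iters):
        denom = np.array([
            sum(games[i, j] / (strength[i] + strength[j]) for j in range(n) if j != i)
            for i in range(n)
        ])
        strength = wins.sum(axis=1) / denom
        strength /= np.exp(np.log(strength).mean())   # pin the overall scale
    return 1000 + 400 / np.log(10) * np.log(strength)

# Invented vote counts among three models (row model beat column model).
votes = np.array([[0, 30, 40],
                  [20, 0, 35],
                  [10, 15, 0]])
print(np.round(bradley_terry_ratings(votes)))
```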

BALROG

Gaming
Evaluates models on text-based games of varying difficulty (e.g., NetHack, Crafter). Scores reflect average progress across the games; error bars come from repeated runs.
Model | Score
1. Grok 4 | 43.6% ±2.2
2. Gemini 2.5 Pro Exp (Mar '25) | 40.4% ±2.3
3. DeepSeek-R1 | 34.9% ±2.2
4. Gemini 2.5 Flash | 33.5% ±2.1
5. GPT-5 (minimal) | 32.8% ±2.2
Full Results

OTIS Mock AIME 2024-25

Mathematics
This benchmark uses 45 integer-answer problems from unofficial Mock AIME exams (2024-2025). Problems are harder than MATH Level 5 but easier than FrontierMath and have answers between 0 and 999.
Model | Score
1. GPT-5 (high) | 91.4% ±3.8
2. GPT-5 (medium) | 87.2% ±3.9
3. Grok 4 | 84.0% ±5.0
4. o3 (high) | 83.9% ±4.4
5. o4-mini (high) | 81.7% ±4.7
Full Results

MATH Level 5

Mathematics
The Level 5 subset of the MATH dataset contains the hardest competition-style problems from AMC 10, AMC 12 and AIME. Answers are scored using a combination of normalized string match, symbolic equivalence and model-graded equivalence; a sketch of the symbolic-equivalence step follows the table.
Model | Score
1. GPT-5 (high) | 98.1% ±0.3
2. GPT-5 (medium) | 97.9% ±0.3
3. o4-mini (high) | 97.8% ±0.3
4. o3 (high) | 97.8% ±0.3
5. Claude Sonnet 4.5 | 97.7% ±0.4
Full Results
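
The grading described above can be approximated in a few lines: compare normalized strings first, then fall back to symbolic equivalence; the model-graded step (for answers SymPy cannot handle) is omitted here. A rough sketch assuming answers arrive as plain expressions rather than raw LaTeX:

```python
from sympy import simplify, sympify
from sympy.core.sympify import SympifyError

def answers_match(predicted: str, reference: str) -> bool:
    """Illustrative two-stage check: normalized string match, then SymPy."""
    norm = lambda s: s.strip().lower().replace(" ", "")
    if norm(predicted) == norm(reference):
        return True
    try:
        # sympify parses plain math strings; a real grader also strips LaTeX.
        return simplify(sympify(predicted) - sympify(reference)) == 0
    except (SympifyError, TypeError):
        return False  # unparsable answers would fall through to a model grader

print(answers_match("1/2", "0.5"))            # True via symbolic equivalence
print(answers_match("2*sqrt(2)", "sqrt(8)"))  # True
```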

FrontierMath

Mathematics
Contains several hundred unpublished, expert-level mathematics problems spanning undergraduate through research-level mathematics. Problems are open-ended rather than multiple-choice, and each has a definite final answer that can be verified automatically.
Model | Score
1. Gemini 2.5 Deep Think | 29.0% ±2.7
2. GPT-5 (high) | 26.6% ±2.6
3. GPT-5 (medium) | 24.8% ±2.5
4. GPT-5 mini (high) | 19.7% ±2.3
5. GPT-5 mini (medium) | 19.3% ±2.3
Full Results

WeirdML v2

Machine Learning
Asks models to write code that trains machine-learning models to solve non-standard tasks (e.g., recognizing shapes, classifying digits, predicting chess outcomes). Models iterate on code, training and evaluating within a constrained environment.
Model | Score
1. Gemini 2.5 Pro Exp (Mar '25) | 61.1%
2. GPT-4.5 Preview (Feb '25) | 60.3%
3. o4-mini (high) | 59.7%
4. o1 (high) | 59.5%
5. Claude 3.7 Sonnet (8k thinking) | 55.7%
Full Results

Fiction.liveBench

Long Context
Models must answer questions about long serialized stories hosted on Fiction.live. Questions probe recall of events, characters and chronological order. The benchmark emphasizes long-context comprehension.
Model | Score
1. o3 (medium) | 100.0%
2. Grok 4 | 96.9%
3. GPT-5 (medium) | 96.9%
4. Gemini 2.5 Pro Exp (Mar '25) | 90.6%
5. o3-pro | 88.9%
Full Results

Terminal-Bench

Software Engineering
Models use a terminal to complete assignments such as editing files, running commands and debugging code. The tasks come from various agent frameworks; performance is the success rate.
Model | Score
1. Claude Sonnet 4.5 (no thinking) | 60.3%
2. Claude Opus 4.1 | 58.8%
3. Claude Sonnet 4 | 54.8%
4. GPT-5 (medium) | 52.5%
5. Claude Opus 4 | 45.3%
Full Results

VPCT (Visual Physics Comprehension Test)

Science and Reasoning
Each problem shows an image of a ramp with buckets; the model must predict in which bucket a ball will land. Tasks test basic understanding of gravity and motion.
Model | Score
1. GPT-5 (high) | 66.0%
2. GPT-5 (medium) | 63.2%
3. o4-mini (medium) | 57.5%
4. o3 (medium) | 52.0%
5. Gemini 2.5 Pro Preview (Mar '25) | 48.0%
Full Results

Factorio Learning Environment

Gaming
Agents must control Factorio to build factories. Tasks include a lab-play mode (three subtasks, e.g., crafting 100 transport belts) and an open-play mode requiring construction of a functioning factory.
Model | Score
1. Claude 3.7 Sonnet | 29.1
2. Claude 3.5 Sonnet (Jun '24) | 28.1
3. Gemini 2.5 Pro Exp (Mar '25) | 18.4
4. GPT-4o (Nov '24) | 16.6
5. DeepSeek-V3 | 15.1
Full Results

GeoBench

Multimodal
Inspired by GeoGuessr. For each of 100 street-level photos (sampled from five community maps), models must guess the country and the precise latitude/longitude. Performance is measured using geographic distance and country accuracy; tasks require visual recognition, text extraction and geographic reasoning. A sketch of distance-based scoring follows the table.
Model | Score
1. Gemini 2.0 Flash Thinking Exp | 3873
2. Gemini 2.5 Pro Exp (Mar '25) | 3871
3. Gemini 2.5 Pro Preview (Jun '25) | 3836
4. o3 (high) | 3789
5. Gemini 2.0 Flash (Feb '25) | 3659
Full Results
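
Distance-based scoring starts from the great-circle distance between the guessed and true coordinates. The sketch below uses the haversine formula plus a GeoGuessr-style exponential point decay; the 5,000-point maximum and 1,492.7 km scale are a commonly cited approximation of GeoGuessr's world-map scoring, and are only an assumption about how GeoBench converts distance into points.

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between the guess and the true location."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlam = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))   # mean Earth radius ≈ 6371 km

def round_points(distance_km: float, scale_km: float = 1492.7) -> float:
    """GeoGuessr-style decay: 5000 points at 0 km (assumed constants)."""
    return 5000 * math.exp(-distance_km / scale_km)

# Guessing Paris (48.86, 2.35) when the photo was taken in Berlin (52.52, 13.40):
d = haversine_km(48.86, 2.35, 52.52, 13.40)
print(f"{d:.0f} km -> {round_points(d):.0f} points")
```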

Aider Polyglot

Coding
Problems are drawn from Exercism's challenging exercises in C++, Go, Java, JavaScript, Python and Rust. Each model gets two attempts per problem: if the first attempt fails, the model sees the test errors and may edit its code once more. Pass rate after the second attempt is the primary metric; a sketch of the two-attempt loop follows the table.
Model | Score
1. GPT-5 (high) | 88.0%
2. GPT-5 (medium) | 86.7%
3. o3-pro | 84.9%
4. Gemini 2.5 Pro Preview (Jun '25) | 83.1%
5. GPT-5 (low) | 81.3%
Full Results
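
The two-attempt protocol reduces to a short loop: run the tests after the first edit, and if they fail, hand the test output back for one final edit. A schematic sketch in which `ask_model` and `run_tests` are hypothetical stand-ins for Aider's editing session and the Exercism test harness:

```python
from typing import Callable, Tuple

def solve_with_retry(problem: str,
                     ask_model: Callable[[str], None],
                     run_tests: Callable[[], Tuple[bool, str]]) -> bool:
    """Aider-polyglot-style loop: two attempts, test errors fed back once."""
    ask_model(problem)                      # first attempt edits the solution files
    passed, test_output = run_tests()
    if passed:
        return True                         # solved on the first try

    # Second and final attempt: the failing test output is the only new input.
    ask_model(f"The tests failed:\n{test_output}\nFix the code.")
    passed, _ = run_tests()
    return passed                           # pass rate after this attempt is the metric
```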