Knowledge & Reasoning Benchmarks
Updated 2026-04-14
Introduction: How Are Reasoning Claims Actually Measured?
When OpenAI released o1 and announced “83% on AIME 2024,” what does that number actually mean? When Google says Gemini 2.5 Pro exceeds 90% on GPQA Diamond, does that imply AI has reached PhD-level scientific reasoning?
In the previous article, we established a high-level framework for benchmark evaluation. Now it is time to dive deep into knowledge and reasoning — the most fundamental and most widely cited evaluation dimension. This article will answer:
- What are the main knowledge and reasoning benchmarks? What does each one actually measure?
- From 2018 to 2025, what does the saturation trend look like?
- Why did MMLU-Pro expand the number of options from 4 to 10? How does CoT evaluation differ?
- Why is GPQA Diamond called “Google-Proof”? What is its adversarial validation process?
Knowledge Benchmarks: The Full Landscape
Knowledge benchmarks test how much factual knowledge a model has mastered — from humanities and social sciences to STEM, from undergraduate level to graduate level.
MMLU (2021)
Measuring Massive Multitask Language Understanding (Hendrycks et al., 2021) is the most influential knowledge evaluation benchmark. It contains 15,908 four-option multiple-choice questions across 57 subjects, covering STEM, the humanities, the social sciences, and other domains.
- Evaluation method: 5-shot exact match (the model is shown 5 worked examples and must output the option letter; a minimal scoring sketch follows this list)
- Current SOTA: Approximately 90-92% (GPT-4o, Claude 3.5 Sonnet, etc. all reach around 88%; reasoning models like o1 exceed 90%)
- Human baseline: Domain experts approximately 89.8%
- Current status: Essentially saturated. Top models have approached or even surpassed human expert performance. Wikipedia notes that as of 2025, MMLU has been partially deprecated in favor of harder alternatives
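To make the protocol concrete, the following is a minimal sketch of 5-shot exact-match scoring. The record fields and prompt layout are illustrative assumptions, not the official evaluation harness.

```python
# Minimal sketch of MMLU-style 5-shot exact-match scoring.
# Field names ("question", "choices", "answer") and the prompt layout are
# illustrative assumptions, not the official harness.

def build_prompt(shots, target):
    """Five worked examples followed by the target question; the model is
    expected to continue after the final 'Answer:' with a single option letter."""
    blocks = []
    for ex in shots:
        options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", ex["choices"]))
        blocks.append(f"{ex['question']}\n{options}\nAnswer: {ex['answer']}")
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", target["choices"]))
    blocks.append(f"{target['question']}\n{options}\nAnswer:")
    return "\n\n".join(blocks)

def exact_match_accuracy(model_outputs, gold_letters):
    """Compare the first character of each continuation against the gold option letter."""
    hits = sum(out.strip().upper().startswith(gold)
               for out, gold in zip(model_outputs, gold_letters))
    return hits / len(gold_letters)
```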
MMLU’s main problems are noisy questions and the high random guessing probability with 4 options (25%), which gave rise to its upgraded successor.
MMLU-Pro (2024)
MMLU-Pro (Wang et al., 2024) is a NeurIPS 2024 Spotlight paper specifically designed to address MMLU’s shortcomings:
- Options expanded from 4 to 10, reducing the random guessing probability to 10%
- Removed noisy and overly simple questions
- Added questions requiring multi-step reasoning
- 12,032 questions in total
Key finding: Model accuracy on MMLU-Pro drops by 16 to 33 percentage points relative to MMLU, while prompt sensitivity falls from 4-5% to about 2%, indicating that MMLU-Pro is both harder and more robust.
ARC-Challenge (2018)
AI2 Reasoning Challenge (Clark et al., 2018) contains 2,590 grade-school science multiple-choice questions, filtered to the subset that both a retrieval baseline and a word co-occurrence baseline answered incorrectly. It is one of the earliest reasoning benchmarks; the current SOTA exceeds 96%, the benchmark is fully saturated, and it is now used primarily as a baseline comparison.
Reasoning Benchmarks: The Full Landscape
Reasoning benchmarks test a model’s logical reasoning, mathematical computation, and scientific reasoning capabilities. This is the area where benchmark evolution has been most dramatic — from 2021 to 2024, the community continuously released harder benchmarks to maintain discriminative power.
GSM8K (2021)
Grade School Math 8K contains approximately 8,500 grade-school math word problems (1,319 in the test set), each requiring 2-8 reasoning steps. It was once the entry-level reasoning benchmark, but the current SOTA has reached approximately 97% and it is fully saturated.
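To show what "solving" GSM8K means operationally, here is a minimal grading sketch. It assumes the common convention that reference solutions end with "#### <number>" and that the last number in the model's output is taken as its final answer; real harnesses use more careful extraction rules.

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(text: str):
    """Take the last number that appears in a chain-of-thought answer."""
    matches = NUMBER.findall(text)
    return matches[-1].replace(",", "") if matches else None

def grade_gsm8k(model_output: str, reference_solution: str) -> bool:
    """Reference solutions end with '#### <number>'; compare numerically."""
    gold = reference_solution.split("####")[-1].strip().replace(",", "")
    pred = extract_final_number(model_output)
    return pred is not None and float(pred) == float(gold)
```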
MATH / MATH-500 (2021)
MATH (Hendrycks et al., 2021) contains 12,500 competition-level math problems covering 7 domains including algebra, geometry, and number theory, with difficulty levels from 1 to 5. MATH-500 is a commonly used 500-question test subset.
- Current SOTA: Approximately 95% (o1 reaches 94.8%, o3 is higher)
- For comparison: GPT-4 (2023) scored only about 52%
- Current status: Top reasoning models are approaching saturation, but standard models still have a significant gap
AIME 2024
The actual problems from the 2024 American Invitational Mathematics Examination, totaling 30 questions (2 exams x 15 questions each). Answers are integers from 0 to 999, which makes the benchmark naturally resistant to guessing.
- Current SOTA: Approximately 87% (o3-mini with high reasoning effort reaches 87.3%; o1 with 64-sample majority voting reaches 83%, i.e. 12.5 of 15 problems on average, versus approximately 74% for a single attempt; see the voting sketch after this list)
- Current status: Still discriminative, effectively differentiating the mathematical capabilities of different reasoning models
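The "64-sample majority voting" setting above is plain self-consistency: sample many independent solutions, reduce each to its final integer, and submit the most frequent one. A minimal sketch, where `sample_answer` is a hypothetical helper that performs one stochastic generation and returns the extracted integer (or None):

```python
from collections import Counter

def majority_vote(sample_answer, problem, n_samples: int = 64):
    """Self-consistency for AIME-style problems: AIME answers are integers in 0-999,
    so invalid extractions are simply dropped before voting."""
    answers = [a for a in (sample_answer(problem) for _ in range(n_samples))
               if a is not None and 0 <= a <= 999]
    if not answers:
        return None  # no valid answer could be extracted from any sample
    return Counter(answers).most_common(1)[0][0]
```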
BBH (2022)
BIG-Bench Hard selects 23 difficult reasoning tasks (6,511 questions in total) from BIG-Bench's 204 tasks, chosen because language models performed below the average human rater on them. BBH's key contribution was demonstrating that CoT prompting can significantly improve reasoning task performance. Current SOTA is approximately 88-95%, approaching saturation.
GPQA Diamond (2023)
Graduate-Level Google-Proof Q&A (Rein et al., 2024) is currently the most authoritative high-difficulty reasoning benchmark. Diamond is its high-quality subset of 198 questions.
- Domain expert accuracy: approximately 65% (74% when discounting mistakes the experts themselves identified in retrospect)
- Non-experts + 30 minutes of searching: Only 34%
- Current SOTA: Approximately 88% (o3 reaches 87.7%); Gemini 2.5 Pro (thinking) claims 94.3%
- Current status: Reasoning models are beginning to surpass human expert levels, but the benchmark still differentiates models of varying capability
FrontierMath (2024)
Frontier-level research mathematics problems written by professional mathematicians and organized by Epoch AI. The problems are not publicly released, and expert mathematicians need hours to days to solve a single one.
- Current SOTA: o3 approximately 25%, other models below 2% (confirmed by Epoch AI)
- Current status: Far from saturated, currently the most challenging benchmark for AI mathematical capability
Benchmark Saturation Map
The scatter plot below shows the saturation status of these 10 knowledge and reasoning benchmarks. The higher a bubble, the closer it is to saturation; the larger it is, the more widely it is cited:
Key observations: Notice the “saturation zone” (>90%) in the upper right, where early benchmarks like MMLU, GSM8K, ARC, and HellaSwag cluster. FrontierMath in the lower left represents the current frontier of AI capability. With the emergence of reasoning models (o1/o3), benchmarks in the middle zone (GPQA, AIME, MATH-500) are rapidly moving toward the saturation line.
Trend Analysis: Benchmark Saturation and Iteration
From 2018 to 2025, knowledge and reasoning benchmarks show a clear saturation-replacement cycle:
- First wave (2018-2019): ARC-Challenge and HellaSwag launched as early reasoning benchmarks, still challenging in the GPT-2 era
- Second wave (2021): MMLU, GSM8K, and MATH became the primary benchmarks, serving as core metrics in the GPT-3/GPT-4 era
- Third wave (2022-2023): BBH and GPQA Diamond raised the difficulty bar, still maintaining discriminative power for GPT-4-class models
- Fourth wave (2024): MMLU-Pro, AIME 2024, and FrontierMath represent the latest generation of benchmarks, designed specifically for the reasoning model era (o1/o3)
Each wave of new benchmarks is driven by the same force: old benchmarks saturate -> lose discriminative power -> harder tests are needed. This is not a flaw but rather indirect evidence of the rapid improvement in LLM capabilities.
Deep Dive 1: MMLU-Pro
Why Deep Dive into MMLU-Pro?
MMLU-Pro is one of the most widely used knowledge evaluation benchmarks today. It addresses MMLU’s core problems and has been adopted as a standard reporting metric by major labs including OpenAI, Anthropic, Google, and Meta. Understanding MMLU-Pro’s design and evaluation methodology helps correctly interpret the evaluation data in virtually every new model release.
Dataset Composition
MMLU-Pro contains 12,032 questions distributed across 14 subject areas:
| Domain | Example Subjects | Key Characteristics |
|---|---|---|
| STEM | Physics, Chemistry, Mathematics, Engineering, Computer Science, Biology | Highest proportion, strong reasoning demands |
| Social Sciences | Economics, Psychology, Law | Requires domain knowledge + analysis |
| Humanities | Philosophy, History | Some questions involve critical thinking |
| Other | Business, Health | Application-oriented |
Key differences from MMLU:
- Number of options: 4 -> 10. This change has far-reaching implications — it not only reduces guessing probability but also requires stronger distractor elimination ability from models
- Question filtering: Removed noisy questions from the original MMLU that were flagged by multiple annotators as “ambiguous” or “potentially incorrect”
- Reasoning demands: Added a large number of questions that require multi-step reasoning rather than pure knowledge recall
Evaluation Protocol: 5-shot CoT
MMLU-Pro’s standard evaluation protocol is 5-shot Chain-of-Thought (CoT):
- Provide 5 example questions from the same subject in the prompt, each including a complete reasoning process and final answer
- The model must first write out its reasoning steps for the new question, then provide the option letter
- Extract the final option from the model’s output and do exact match against the standard answer
Why use CoT instead of direct answering? One of the paper's core findings is that on MMLU-Pro, CoT improves accuracy by an average of 10-20 percentage points over direct answering, far more than the difference on the original MMLU (approximately 0-2%). This demonstrates that MMLU-Pro genuinely measures reasoning ability, not just knowledge recall.
Example question: In free fall, an object is dropped from rest. Ignoring air resistance, which of the following values is closest to the object's speed after 2 seconds?
In direct-answer mode the model commits to an option immediately and, in this example, picks the incorrect option B with only 18% confidence. With CoT it first works through the physics, v = g · t = 9.8 m/s² × 2 s ≈ 19.6 m/s, then maps that result onto the options and selects the correct option D with 82% confidence.
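In practice, the "extract the final option" step of the protocol is usually a regular expression applied to the chain of thought. A minimal sketch, assuming the prompt asks the model to finish with a phrase like "The answer is (X)"; the exact pattern and fallback rules differ between harnesses:

```python
import re

# MMLU-Pro has up to 10 options, so valid letters run from A to J.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-J])\)?", re.IGNORECASE)

def extract_choice(cot_output: str):
    """Return the predicted option letter, or None if nothing matches."""
    match = ANSWER_PATTERN.search(cot_output)
    if match:
        return match.group(1).upper()
    # Fallback: the last standalone capital letter A-J anywhere in the output.
    letters = re.findall(r"\b([A-J])\b", cot_output)
    return letters[-1] if letters else None

def cot_accuracy(outputs, gold_letters):
    return sum(extract_choice(o) == g for o, g in zip(outputs, gold_letters)) / len(gold_letters)
```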
The Deeper Reason Behind CoT vs Direct
Why do 10 options amplify the advantage of CoT?
- 4 options: Even if the model is uncertain, it has a 25% chance of guessing correctly. The “surface accuracy” of direct answering is artificially inflated
- 10 options: Guessing probability drops to 10%, so the model must genuinely understand the question to select correctly. CoT's reasoning chain helps the model progressively narrow down the candidates, which produces a large effect (quantified in the sketch after this list)
- Prompt sensitivity: Changing the prompt template on MMLU can cause 4-5% score fluctuations, while MMLU-Pro reduces this to approximately 2%, meaning evaluation results are more reliable
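A quick back-of-the-envelope model makes the first two points concrete: assume the model genuinely knows a fraction p of the questions and guesses uniformly on the rest, so its observed accuracy is p + (1 - p)/k with k options.

```python
def observed_accuracy(p_known: float, n_options: int) -> float:
    """Expected score if a fraction p_known is truly known and the rest is guessed."""
    return p_known + (1 - p_known) / n_options

for k in (4, 10):
    print(f"{k} options:", [round(observed_accuracy(p, k), 3) for p in (0.3, 0.5, 0.7)])
# 4 options: [0.475, 0.625, 0.775]   <- guessing inflates the score by up to 17.5 points
# 10 options: [0.37, 0.55, 0.73]     <- far less inflation; scores track real knowledge more closely
```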
Deep Dive 2: GPQA Diamond
Why Deep Dive into GPQA Diamond?
GPQA Diamond is the key benchmark for differentiating reasoning model capabilities today. In the evaluation reports of reasoning models like o1, o3, and Gemini 2.5 Pro, GPQA Diamond almost invariably appears. Its unique value lies in the fact that even when non-experts are given a search engine and 30 minutes, their accuracy is only 34% — this benchmark is truly “Google-Proof.”
Dataset Composition
GPQA (Rein et al., 2024) contains 448 PhD-level science multiple-choice questions covering three domains:
| Domain | Proportion | Example Topics |
|---|---|---|
| Physics | ~33% | Quantum mechanics, statistical mechanics, particle physics |
| Chemistry | ~33% | Organic chemistry, quantum chemistry, thermodynamics |
| Biology | ~33% | Molecular biology, genetics, ecology |
Diamond is the most rigorously filtered 198-question subset — every question has undergone an “adversarial validation” process.
Adversarial Validation Process
GPQA’s question production pipeline is its core innovation:
1. Question authors (domain experts: PhDs or PhD students in the relevant field) write a multiple-choice question
2. Same-domain validators (other domain experts) attempt to solve it — they should be able to answer correctly
3. Cross-domain validators (PhDs from a different field) attempt to solve it with 30 minutes of Google searching — they should fail
4. Only questions where experts answer correctly and non-experts still fail after searching enter the Diamond subset (a filtering sketch in code follows these steps)
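A minimal sketch of the resulting filter, assuming per-question lists of validator outcomes. The field names and the all/none thresholds are illustrative simplifications; the actual GPQA release applies specific validator counts and post-hoc answer reviews.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Question:
    text: str
    expert_correct: List[bool]      # same-domain PhD validators who answered correctly
    nonexpert_correct: List[bool]   # out-of-domain PhDs, each given 30 minutes of web search

def is_diamond(q: Question) -> bool:
    """Keep only questions that experts solve and searched non-experts still miss."""
    experts_agree = bool(q.expert_correct) and all(q.expert_correct)
    nonexperts_fail = not any(q.nonexpert_correct)
    return experts_agree and nonexperts_fail

# Usage over a hypothetical `questions` list; roughly 198 of the 448 questions survive:
# diamond = [q for q in questions if is_diamond(q)]
```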
This adversarial design ensures two key properties:
- Difficulty guarantee: These are not simple factual questions you can “just look up,” but reasoning questions requiring deep domain understanding
- Google-Proof: Cannot be “cheated” by searching for answers, which also means the impact of data contamination is relatively small
Why Is GPQA Diamond Hard?
Taking physics as an example, a typical GPQA Diamond question might require:
- Understanding the Hamiltonian operator in quantum mechanics
- Applying a specific approximation method (e.g., perturbation theory)
- Performing multi-step mathematical derivation
- Eliminating 3 out of 4 carefully crafted options (distractors are based on common reasoning errors)
This combination of deep reasoning + specialized knowledge makes it challenging even for GPT-4-class models (~39% on the full GPQA set). However, the emergence of reasoning models is changing the landscape: o3 reaches 87.7%, surpassing the human domain expert level of 74%.
Evaluation Method
GPQA Diamond typically uses 0-shot or few-shot CoT:
- The model receives the complete question and 4 options
- It must first write out its reasoning process, then provide the option
- Exact match scoring
It is worth noting that GPQA Diamond scores across different reports may use different shot settings and CoT strategies; when comparing results, one should verify that the evaluation protocols are consistent.
Transition: From Knowledge and Reasoning to Code and Agents
Knowledge and reasoning benchmarks form the cornerstone of LLM evaluation — they measure a model’s fundamental ability to “understand the world” and “think logically.” But in practical applications, we also care about whether a model can write correct code and whether it can use tools to complete complex tasks.
The next article, Code Benchmarks Deep Dive, will delve into HumanEval, SWE-bench, LiveCodeBench, and other code evaluation benchmarks, exploring the complete evaluation spectrum from “function-level code generation” to “project-level software engineering.”
If you are more interested in Agent and tool-use evaluation, you can jump directly to Agent Benchmarks Deep Dive.