
Anatomy of Model Release Benchmark Standard Sets


Updated 2026-04-14

Opening: Why Are Releases Always About the Same Benchmarks?

Every time a new model is released, the technical report includes a big table — MMLU score here, HumanEval score there, GPQA Diamond score over there… If you look closely, you’ll notice that the benchmarks different vendors choose to report overlap heavily, yet are never quite identical. Two key questions lurk behind this pattern:

  1. The shared choices: Why these benchmarks and not others? How did they become the “standard set”?
  2. The deliberate omissions: When a model doesn’t report a certain score, it’s usually not an oversight — the score just doesn’t look good. What you don’t report is just as telling as what you do.

This article systematically examines the current standard set of benchmarks for model releases, analyzes how evaluation frameworks differ between frontier models and small models, and provides an interactive matrix that lets you see each model’s strengths, weaknesses, and “blank spots” at a glance.

Timeliness disclaimer: This article is based on model releases up to early 2025. The benchmark ecosystem evolves rapidly, and specific scores and popular evaluation suites change over time. Our focus is on the selection logic and analytical methods, not on tracking the latest rankings.

Evolution of the Standard Set: From the “Big Four” to Full Coverage

2023: The “Big Four” Era

In the early days of the LLM boom sparked by ChatGPT, model technical reports typically only needed to report four core benchmarks:

| Benchmark | What It Tests | Why It’s Standard |
|---|---|---|
| MMLU | Knowledge breadth across 57 subjects | The most widely cited knowledge baseline |
| HumanEval | Function-level code generation | The pass@1 paradigm is clean and unambiguous |
| GSM8K | Elementary math reasoning | The “entry exam” for reasoning ability |
| HellaSwag | Commonsense reasoning / language understanding | NLU baseline, high discriminating power in the early days |
These four benchmarks plus Chatbot Arena’s ELO ranking essentially formed a model’s “resume.”
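The pass@1 paradigm mentioned above has a standard unbiased estimator (popularized by the original HumanEval work); a minimal sketch, assuming `n` generations per problem of which `c` pass the unit tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations passed the unit tests."""
    if n - c < k:
        return 1.0  # too few failures left to fill k slots: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one generation per problem, pass@1 reduces to plain accuracy:
print(pass_at_k(1, 1, 1))   # 1.0
print(pass_at_k(10, 3, 1))  # 0.3 -- expected pass@1 from 10 samples, 3 correct
```

In practice vendors report pass@1 averaged over all problems in the benchmark; sampling more generations per problem only tightens the estimate.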

2024: Expansion and Divergence

As model capabilities improved, the original benchmarks began to saturate (ceiling effect). GPT-4-class models hit 90%+ on both MMLU and GSM8K, and discriminating power dropped sharply. In response:

  • MMLU → MMLU-Pro: Expanded from 4 options to 10, introduced stronger reasoning requirements, reduced prompt sensitivity from 4-5% to 2%
  • GSM8K → MATH-500 / AIME: From elementary math to competition-level, re-opening the gap
  • HumanEval → SWE-bench Verified: From function-level to project-level, testing real software engineering ability
  • New Agent dimension: BFCL (function calling) and GAIA (multi-step interaction) started appearing
  • New IFEval: A dedicated test for instruction-following ability
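One way to see why the 4-to-10 option change in MMLU-Pro matters: the random-guess floor drops from 25% to 10%, widening the usable score range before the ceiling. A quick illustration:

```python
# Random-guess accuracy floor for a multiple-choice benchmark.
def chance_floor(num_options: int) -> float:
    return 1.0 / num_options

# Usable headroom between guessing and a perfect score:
mmlu_headroom = 1.0 - chance_floor(4)       # 0.75 for 4-option MMLU
mmlu_pro_headroom = 1.0 - chance_floor(10)  # 0.90 for 10-option MMLU-Pro
print(mmlu_headroom, mmlu_pro_headroom)
```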

2025: The Current Standard

As of now, a frontier model release must cover at minimum the following dimensions:

| Dimension | Must-Report Benchmarks | Nice-to-Have |
|---|---|---|
| Knowledge | MMLU / MMLU-Pro | IFEval |
| Reasoning | GPQA Diamond, MATH-500 | AIME, BBH |
| Code | HumanEval, SWE-bench Verified | LiveCodeBench |
| Agent | BFCL or similar | GAIA, WebArena |
| Preference | Chatbot Arena ELO | MT-Bench |

This is the “benchmark standard set” — it wasn’t mandated by any institution but emerged naturally through competitive equilibrium: whatever you don’t report, your competitors will analyze why you didn’t.

Frontier Model Comparison: The Must-Report Intersection and Strategic Omissions

The Shared Must-Report Items Across Four Players

Analyzing the technical reports of four frontier models — GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 1.5 Pro (Google), and Llama 3.1 405B (Meta) — we can extract the benchmarks that all four report:

  • MMLU (or MMLU-Pro): The “consensus baseline” for knowledge breadth
  • GPQA Diamond: The new gold standard for PhD-level reasoning
  • MATH (or MATH-500): The core metric for mathematical reasoning
  • HumanEval: The “de facto standard” for code generation
  • BBH: An important supplement for comprehensive reasoning

These five form the current minimum must-report set for frontier models.

“What’s Not Reported” Analysis

Even more interesting are each player’s strategic omissions:

Claude 3.5 Sonnet doesn’t report AIME or ARC-C. Claude leads SWE-bench by a wide margin at 49.0% (vs. GPT-4o’s 38.4%), yet omits AIME — hinting that it may not hold an advantage in the hardest math competitions. It also omits BFCL, effectively signaling on the function-calling dimension that “we focus more on code and reasoning.”

Gemini 1.5 Pro doesn’t report specific scores for some benchmarks, particularly SWE-bench and GAIA. Google’s strategy emphasizes multimodality and long context (million-token context window) rather than competing head-to-head on text benchmarks. However, the Gemini 1.5 technical report does include evaluation data for BFCL and IFEval.

GPT-4o has the most comprehensive coverage — it reports scores for nearly all mainstream benchmarks, reflecting OpenAI’s confidence as the industry standard-bearer: there are no weaknesses that “need to be hidden.”

Llama 3.1 405B, as an open-source model, reports a very comprehensive set of scores, but is missing AIME and GAIA. Open-source models have a unique advantage: even if you don’t report a score, the community will run the benchmarks for you.

Key insight: When you see a model’s evaluation report, first count which benchmarks it reports, then think about which ones it didn’t. The omissions are information in themselves.
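This counting exercise is easy to mechanize. A minimal sketch, using a handful of scores quoted in this article (`None` standing in for N/R):

```python
# Benchmark coverage per model; None = not reported (N/R).
reports = {
    "GPT-4o":            {"MMLU": 88.7, "AIME": 9.3,  "SWE-bench": 38.4, "BFCL": 88.5},
    "Claude 3.5 Sonnet": {"MMLU": 88.7, "AIME": None, "SWE-bench": 49.0, "BFCL": None},
    "Gemini 1.5 Pro":    {"MMLU": 85.9, "AIME": None, "SWE-bench": None, "BFCL": None},
    "Llama 3.1 405B":    {"MMLU": 87.3, "AIME": None, "SWE-bench": 33.2, "BFCL": None},
}

for model, scores in reports.items():
    omitted = [b for b, s in scores.items() if s is None]
    reported = len(scores) - len(omitted)
    print(f"{model}: reports {reported}/{len(scores)}, omits {omitted}")
```

Running this over a full benchmark matrix turns the “strategic omission” pattern into a concrete coverage score per vendor.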

Small Model Evaluation: Different Rules of the Game

Frontier models and small models (≤10B parameters) face different levels of competition. Small model evaluation has several key differences:

1. More Conservative Benchmark Selection

Small models typically don’t report benchmarks like SWE-bench or GAIA that require complex multi-step reasoning — not because they’re hiding anything, but because these tasks are too difficult for small models, and reporting single-digit scores would have no reference value.

2. Different Competitors

The comparison targets for small models are other small models in the same tier, not GPT-4o. So in the Gemma 2 9B report, you’ll see comparisons with Llama 3 8B and Mistral 7B, not with Claude 3.5 Sonnet.

3. “Efficiency Ratio” as the Core Narrative

The selling point for small models isn’t having the highest absolute scores, but rather “achieving a 30B model’s performance with only 9B parameters.” So the evaluation focuses on:

  • Who’s best within the same parameter tier
  • How much improvement there is over the previous generation at the same tier
  • Which tasks have the best “performance-per-parameter”
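The efficiency framing can be made concrete as score points per billion parameters — a crude illustration, not a rigorous metric, using MMLU numbers quoted in this article and parameter counts from the model names:

```python
# MMLU score per billion parameters -- a rough "efficiency" ratio.
models = {
    "Phi-3 Mini":   (70.9, 3.8),  # (MMLU %, params in B)
    "Gemma 2 9B":   (71.3, 9.0),
    "Qwen 2.5 7B":  (74.2, 7.0),
    "Llama 3.1 8B": (69.4, 8.0),
}

# Sort descending by points-per-B; Phi-3 Mini tops this particular ratio.
for name, (mmlu, params) in sorted(models.items(), key=lambda kv: -kv[1][0] / kv[1][1]):
    print(f"{name}: {mmlu / params:.1f} MMLU pts / B params")
```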

4. Deployment Scenario-Driven

Small models care more about practical usability on edge devices and endpoints. So some reports will additionally test:

  • Inference speed and memory usage
  • Accuracy retention after quantization
  • Performance in specific languages or domains

Full Comparison of Small Models from Major Vendors

The matrix below shows 9 representative models (4 frontier + 5 small) across 14 mainstream benchmarks. Cells marked N/R indicate that the model did not report that score — pay special attention to these blank areas.

Model × Benchmark Heat Matrix

Columns group into five dimensions: Knowledge (MMLU, MMLU-Pro, IFEval), Reasoning (GPQA Diamond, MATH-500, AIME 2024, BBH, ARC-C), Code (HumanEval, SWE-bench Verified, LiveCodeBench), Agent (BFCL, GAIA), and Preference (Arena ELO).

| Model | MMLU | MMLU-Pro | IFEval | GPQA Diamond | MATH-500 | AIME 2024 | BBH | ARC-C | HumanEval | SWE-bench Verified | LiveCodeBench | BFCL | GAIA | Arena ELO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frontier models | | | | | | | | | | | | | | |
| GPT-4o | 88.7 | 72.6 | 84.3 | 56.1 | 76.6 | 9.3 | 83.6 | 96.4 | 90.2 | 38.4 | N/R | 88.5 | 40.5 | 1285 |
| Claude 3.5 Sonnet | 88.7 | 78 | 88 | 65 | 78.3 | N/R | 93.1 | N/R | 92 | 49 | N/R | N/R | N/R | 1271 |
| Gemini 1.5 Pro | 85.9 | 69 | N/R | 46.2 | 67.7 | N/R | 89.2 | N/R | 84.1 | N/R | N/R | N/R | N/R | 1260 |
| Llama 3.1 405B | 87.3 | 73.3 | 88.6 | 50.7 | 73.8 | N/R | 85.9 | 96.9 | 89 | 33.2 | N/R | N/R | N/R | 1253 |
| Small models (≤10B) | | | | | | | | | | | | | | |
| Gemma 2 9B | 71.3 | N/R | N/R | N/R | 36.6 | N/R | 68.2 | 68.4 | 40.2 | N/R | N/R | N/R | N/R | 1187 |
| Phi-3 Mini 3.8B | 70.9 | N/R | N/R | 30.6 | N/R | N/R | 73.5 | 86.3 | 57.3 | N/R | N/R | N/R | N/R | N/R |
| Qwen 2.5 7B | 74.2 | 56.3 | 71.2 | 36.4 | 75.5 | N/R | 70.4 | N/R | 84.8 | N/R | N/R | N/R | N/R | N/R |
| Llama 3.1 8B | 69.4 | 48.3 | 80.4 | 30.4 | 51.9 | N/R | 64.2 | 83.4 | 72.6 | N/R | N/R | N/R | N/R | 1176 |
| Mistral 7B | 62.5 | N/R | N/R | N/R | N/R | N/R | N/R | 78.5 | 32.9 | N/R | N/R | N/R | N/R | 1072 |

N/R = not reported (often hinting at a weak spot).

Reading tip: in the original interactive version of this matrix, colors are normalized independently within each column — green marks a relatively high score within that column, not an absolute level — so colors should never be compared across columns.
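The per-column normalization is just min-max scaling within each benchmark column, skipping N/R cells; a minimal sketch:

```python
def normalize_column(scores):
    """Min-max normalize one benchmark column, ignoring N/R (None) cells."""
    reported = [s for s in scores if s is not None]
    lo, hi = min(reported), max(reported)
    span = hi - lo or 1.0  # avoid division by zero when all scores are equal
    return [None if s is None else (s - lo) / span for s in scores]

# HumanEval column for the five small models
# (Gemma 2 9B, Phi-3 Mini, Qwen 2.5 7B, Llama 3.1 8B, Mistral 7B):
print(normalize_column([40.2, 57.3, 84.8, 72.6, 32.9]))
# Qwen 2.5 7B maps to 1.0 (column max), Mistral 7B to 0.0 (column min)
```

This is why a deep green cell in a weak column (e.g. small-model HumanEval) can represent a far lower absolute score than a pale cell in a strong column.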

Gemma 2 9B (Google)

Google’s Gemma 2 reports a relatively traditional set of benchmarks: MMLU, ARC-C, BBH (BIG-Bench Hard), HumanEval, and MATH. Notably, Gemma does not report MMLU-Pro, GPQA Diamond, or IFEval. Its HumanEval score of only 40.2% is significantly lower than the similarly-sized Qwen 2.5 7B (84.8%) and Llama 3.1 8B (72.6%) — code generation is a clear weakness for Gemma 2 9B.

Phi-3 Mini 3.8B (Microsoft)

Microsoft’s Phi-3 achieves an impressive MMLU of 70.9% with only 3.8B parameters — nearly matching the 9B-class Gemma 2 (71.3%). Phi-3’s ARC-C of 86.3% is the highest among small models, and its BBH of 73.5% is also notable. However, Phi-3 does not report MMLU-Pro, IFEval, or MATH-500, and its HumanEval is only 57.3%. Microsoft’s narrative is “achieving high-quality reasoning with minimal parameters,” but the code and instruction-following dimensions are intentionally downplayed.

Qwen 2.5 7B (Alibaba)

Qwen 2.5 is one of the most comprehensively reported small models — covering MMLU, MMLU-Pro, IFEval, GPQA Diamond, MATH-500, BBH, and HumanEval. Particularly outstanding is MATH-500 at 75.5%, far ahead among small models and even approaching frontier-model levels (GPT-4o is 76.6%). HumanEval at 84.8% is also the highest among small models. Qwen’s weakness is GPQA Diamond (36.4%), but at least it chose to report rather than hide the score.

Llama 3.1 8B (Meta)

Meta’s Llama 3.1 8B benefits from the open-source ecosystem and is the most thoroughly tested model by third parties. The official report covers MMLU, MMLU-Pro, GPQA Diamond, MATH-500, BBH, HumanEval, IFEval, and ARC-C — nearly the most comprehensive coverage among small models. Scores are balanced but without particularly outstanding areas: MMLU at 69.4% (only above Mistral 7B’s 62.5%), though IFEval at 80.4% is relatively strong.

Mistral 7B (Mistral AI)

As an earlier model, Mistral 7B reports the fewest benchmarks — only MMLU (62.5%), ARC-C (78.5%), and HumanEval (32.9%). The large number of “N/R” entries reflects the fact that when Mistral 7B was released (September 2023), the benchmark standard set had not yet formed. This also illustrates the temporal evolution of the standard set: scores that didn’t need to be reported in 2023 had become mandatory by 2024.

Three Major Pitfalls of Score Comparability

When comparing numbers in the matrix, there are several critical comparability issues to be aware of:

1. Prompt Template Differences

The same benchmark can yield 3-5% score differences depending on the prompt template. For example, the classic MMLU question format:

The following is a multiple choice question...
A. ...  B. ...  C. ...  D. ...
Answer:

But some vendors add a system prompt, some adjust the option format, and some use a chat template instead of a raw prompt. The importance of HuggingFace’s Open LLM Leaderboard lies precisely in the fact that it standardizes the evaluation pipeline.
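To make the template sensitivity concrete, here are two plausible formattings of the same MMLU-style item — both hypothetical illustrations, not any vendor's exact template — which a model may score measurably differently on:

```python
question = "Which planet is largest?"
options = ["Mercury", "Jupiter", "Mars", "Venus"]

# Variant 1: classic raw-completion prompt, one option per line
raw = "The following is a multiple choice question...\n"
raw += question + "\n"
raw += "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
raw += "\nAnswer:"

# Variant 2: chat-style, with a system prompt and options inlined
chat = [
    {"role": "system", "content": "Answer with the letter of the correct option."},
    {"role": "user", "content": f"{question} Options: " + " | ".join(options)},
]

print(raw)
```

Both encode identical information, yet differ in system prompt, option layout, and completion target — exactly the degrees of freedom behind the 3-5% score swings described above.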

2. Inconsistent Few-Shot Counts

MMLU has two common protocols — 0-shot and 5-shot — with score differences of 3-8%. When you see “MMLU: 88%,” you must verify which protocol was used. The scores in our matrix use each model’s official reported values and protocols wherever possible, but protocols across different models may not be fully consistent — this is also why our colors represent relative ranking within each column, not absolute value comparisons.
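The shot-count difference is simply how many solved examples are prepended before the test question; a minimal sketch of a k-shot prompt builder (the demo items are made up):

```python
def build_prompt(examples, test_q, k=5):
    """Prepend k solved examples before the test question (k=0 -> zero-shot)."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples[:k])
    tail = f"Q: {test_q}\nA:"
    return f"{shots}\n\n{tail}" if shots else tail

demos = [("2+2?", "4"), ("Capital of France?", "Paris")]
zero_shot = build_prompt(demos, "3+5?", k=0)
two_shot = build_prompt(demos, "3+5?", k=2)
print(len(zero_shot), len(two_shot))  # the few-shot prompt is much longer
```

The extra demonstrations both prime the answer format and consume context, which is why 0-shot and 5-shot numbers for the same model are not comparable.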

3. Evaluation Tool Versions

Different versions of lm-evaluation-harness implement certain benchmarks differently. In particular, the upgrade from Harness v0.3 to v0.4 changed the prompt templates for multiple tasks. If you want to make strict apples-to-apples comparisons, be sure to use the same version of the harness and re-evaluate from scratch.

Practical advice: Don’t obsess over fractions of a point. If two models differ by less than 2 percentage points on a given benchmark, they can essentially be considered “on par.” What truly matters is differences of 5 points or more and the overall coverage of capabilities.
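The “under 2% is a tie” heuristic is backed by simple binomial noise. Assuming independent questions, a model scoring 75% on a 500-question set (e.g. MATH-500) carries roughly ±3.8 points of sampling error at 95% confidence, so a 2-point gap is well within noise:

```python
from math import sqrt

def ci95_halfwidth(acc: float, n: int) -> float:
    """95% normal-approximation half-width for accuracy measured on n questions."""
    return 1.96 * sqrt(acc * (1 - acc) / n)

# MATH-500: a model scoring 75% has roughly +/- 3.8 points of pure sampling noise
print(round(100 * ci95_halfwidth(0.75, 500), 1))
```

Comparing two models doubles the variance, making small gaps even less meaningful — another reason to weight 5-point differences far more heavily than 1-point ones.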

Looking Ahead: The Future of the Standard Set

The current benchmark standard set is still evolving rapidly. Several trends worth watching:

  1. MMLU’s exit: Due to data quality issues and saturation effects, MMLU is being replaced by MMLU-Pro. Open LLM Leaderboard v2 has already swapped MMLU for MMLU-Pro
  2. The rise of agent evaluation: As LLMs move from “answering questions” to “executing tasks,” benchmarks like BFCL, GAIA, and SWE-bench are becoming increasingly important
  3. Dynamic benchmarks becoming standard: The dynamic update strategies of LiveCodeBench and LiveBench are becoming the standard approach for contamination prevention
  4. Multimodal expansion: Multimodal benchmarks like MMMU and MathVista are starting to appear in technical reports

Next Steps

This article established a comprehensive understanding of model release benchmarks. The next article, Impact of Optimization on Accuracy, examines benchmark scores from a different angle — when we apply optimizations like quantization and distillation to models, the accuracy loss varies dramatically across different benchmarks. This is critical for choosing edge deployment strategies.