
Anatomy of Model Release Benchmark Standard Sets


Updated 2026-04-14

Opening: Why Are Releases Always About the Same Benchmarks?

Every time a new model is released, the technical report includes a big table — MMLU score here, HumanEval score there, GPQA Diamond score over there… If you look closely, you’ll notice that the benchmarks different vendors choose to report overlap heavily, yet are never quite identical. Two key questions lurk behind this pattern:

  1. The shared choices: Why these benchmarks and not others? How did they become the “standard set”?
  2. The deliberate omissions: When a model doesn’t report a certain score, it’s usually not an oversight — the score just doesn’t look good. What you don’t report is just as telling as what you do.

This article systematically examines the current standard set of benchmarks for model releases, analyzes how evaluation frameworks differ between frontier models and small models, and provides an interactive matrix that lets you see each model’s strengths, weaknesses, and “blank spots” at a glance.

Timeliness disclaimer: This article is based on model releases up to early 2025. The benchmark ecosystem evolves rapidly, and specific scores and popular evaluation suites change over time. Our focus is on the selection logic and analytical methods, not on tracking the latest rankings.

Evolution of the Standard Set: From the “Big Four” to Full Coverage

2023: The “Big Four” Era

In the early days of the LLM boom sparked by ChatGPT, model technical reports typically only needed to report four core benchmarks:

| Benchmark | What It Tests | Why It’s Standard |
|---|---|---|
| MMLU | Knowledge breadth across 57 subjects | The most widely cited knowledge baseline |
| HumanEval | Function-level code generation | The pass@1 paradigm is clean and unambiguous |
| GSM8K | Elementary math reasoning | The “entry exam” for reasoning ability |
| HellaSwag | Commonsense reasoning / language understanding | NLU baseline, high discriminating power in the early days |
These four benchmarks plus Chatbot Arena’s ELO ranking essentially formed a model’s “resume.”
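The pass@1 paradigm mentioned above has a standard unbiased estimator (popularized by the original HumanEval work); a minimal sketch, assuming `n` generations per problem of which `c` pass the unit tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations passed the unit tests."""
    if n - c < k:
        return 1.0  # too few failures left to fill k slots: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one generation per problem, pass@1 reduces to plain accuracy:
print(pass_at_k(1, 1, 1))   # 1.0
print(pass_at_k(10, 3, 1))  # 0.3 -- expected pass@1 from 10 samples, 3 correct
```

In practice vendors report pass@1 averaged over all problems in the benchmark; sampling more generations per problem only tightens the estimate.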

2024: Expansion and Divergence

As model capabilities improved, the original benchmarks began to saturate (ceiling effect). GPT-4-class models hit 90%+ on both MMLU and GSM8K, and discriminating power dropped sharply. In response:

  • MMLU → MMLU-Pro: Expanded from 4 options to 10, introduced stronger reasoning requirements, reduced prompt sensitivity from 4-5% to 2%
  • GSM8K → MATH-500 / AIME: From elementary math to competition-level, re-opening the gap
  • HumanEval → SWE-bench Verified: From function-level to project-level, testing real software engineering ability
  • New Agent dimension: BFCL (function calling) and GAIA (multi-step interaction) started appearing
  • New IFEval: A dedicated test for instruction-following ability
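One way to see why the 4-to-10 option change in MMLU-Pro matters: the random-guess floor drops from 25% to 10%, widening the usable score range before the ceiling. A quick illustration:

```python
# Random-guess accuracy floor for a multiple-choice benchmark.
def chance_floor(num_options: int) -> float:
    return 1.0 / num_options

# Usable headroom between guessing and a perfect score:
mmlu_headroom = 1.0 - chance_floor(4)       # 0.75 for 4-option MMLU
mmlu_pro_headroom = 1.0 - chance_floor(10)  # 0.90 for 10-option MMLU-Pro
print(mmlu_headroom, mmlu_pro_headroom)
```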

2025: The Current Standard

As of now, a frontier model release must cover at minimum the following dimensions:

| Dimension | Must-Report Benchmarks | Nice-to-Have |
|---|---|---|
| Knowledge | MMLU / MMLU-Pro | IFEval |
| Reasoning | GPQA Diamond, MATH-500 | AIME, BBH |
| Code | HumanEval, SWE-bench Verified | LiveCodeBench |
| Agent | BFCL or similar | GAIA, WebArena |
| Preference | Chatbot Arena ELO | MT-Bench |

This is the “benchmark standard set” — it wasn’t mandated by any institution but emerged naturally through competitive equilibrium: whatever you don’t report, your competitors will analyze why you didn’t.

Frontier Model Comparison: The Must-Report Intersection and Strategic Omissions

The Shared Must-Report Items Across Four Players

Analyzing the technical reports of four frontier models — GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 1.5 Pro (Google), and Llama 3.1 405B (Meta) — we can extract the benchmarks that all four report:

  • MMLU (or MMLU-Pro): The “consensus baseline” for knowledge breadth
  • GPQA Diamond: The new gold standard for PhD-level reasoning
  • MATH (or MATH-500): The core metric for mathematical reasoning
  • HumanEval: The “de facto standard” for code generation
  • BBH: An important supplement for comprehensive reasoning

These five form the current minimum must-report set for frontier models.

“What’s Not Reported” Analysis

Even more interesting are each player’s strategic omissions:

Claude 3.5 Sonnet doesn’t report AIME or ARC-C. Claude leads SWE-bench by a wide margin at 49.0% (vs. GPT-4o’s 38.4%), yet omits AIME — hinting that it may not hold an advantage in the hardest math competitions. It also omits BFCL, effectively signaling on the function-calling dimension that “we focus more on code and reasoning.”

Gemini 1.5 Pro doesn’t report specific scores for some benchmarks, particularly SWE-bench and GAIA. Google’s strategy emphasizes multimodality and long context (million-token context window) rather than competing head-to-head on text benchmarks. However, the Gemini 1.5 technical report does include evaluation data for BFCL and IFEval.

GPT-4o has the most comprehensive coverage — it reports scores for nearly all mainstream benchmarks, reflecting OpenAI’s confidence as the industry standard-bearer: there are no weaknesses that “need to be hidden.”

Llama 3.1 405B, as an open-source model, reports a very comprehensive set of scores, but is missing AIME and GAIA. Open-source models have a unique advantage: even if you don’t report a score, the community will run the benchmarks for you.

Key insight: When you see a model’s evaluation report, first count which benchmarks it reports, then think about which ones it didn’t. The omissions are information in themselves.
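This counting exercise is easy to mechanize. A minimal sketch, using a handful of scores quoted in this article (`None` standing in for N/R):

```python
# Benchmark coverage per model; None = not reported (N/R).
reports = {
    "GPT-4o":            {"MMLU": 88.7, "AIME": 9.3,  "SWE-bench": 38.4, "BFCL": 88.5},
    "Claude 3.5 Sonnet": {"MMLU": 88.7, "AIME": None, "SWE-bench": 49.0, "BFCL": None},
    "Gemini 1.5 Pro":    {"MMLU": 85.9, "AIME": None, "SWE-bench": None, "BFCL": None},
    "Llama 3.1 405B":    {"MMLU": 87.3, "AIME": None, "SWE-bench": 33.2, "BFCL": None},
}

for model, scores in reports.items():
    omitted = [b for b, s in scores.items() if s is None]
    reported = len(scores) - len(omitted)
    print(f"{model}: reports {reported}/{len(scores)}, omits {omitted}")
```

Running this over a full benchmark matrix turns the “strategic omission” pattern into a concrete coverage score per vendor.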

Small Model Evaluation: Different Rules of the Game

Frontier models and small models (≤10B parameters) face different levels of competition. Small model evaluation has several key differences:

1. More Conservative Benchmark Selection

Small models typically don’t report benchmarks like SWE-bench or GAIA that require complex multi-step reasoning — not because they’re hiding anything, but because these tasks are too difficult for small models, and reporting single-digit scores would have no reference value.

2. Different Competitors

The comparison targets for small models are other small models in the same tier, not GPT-4o. So in the Gemma 2 9B report, you’ll see comparisons with Llama 3 8B and Mistral 7B, not with Claude 3.5 Sonnet.

3. “Efficiency Ratio” as the Core Narrative

The selling point for small models isn’t having the highest absolute scores, but rather “achieving a 30B model’s performance with only 9B parameters.” So the evaluation focuses on:

  • Who’s best within the same parameter tier
  • How much improvement there is over the previous generation at the same tier
  • Which tasks have the best “performance-per-parameter”
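The efficiency framing can be made concrete as score points per billion parameters — a crude illustration, not a rigorous metric, using MMLU numbers quoted in this article and parameter counts from the model names:

```python
# MMLU score per billion parameters -- a rough "efficiency" ratio.
models = {
    "Phi-3 Mini":   (70.9, 3.8),  # (MMLU %, params in B)
    "Gemma 2 9B":   (71.3, 9.0),
    "Qwen 2.5 7B":  (74.2, 7.0),
    "Llama 3.1 8B": (69.4, 8.0),
}

# Sort descending by points-per-B; Phi-3 Mini tops this particular ratio.
for name, (mmlu, params) in sorted(models.items(), key=lambda kv: -kv[1][0] / kv[1][1]):
    print(f"{name}: {mmlu / params:.1f} MMLU pts / B params")
```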

4. Deployment Scenario-Driven

Small models care more about practical usability on edge devices and endpoints. So some reports will additionally test:

  • Inference speed and memory usage
  • Accuracy retention after quantization
  • Performance in specific languages or domains

Full Comparison of Small Models from Major Vendors

The matrix below shows 9 representative models (4 frontier + 5 small) across 14 mainstream benchmarks. Cells marked N/R indicate that the model did not report that score — pay special attention to these blank areas.

Model × Benchmark Heat Matrix

Columns group into five dimensions: Knowledge (MMLU, MMLU-Pro, IFEval), Reasoning (GPQA Diamond, MATH-500, AIME 2024, BBH, ARC-C), Code (HumanEval, SWE-bench Verified, LiveCodeBench), Agent (BFCL, GAIA), and Preference (Arena ELO).

| Model | MMLU | MMLU-Pro | IFEval | GPQA Diamond | MATH-500 | AIME 2024 | BBH | ARC-C | HumanEval | SWE-bench Verified | LiveCodeBench | BFCL | GAIA | Arena ELO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frontier models | | | | | | | | | | | | | | |
| GPT-4o | 88.7 | 72.6 | 84.3 | 56.1 | 76.6 | 9.3 | 83.6 | 96.4 | 90.2 | 38.4 | N/R | 88.5 | 40.5 | 1285 |
| Claude 3.5 Sonnet | 88.7 | 78 | 88 | 65 | 78.3 | N/R | 93.1 | N/R | 92 | 49 | N/R | N/R | N/R | 1271 |
| Gemini 1.5 Pro | 85.9 | 69 | N/R | 46.2 | 67.7 | N/R | 89.2 | N/R | 84.1 | N/R | N/R | N/R | N/R | 1260 |
| Llama 3.1 405B | 87.3 | 73.3 | 88.6 | 50.7 | 73.8 | N/R | 85.9 | 96.9 | 89 | 33.2 | N/R | N/R | N/R | 1253 |
| Small models (≤10B) | | | | | | | | | | | | | | |
| Gemma 2 9B | 71.3 | N/R | N/R | N/R | 36.6 | N/R | 68.2 | 68.4 | 40.2 | N/R | N/R | N/R | N/R | 1187 |
| Phi-3 Mini 3.8B | 70.9 | N/R | N/R | 30.6 | N/R | N/R | 73.5 | 86.3 | 57.3 | N/R | N/R | N/R | N/R | N/R |
| Qwen 2.5 7B | 74.2 | 56.3 | 71.2 | 36.4 | 75.5 | N/R | 70.4 | N/R | 84.8 | N/R | N/R | N/R | N/R | N/R |
| Llama 3.1 8B | 69.4 | 48.3 | 80.4 | 30.4 | 51.9 | N/R | 64.2 | 83.4 | 72.6 | N/R | N/R | N/R | N/R | 1176 |
| Mistral 7B | 62.5 | N/R | N/R | N/R | N/R | N/R | N/R | 78.5 | 32.9 | N/R | N/R | N/R | N/R | 1072 |

N/R = not reported (often hinting at a weak spot).

Reading tip: in the original interactive version of this matrix, colors are normalized independently within each column — green marks a relatively high score within that column, not an absolute level — so colors should never be compared across columns.
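The per-column normalization is just min-max scaling within each benchmark column, skipping N/R cells; a minimal sketch:

```python
def normalize_column(scores):
    """Min-max normalize one benchmark column, ignoring N/R (None) cells."""
    reported = [s for s in scores if s is not None]
    lo, hi = min(reported), max(reported)
    span = hi - lo or 1.0  # avoid division by zero when all scores are equal
    return [None if s is None else (s - lo) / span for s in scores]

# HumanEval column for the five small models
# (Gemma 2 9B, Phi-3 Mini, Qwen 2.5 7B, Llama 3.1 8B, Mistral 7B):
print(normalize_column([40.2, 57.3, 84.8, 72.6, 32.9]))
# Qwen 2.5 7B maps to 1.0 (column max), Mistral 7B to 0.0 (column min)
```

This is why a deep green cell in a weak column (e.g. small-model HumanEval) can represent a far lower absolute score than a pale cell in a strong column.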

Gemma 2 9B (Google)

Google’s Gemma 2 reports a relatively traditional set of benchmarks: MMLU, ARC-C, BBH (BIG-Bench Hard), HumanEval, and MATH. Notably, Gemma does not report MMLU-Pro, GPQA Diamond, or IFEval. Its HumanEval score of only 40.2% is significantly lower than the similarly-sized Qwen 2.5 7B (84.8%) and Llama 3.1 8B (72.6%) — code generation is a clear weakness for Gemma 2 9B.

Phi-3 Mini 3.8B (Microsoft)

Microsoft’s Phi-3 achieves an impressive MMLU of 70.9% with only 3.8B parameters — nearly matching the 9B-class Gemma 2 (71.3%). Phi-3’s ARC-C of 86.3% is the highest among small models, and its BBH of 73.5% is also notable. However, Phi-3 does not report MMLU-Pro, IFEval, or MATH-500, and its HumanEval is only 57.3%. Microsoft’s narrative is “achieving high-quality reasoning with minimal parameters,” but the code and instruction-following dimensions are intentionally downplayed.

Qwen 2.5 7B (Alibaba)

Qwen 2.5 is one of the most comprehensively reported small models — covering MMLU, MMLU-Pro, IFEval, GPQA Diamond, MATH-500, BBH, and HumanEval. Particularly outstanding is MATH-500 at 75.5%, far ahead among small models and even approaching frontier-model levels (GPT-4o is 76.6%). HumanEval at 84.8% is also the highest among small models. Qwen’s weakness is GPQA Diamond (36.4%), but at least it chose to report rather than hide the score.

Llama 3.1 8B (Meta)

Meta’s Llama 3.1 8B benefits from the open-source ecosystem and is the most thoroughly tested model by third parties. The official report covers MMLU, MMLU-Pro, GPQA Diamond, MATH-500, BBH, HumanEval, IFEval, and ARC-C — nearly the most comprehensive coverage among small models. Scores are balanced but without particularly outstanding areas: MMLU at 69.4% (only above Mistral 7B’s 62.5%), though IFEval at 80.4% is relatively strong.

Mistral 7B (Mistral AI)

As an earlier model, Mistral 7B reports the fewest benchmarks — only MMLU (62.5%), ARC-C (78.5%), and HumanEval (32.9%). The large number of “N/R” entries reflects the fact that when Mistral 7B was released (September 2023), the benchmark standard set had not yet formed. This also illustrates the temporal evolution of the standard set: scores that didn’t need to be reported in 2023 had become mandatory by 2024.

Three Major Pitfalls of Score Comparability

When comparing numbers in the matrix, there are several critical comparability issues to be aware of:

1. Prompt Template Differences

The same benchmark can yield 3-5% score differences depending on the prompt template. For example, the classic MMLU question format:

The following is a multiple choice question...
A. ...  B. ...  C. ...  D. ...
Answer:

But some vendors add a system prompt, some adjust the option format, and some use a chat template instead of a raw prompt. The importance of HuggingFace’s Open LLM Leaderboard lies precisely in the fact that it standardizes the evaluation pipeline.
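To make the template sensitivity concrete, here are two plausible formattings of the same MMLU-style item — both hypothetical illustrations, not any vendor's exact template — which a model may score measurably differently on:

```python
question = "Which planet is largest?"
options = ["Mercury", "Jupiter", "Mars", "Venus"]

# Variant 1: classic raw-completion prompt, one option per line
raw = "The following is a multiple choice question...\n"
raw += question + "\n"
raw += "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
raw += "\nAnswer:"

# Variant 2: chat-style, with a system prompt and options inlined
chat = [
    {"role": "system", "content": "Answer with the letter of the correct option."},
    {"role": "user", "content": f"{question} Options: " + " | ".join(options)},
]

print(raw)
```

Both encode identical information, yet differ in system prompt, option layout, and completion target — exactly the degrees of freedom behind the 3-5% score swings described above.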

2. Inconsistent Few-Shot Counts

MMLU has two common protocols — 0-shot and 5-shot — with score differences of 3-8%. When you see “MMLU: 88%,” you must verify which protocol was used. The scores in our matrix use each model’s official reported values and protocols wherever possible, but protocols across different models may not be fully consistent — this is also why our colors represent relative ranking within each column, not absolute value comparisons.
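The shot-count difference is simply how many solved examples are prepended before the test question; a minimal sketch of a k-shot prompt builder (the demo items are made up):

```python
def build_prompt(examples, test_q, k=5):
    """Prepend k solved examples before the test question (k=0 -> zero-shot)."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples[:k])
    tail = f"Q: {test_q}\nA:"
    return f"{shots}\n\n{tail}" if shots else tail

demos = [("2+2?", "4"), ("Capital of France?", "Paris")]
zero_shot = build_prompt(demos, "3+5?", k=0)
two_shot = build_prompt(demos, "3+5?", k=2)
print(len(zero_shot), len(two_shot))  # the few-shot prompt is much longer
```

The extra demonstrations both prime the answer format and consume context, which is why 0-shot and 5-shot numbers for the same model are not comparable.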

3. Evaluation Tool Versions

Different versions of lm-evaluation-harness implement certain benchmarks differently. In particular, the upgrade from Harness v0.3 to v0.4 changed the prompt templates for multiple tasks. If you want to make strict apples-to-apples comparisons, be sure to use the same version of the harness and re-evaluate from scratch.

Practical advice: Don’t obsess over fractions of a point. If two models differ by less than 2 percentage points on a given benchmark, they can essentially be considered “on par.” What truly matters is differences of 5 points or more and the overall coverage of capabilities.
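The “under 2% is a tie” heuristic is backed by simple binomial noise. Assuming independent questions, a model scoring 75% on a 500-question set (e.g. MATH-500) carries roughly ±3.8 points of sampling error at 95% confidence, so a 2-point gap is well within noise:

```python
from math import sqrt

def ci95_halfwidth(acc: float, n: int) -> float:
    """95% normal-approximation half-width for accuracy measured on n questions."""
    return 1.96 * sqrt(acc * (1 - acc) / n)

# MATH-500: a model scoring 75% has roughly +/- 3.8 points of pure sampling noise
print(round(100 * ci95_halfwidth(0.75, 500), 1))
```

Comparing two models doubles the variance, making small gaps even less meaningful — another reason to weight 5-point differences far more heavily than 1-point ones.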

Looking Ahead: The Future of the Standard Set

The current benchmark standard set is still evolving rapidly. Several trends worth watching:

  1. MMLU’s exit: Due to data quality issues and saturation effects, MMLU is being replaced by MMLU-Pro. Open LLM Leaderboard v2 has already swapped MMLU for MMLU-Pro
  2. The rise of agent evaluation: As LLMs move from “answering questions” to “executing tasks,” benchmarks like BFCL, GAIA, and SWE-bench are becoming increasingly important
  3. Dynamic benchmarks becoming standard: The dynamic update strategies of LiveCodeBench and LiveBench are becoming the standard approach for contamination prevention
  4. Multimodal expansion: Multimodal benchmarks like MMMU and MathVista are starting to appear in technical reports

Next Steps

This article established a comprehensive understanding of model release benchmarks. The next article, Impact of Optimization on Accuracy, examines benchmark scores from a different angle — when we apply optimizations like quantization and distillation to models, the accuracy loss varies dramatically across different benchmarks. This is critical for choosing edge deployment strategies.