Anatomy of Model Release Benchmark Standard Sets
Updated 2026-04-14
Opening: Why Are Releases Always About the Same Benchmarks?
Every time a new model is released, the technical report includes a big table — MMLU score here, HumanEval score there, GPQA Diamond score over there… If you look closely, you’ll notice that the benchmarks different vendors choose to report overlap heavily, yet are never quite identical. Two key questions lurk behind this pattern:
- The shared choices: Why these benchmarks and not others? How did they become the “standard set”?
- The deliberate omissions: When a model doesn’t report a certain score, it’s usually not an oversight — the score just doesn’t look good. What you don’t report is just as telling as what you do.
This article systematically examines the current standard set of benchmarks for model releases, analyzes how evaluation frameworks differ between frontier models and small models, and provides an interactive matrix that lets you see each model’s strengths, weaknesses, and “blank spots” at a glance.
Timeliness disclaimer: This article is based on model releases up to early 2025. The benchmark ecosystem evolves rapidly, and specific scores and popular evaluation suites change over time. Our focus is on the selection logic and analytical methods, not on tracking the latest rankings.
Evolution of the Standard Set: From the “Big Four” to Full Coverage
2023: The “Big Four” Era
In the early days of the LLM boom sparked by ChatGPT, model technical reports typically only needed to report four core benchmarks:
| Benchmark | What It Tests | Why It’s Standard |
|---|---|---|
| MMLU | Knowledge breadth across 57 subjects | The most widely cited knowledge baseline |
| HumanEval | Function-level code generation | The pass@1 paradigm is clean and unambiguous |
| GSM8K | Elementary math reasoning | The “entry exam” for reasoning ability |
| HellaSwag | Commonsense reasoning / language understanding | NLU baseline, high discriminating power in the early days |
These four benchmarks plus Chatbot Arena’s Elo rating essentially formed a model’s “resume.”
2024: Expansion and Divergence
As model capabilities improved, the original benchmarks began to saturate (ceiling effect). GPT-4-class models hit 90%+ on both MMLU and GSM8K, and discriminating power dropped sharply. In response:
- MMLU → MMLU-Pro: Expanded from 4 options to 10, introduced stronger reasoning requirements, reduced prompt sensitivity from 4-5% to 2%
- GSM8K → MATH-500 / AIME: From elementary math to competition-level, re-opening the gap
- HumanEval → SWE-bench Verified: From function-level to project-level, testing real software engineering ability
- New Agent dimension: BFCL (function calling) and GAIA (multi-step interaction) started appearing
- New IFEval: A dedicated test for instruction-following ability
2025: The Current Standard
As of now, a frontier model release must cover at minimum the following dimensions:
| Dimension | Must-Report Benchmarks | Nice-to-Have |
|---|---|---|
| Knowledge & Instruction | MMLU / MMLU-Pro | IFEval |
| Reasoning | GPQA Diamond, MATH-500 | AIME, BBH |
| Code | HumanEval, SWE-bench Verified | LiveCodeBench |
| Agent | BFCL or similar | GAIA, WebArena |
| Preference | Chatbot Arena Elo | MT-Bench |
This is the “benchmark standard set” — it wasn’t mandated by any institution but emerged naturally through competitive equilibrium: whatever you don’t report, your competitors will analyze why you didn’t.
Frontier Model Comparison: The Must-Report Intersection and Strategic Omissions
The Shared Must-Report Items Across Four Players
Analyzing the technical reports of four frontier models — GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 1.5 Pro (Google), and Llama 3.1 405B (Meta) — we can extract the benchmarks that all four report:
- MMLU (or MMLU-Pro): The “consensus baseline” for knowledge breadth
- GPQA Diamond: The new gold standard for PhD-level reasoning
- MATH (or MATH-500): The core metric for mathematical reasoning
- HumanEval: The “de facto standard” for code generation
- BBH: An important supplement for comprehensive reasoning
These five form the current minimum must-report set for frontier models.
“What’s Not Reported” Analysis
Even more interesting are each player’s strategic omissions:
Claude 3.5 Sonnet doesn’t report AIME or ARC-C. Claude leads SWE-bench by a wide margin at 49.0% (vs. GPT-4o’s 38.4%), yet gives no AIME score, hinting that it may lack an edge on the hardest competition math. It also skips BFCL, an implicit signal on the function-calling dimension: “we focus more on code and reasoning.”
Gemini 1.5 Pro doesn’t report specific scores for some benchmarks, particularly SWE-bench and GAIA. Google’s strategy emphasizes multimodality and long context (million-token context window) rather than competing head-to-head on text benchmarks. However, the Gemini 1.5 technical report does include evaluation data for BFCL and IFEval.
GPT-4o has the most comprehensive coverage — it reports scores for nearly all mainstream benchmarks, reflecting OpenAI’s confidence as the industry standard-bearer: there are no weaknesses that “need to be hidden.”
Llama 3.1 405B, as an open-source model, reports a very comprehensive set of scores, but is missing AIME and GAIA. Open-source models have a unique advantage: even if you don’t report a score, the community will run the benchmarks for you.
Key insight: When you see a model’s evaluation report, first count which benchmarks it reports, then think about which ones it didn’t. The omissions are information in themselves.
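To make this habit concrete, here is a minimal sketch of an “omission audit” in Python. The standard set mirrors the dimension table above; the reported list is a hypothetical release, not a transcription of any vendor’s actual report.

```python
# A minimal sketch of the "audit the omissions" habit. STANDARD_SET mirrors
# the must-report dimensions from the table above; the `reported` set is a
# hypothetical example, not any vendor's actual report.
STANDARD_SET = {
    "Knowledge": {"MMLU", "MMLU-Pro"},
    "Reasoning": {"GPQA Diamond", "MATH-500"},
    "Code": {"HumanEval", "SWE-bench Verified"},
    "Agent": {"BFCL"},
    "Preference": {"Chatbot Arena Elo"},
}

def audit_omissions(reported: set) -> dict:
    """Return, per dimension, the must-report benchmarks a release skipped."""
    return {
        dim: missing
        for dim, required in STANDARD_SET.items()
        if (missing := required - reported)
    }

# Hypothetical release that reports everything except the Agent dimension:
reported = {"MMLU", "MMLU-Pro", "GPQA Diamond", "MATH-500",
            "HumanEval", "SWE-bench Verified", "Chatbot Arena Elo"}
print(audit_omissions(reported))  # {'Agent': {'BFCL'}}
```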
Small Model Evaluation: Different Rules of the Game
Frontier models and small models (≤10B parameters) face different levels of competition. Small model evaluation has several key differences:
1. More Conservative Benchmark Selection
Small models typically don’t report benchmarks like SWE-bench or GAIA that require complex multi-step reasoning. This isn’t concealment: such tasks are simply too hard for small models, and a single-digit score carries little comparative information.
2. Different Competitors
The comparison targets for small models are other small models in the same tier, not GPT-4o. So in the Gemma 2 9B report, you’ll see comparisons with Llama 3 8B and Mistral 7B, not with Claude 3.5 Sonnet.
3. “Efficiency Ratio” as the Core Narrative
The selling point for small models isn’t having the highest absolute scores, but rather “achieving a 30B model’s performance with only 9B parameters.” So the evaluation focuses on (see the sketch after this list):
- Who’s best within the same parameter tier
- How much improvement there is over the previous generation at the same tier
- Which tasks have the best “performance-per-parameter”
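As a crude back-of-the-envelope illustration, the “performance-per-parameter” framing boils down to dividing a score by parameter count and ranking within a tier. The model names, scores, and parameter counts below are placeholders, not official figures, and capability does not actually scale linearly with parameters; treat this only as a sketch of the narrative.

```python
# Back-of-the-envelope "performance per parameter" comparison.
# All numbers are illustrative placeholders, not official scores.
models = {
    "small-9b": {"params_b": 9.0,  "mmlu": 71.0},
    "small-4b": {"params_b": 3.8,  "mmlu": 70.5},
    "mid-30b":  {"params_b": 30.0, "mmlu": 74.0},
}

# Rank by MMLU points per billion parameters (highest efficiency first).
ranked = sorted(models.items(),
                key=lambda kv: kv[1]["mmlu"] / kv[1]["params_b"],
                reverse=True)
for name, m in ranked:
    print(f"{name}: {m['mmlu'] / m['params_b']:.1f} MMLU points per B params")
```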
4. Deployment Scenario-Driven
Small models care more about practical usability on edge devices and endpoints. So some reports will additionally test:
- Inference speed and memory usage
- Accuracy retention after quantization
- Performance in specific languages or domains
Full Comparison of Small Models from Major Vendors
The interactive matrix below shows 9 representative models (4 frontier + 5 small) across 14 mainstream benchmarks. Gray striped cells indicate that the model did not report this score — pay special attention to these blank areas.
Model × Benchmark Heatmap Matrix
Usage tip: Hover over any cell to see the exact score and source. Click the toggle button to group by model family for side-by-side comparison of large and small models within the same family. Note that the color in each column is independently normalized — green indicates a relatively high score within that column.
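Under the hood, that per-column coloring amounts to min-max normalizing each benchmark column separately while keeping unreported cells out of the scale. A minimal sketch with placeholder scores (model names and values are illustrative, not the matrix’s actual data):

```python
# Per-column (per-benchmark) min-max normalization for heatmap colors.
# None marks a "not reported" cell; it stays out of the column's scale
# and renders as a gray stripe. Scores are placeholders.
scores = {
    "model-a": {"MMLU": 88.0, "HumanEval": 90.2},
    "model-b": {"MMLU": 69.4, "HumanEval": 72.6},
    "model-c": {"MMLU": 62.5, "HumanEval": None},  # None = N/R
}

benchmarks = {b for row in scores.values() for b in row}
normalized = {m: {} for m in scores}

for bench in benchmarks:
    col = [row[bench] for row in scores.values() if row.get(bench) is not None]
    lo, hi = min(col), max(col)
    for model, row in scores.items():
        v = row.get(bench)
        # 0.0 -> reddest, 1.0 -> greenest, None stays None (gray stripe).
        normalized[model][bench] = None if v is None else (
            0.5 if hi == lo else (v - lo) / (hi - lo)
        )
```

This is also why comparing colors across columns is meaningless: green in the GPQA Diamond column may correspond to a far lower absolute score than green in the MMLU column.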
Gemma 2 9B (Google)
Google’s Gemma 2 reports a relatively traditional set of benchmarks: MMLU, ARC-C, BBH (BIG-Bench Hard), HumanEval, and MATH. Notably, Gemma does not report MMLU-Pro, GPQA Diamond, or IFEval. Its HumanEval score of only 40.2% is significantly lower than the similarly-sized Qwen 2.5 7B (84.8%) and Llama 3.1 8B (72.6%) — code generation is a clear weakness for Gemma 2 9B.
Phi-3 Mini 3.8B (Microsoft)
Microsoft’s Phi-3 achieves an impressive MMLU of 70.9% with only 3.8B parameters — nearly matching the 9B-class Gemma 2 (71.3%). Phi-3’s ARC-C of 86.3% is the highest among small models, and its BBH of 73.5% is also notable. However, Phi-3 does not report MMLU-Pro, IFEval, or MATH-500, and its HumanEval is only 57.3%. Microsoft’s narrative is “achieving high-quality reasoning with minimal parameters,” but the code and instruction-following dimensions are intentionally downplayed.
Qwen 2.5 7B (Alibaba)
Qwen 2.5 is one of the most comprehensively reported small models — covering MMLU, MMLU-Pro, IFEval, GPQA Diamond, MATH-500, BBH, and HumanEval. Particularly outstanding is MATH-500 at 75.5%, far ahead among small models and even approaching frontier-model levels (GPT-4o is 76.6%). HumanEval at 84.8% is also the highest among small models. Qwen’s weakness is GPQA Diamond (34.2%), but at least it chose to report rather than hide the score.
Llama 3.1 8B (Meta)
Meta’s Llama 3.1 8B benefits from the open-source ecosystem and is the model most thoroughly tested by third parties. The official report covers MMLU, MMLU-Pro, GPQA Diamond, MATH-500, BBH, HumanEval, IFEval, and ARC-C, nearly the most comprehensive coverage among small models. Scores are balanced with no standout area: MMLU at 69.4% beats only Mistral 7B (62.5%) among the small models here, though IFEval at 80.4% is relatively strong.
Mistral 7B (Mistral AI)
As an earlier model, Mistral 7B reports the fewest benchmarks — only MMLU (62.5%), ARC-C (78.5%), and HumanEval (32.9%). The large number of “N/R” entries reflects the fact that when Mistral 7B was released (September 2023), the benchmark standard set had not yet formed. This also illustrates the temporal evolution of the standard set: scores that didn’t need to be reported in 2023 had become mandatory by 2024.
Three Major Pitfalls of Score Comparability
When comparing numbers in the matrix, there are several critical comparability issues to be aware of:
1. Prompt Template Differences
The same benchmark can yield 3-5% score differences depending on the prompt template. For example, the classic MMLU question format:
```
The following is a multiple choice question...
A. ... B. ... C. ... D. ...
Answer:
```
But some vendors add a system prompt, some adjust the option format, and some use a chat template instead of a raw prompt. This is precisely why Hugging Face’s Open LLM Leaderboard matters: it standardizes the evaluation pipeline so scores are comparable across models.
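To see how easily templates diverge, here are two renderings of the same made-up multiple-choice item. Neither is any vendor’s exact template; the point is simply that the model conditions on different strings, which is enough to shift scores by a few points.

```python
# Two renderings of one MMLU-style item (question and options are made up).
question = "Which gas makes up most of Earth's atmosphere?"
options = ["Oxygen", "Nitrogen", "Argon", "Carbon dioxide"]

# Variant 1: classic raw completion prompt, scored by the next token after "Answer:".
raw = (
    "The following is a multiple choice question.\n"
    f"{question}\n"
    + "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
    + "\nAnswer:"
)

# Variant 2: chat-style prompt with a system message and numbered options.
chat = [
    {"role": "system",
     "content": "You are a careful exam taker. Reply with the option number only."},
    {"role": "user",
     "content": f"{question}\n"
                + "\n".join(f"({i}) {opt}" for i, opt in enumerate(options, 1))},
]
```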
2. Inconsistent Few-Shot Counts
MMLU has two common protocols — 0-shot and 5-shot — with score differences of 3-8%. When you see “MMLU: 88%,” you must verify which protocol was used. The scores in our matrix use each model’s official reported values and protocols wherever possible, but protocols across different models may not be fully consistent — this is also why our colors represent relative ranking within each column, not absolute value comparisons.
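The mechanics are simple: a k-shot prompt just prepends k solved exemplars before the target question, so 0-shot and 5-shot runs hand the model materially different contexts. A sketch with placeholder items:

```python
# 0-shot vs. k-shot prompt assembly (exemplars here are placeholders).
def build_prompt(question, exemplars, k):
    shots = "\n\n".join(f"{q}\nAnswer: {a}" for q, a in exemplars[:k])
    prefix = shots + "\n\n" if k > 0 else ""
    return f"{prefix}{question}\nAnswer:"

exemplars = [("Q1 ...", "A"), ("Q2 ...", "C"), ("Q3 ...", "B"),
             ("Q4 ...", "D"), ("Q5 ...", "A")]
zero_shot = build_prompt("Q6 ...", exemplars, k=0)  # just the question
five_shot = build_prompt("Q6 ...", exemplars, k=5)  # 5 worked examples first
```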
3. Evaluation Tool Versions
Different versions of lm-evaluation-harness implement certain benchmarks differently. In particular, the upgrade from Harness v0.3 to v0.4 changed the prompt templates for multiple tasks. If you want to make strict apples-to-apples comparisons, be sure to use the same version of the harness and re-evaluate from scratch.
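A sketch of such a rerun, assuming lm-evaluation-harness v0.4’s Python API (the model IDs below are placeholders; pin the package version so both models pass through identical task definitions and prompt templates):

```python
# Apples-to-apples rerun with a pinned harness version.
# Assumes lm-evaluation-harness v0.4.x: pip install lm-eval==0.4.2
import lm_eval

for pretrained in ["org-a/model-a", "org-b/model-b"]:  # hypothetical model IDs
    results = lm_eval.simple_evaluate(
        model="hf",                               # HuggingFace backend
        model_args=f"pretrained={pretrained}",
        tasks=["mmlu"],
        num_fewshot=5,                            # same protocol for both runs
    )
    print(pretrained, results["results"].get("mmlu"))
```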
Practical advice: Don’t obsess over sub-point score differences. If two models differ by less than 2% on a given benchmark, they can essentially be considered “on par.” What truly matters is differences of 5% or more and the overall coverage of capabilities.
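A quick way to calibrate that threshold is the binomial standard error of an accuracy estimate, sqrt(p(1-p)/n). On a small set like GPQA Diamond (198 questions), one standard error near 50% accuracy is about 3.6 points, so a 2-point gap sits comfortably within noise; on MMLU’s roughly 14,000 test questions it shrinks to about 0.4 points.

```python
# Rough binomial standard error of a benchmark accuracy: sqrt(p*(1-p)/n).
from math import sqrt

def accuracy_se(p, n):
    """One standard error of an accuracy p measured over n questions."""
    return sqrt(p * (1 - p) / n)

print(f"GPQA Diamond (n=198),   p=0.50: +/- {100 * accuracy_se(0.50, 198):.1f} pts")
print(f"MMLU (n=14042),         p=0.70: +/- {100 * accuracy_se(0.70, 14042):.1f} pts")
```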
Looking Ahead: The Future of the Standard Set
The current benchmark standard set is still evolving rapidly. Several trends worth watching:
- MMLU’s exit: Due to data quality issues and saturation effects, MMLU is being replaced by MMLU-Pro. Open LLM Leaderboard v2 has already swapped MMLU for MMLU-Pro
- The rise of agent evaluation: As LLMs move from “answering questions” to “executing tasks,” benchmarks like BFCL, GAIA, and SWE-bench are becoming increasingly important
- Dynamic benchmarks becoming standard: The dynamic update strategies of LiveCodeBench and LiveBench are becoming the standard approach for contamination prevention
- Multimodal expansion: Multimodal benchmarks like MMMU and MathVista are starting to appear in technical reports
Next Steps
This article established a comprehensive understanding of model release benchmarks. The next article, Impact of Optimization on Accuracy, examines benchmark scores from a different angle — when we apply optimizations like quantization and distillation to models, the accuracy loss varies dramatically across different benchmarks. This is critical for choosing edge deployment strategies.