
Code Benchmarks

Updated 2026-04-14

Opening: Models Claim They Can Write Code — From Function Completion to Fixing GitHub Issues, How Are These Abilities Evaluated?

When OpenAI released Codex in 2021 and reported a HumanEval pass@1 of 28.8%, code generation evaluation gained its standard paradigm. Four years later, as Anthropic and OpenAI compete fiercely on SWE-bench Verified with top systems exceeding a 70% resolved rate, the meaning of evaluation has evolved from “can it write a function” to “can it fix bugs in real open-source projects.”

In the previous article, we dove into knowledge and reasoning benchmarks. Now we enter code evaluation — the dimension of LLM evaluation closest to “real-world productivity.” This article will answer:

  1. From HumanEval to SWE-bench, how have code benchmarks evolved?
  2. What does pass@k actually mean? How is it calculated?
  3. HumanEval is already saturated — why can HumanEval+ still find problems?
  4. What exactly does SWE-bench’s agent framework evaluate? Why does the same model score so differently under different frameworks?

Evolution: From Function Completion to Project-Level Software Engineering

The evolution of code benchmarks follows a clear throughline: evaluation granularity has continuously increased.

In 2021, HumanEval and MBPP established the evaluation standard for function-level code completion. The model receives a function signature and docstring, generates the function body, and correctness is verified through test cases. This is the most basic “can it write code?” test.

In 2023, HumanEval+ discovered that the original HumanEval’s test cases were severely insufficient — after adding approximately 80x more test cases, many “passing” code samples turned out to be incorrect. This revealed a key issue: evaluation quality depends on test quality.

In 2024, the evaluation dimension underwent a qualitative leap. SWE-bench elevated granularity from individual functions to entire code repositories: models need to understand real open-source projects, locate bugs, and generate cross-file fixes. That same year, LiveCodeBench and BigCodeBench extended evaluation coverage from two different directions — dynamic updates and complex API calls, respectively.

Timeline: code benchmark evolution, 2021-2025 (lines indicate inheritance relationships). Nodes and representative top scores: HumanEval (2021, pass@1 >90%), MBPP (2021, pass@1 >85%), HumanEval+ (2023, pass@1 ~85%), SWE-bench (2024, resolved ~50%), SWE-bench Verified (2024, resolved ~77%), LiveCodeBench (2024, pass@1 ~65%), BigCodeBench (2024, pass@1 ~62%).

Reading the timeline: lines indicate inheritance relationships. For example, HumanEval+ inherits from HumanEval, and BigCodeBench inherits MBPP’s “broad coverage” philosophy. Note the explosive growth in 2024: four new benchmarks appeared at once, reflecting the community’s strong demand for more realistic evaluation.

The Divergence of Evaluation Methods

Code evaluation methods have gone through three generations:

First Generation: Text Matching (Obsolete)

Early attempts used BLEU (a machine translation metric) to evaluate code. The problem is obvious: a for loop and an equivalent list comprehension share almost no surface text yet are functionally identical, while an off-by-one variant like range(n-1) instead of range(n) is textually near-identical yet wrong. Text matching cannot measure the semantic correctness of code and has been abandoned by the field.

Second Generation: Execution-Based Verification (Current Mainstream)

HumanEval pioneered the “execution-based verification” paradigm: generated code runs test cases in a sandboxed environment, and it’s considered correct if all tests pass. The core advantage of this approach is that it’s objective, reproducible, and style-independent.

The Meaning and Calculation of pass@k

pass@k is the most important metric in code evaluation, but it’s frequently misunderstood.

Intuitive understanding: The model generates k candidate solutions for the same problem; as long as at least one passes all test cases, it counts as a “pass.”

Formal definition (from the HumanEval paper, Chen et al., 2021):

$$\text{pass@}k = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]$$

Here $n$ is the total number of samples per problem ($n \geq k$), and $c$ is the number of samples that pass the tests. This formula is an unbiased estimator that replaces naive repeated sampling, avoiding its high variance.
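In code, the estimator is usually computed in the numerically stable product form used by the HumanEval paper. A minimal sketch (the function name is ours):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for a single problem.
    n: total samples generated, c: samples passing all tests, k: budget.
    Averaging this value over all problems yields the benchmark score."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    # 1 - C(n-c, k) / C(n, k), expanded as a product to avoid huge binomials
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

For k = 1 this reduces to c/n: with n = 200 samples of which c = 50 pass, pass@1 is exactly 0.25.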

Practical meaning:

  • pass@1: The probability of the model getting it right on the first try. This is the strictest metric and the most commonly reported today
  • pass@10: At least one correct out of 10 attempts. Reflects the model’s “upper bound” capability
  • pass@100: At least one correct out of 100 attempts. Commonly reported in early papers, but less informative now that pass@1 is already high on these benchmarks

Third Generation: Repo-Level Test Verification (Emerging)

SWE-bench elevates verification from “running custom test cases” to “running the project’s own test suite.” This is closer to real development: instead of evaluators writing tests, the project’s own tests determine whether a fix is correct.

Multi-Language Coverage

Early code benchmarks were almost exclusively Python. This poses two problems:

  1. Language bias: A model’s capability in Python may not generalize to other languages
  2. Training data bias: If a model is trained most heavily on Python code, Python benchmarks will overestimate its overall coding ability

MultiPL-E (Cassano et al., 2023) translated HumanEval and MBPP into 18 programming languages (including Java, C++, Rust, Go, TypeScript, etc.), revealing cross-language gaps. For example, some models achieve over 80% Python pass@1 but only 40% in Rust.

Aider Polyglot (community-driven) tests multi-language capabilities in more practical code editing scenarios, using Exercism platform exercises covering multiple programming languages and testing language-specific idioms.

Deep Dive 1: HumanEval

Why HumanEval as a Deep Dive Subject?

HumanEval is the “genesis” benchmark of code evaluation. It defined the pass@k paradigm, and virtually all subsequent code benchmarks have inherited its evaluation methodology. Understanding HumanEval’s design, strengths, and weaknesses is foundational to understanding the entire code evaluation spectrum.

Dataset Composition

HumanEval (Chen et al., 2021) contains 164 Python function completion problems, handwritten by OpenAI engineers. Each problem includes:

| Component | Description | Example |
| --- | --- | --- |
| Function signature | Complete type annotations | def has_close_elements(numbers: List[float], threshold: float) -> bool: |
| Docstring | Natural language description of the function's behavior | Includes input/output examples |
| Test cases | Approximately 7.7 assert statements on average | Covering basic and edge cases |
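For concreteness, below is the prompt of the dataset’s first problem (HumanEval/0, docstring abridged); the model must generate the function body, which is then checked against assert-based tests:

```python
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if in given list of numbers, are any two numbers closer
    to each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # <-- the model's completion is inserted here
```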

Problem difficulty ranges from simple string operations (“reverse a string”) to those requiring algorithmic thinking (“check if parentheses are balanced”), overall leaning toward introductory level.

Evaluation Process

  1. The model receives the function signature + docstring as the prompt
  2. The model generates the function body (can sample multiple times)
  3. The generated code is concatenated with the function signature into a complete file
  4. Test cases are run in a sandbox
  5. pass@k is computed
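A minimal sketch of steps 3-5 (simplified: the official harness adds stricter sandboxing and resource limits; the field names "prompt", "test", and "entry_point" follow the released dataset):

```python
import subprocess, sys, tempfile

def check_completion(problem: dict, completion: str, timeout: float = 10.0) -> bool:
    """Concatenate prompt + generated body + tests and run in a subprocess.
    problem["test"] defines check(candidate); exit code 0 means all
    assertions passed."""
    program = (
        problem["prompt"] + completion + "\n"
        + problem["test"] + "\n"
        + f"check({problem['entry_point']})\n"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        run = subprocess.run([sys.executable, path], capture_output=True,
                             timeout=timeout)
        return run.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # infinite loops count as failures
```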

Known Limitations

HumanEval’s limitations are very apparent today:

  • Too few problems: Only 164, leading to high statistical variance. Changing the random seed can cause 2-3% score fluctuation
  • Single language: Python only, cannot evaluate cross-language ability
  • Low difficulty: Current top models exceed 90% pass@1, completely saturated
  • Insufficient tests: An average of only 7.7 test cases per problem, insufficient for thorough correctness verification

HumanEval+ Improvements

The EvalPlus project (Liu et al., 2023) systematically addressed the last issue:

  • Test case count: Increased from an average of 7.7 per problem by roughly 80x (to over 600 per problem), generated automatically using LLM- and mutation-based methods (sketched after this list)
  • Effect: On HumanEval+, model pass@k scores dropped by 19.3% to 28.9%. A large number of code samples that “passed” on HumanEval revealed errors under stricter testing
  • Ranking changes: Some models that ranked lower on HumanEval actually surpassed originally higher-ranked models on HumanEval+ — indicating false positives in the original evaluation
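The flavor of this test generation can be sketched as type-aware input mutation plus differential testing against the canonical solution. The sketch below is illustrative only (function names are ours; the real pipeline also seeds inputs with an LLM and handles more types):

```python
import copy, random

def mutate(value):
    """Type-aware mutation: perturb one input value toward edge cases."""
    if isinstance(value, bool):
        return not value
    if isinstance(value, int):
        return value + random.choice([-1, 1, value, -value or 1])
    if isinstance(value, float):
        return value * random.choice([-1.0, 0.0, 2.0])
    if isinstance(value, str):
        return value[:-1] if value and random.random() < 0.5 else value + "a"
    if isinstance(value, list):
        out = copy.deepcopy(value)
        out.append(mutate(out[-1]) if out else 0)
        return out
    return value

def falsify(candidate, reference, seed_args, rounds=500):
    """Differential testing: keep a mutated input only if the canonical
    solution handles it, then require the candidate to agree."""
    for _ in range(rounds):
        args = [mutate(a) for a in copy.deepcopy(random.choice(seed_args))]
        try:
            expected = reference(*args)  # filter inputs the task leaves undefined
        except Exception:
            continue
        try:
            if candidate(*args) != expected:
                return args  # counterexample found: candidate is wrong
        except Exception:
            return args
    return None  # no disagreement observed
```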

The implication is profound: a benchmark’s discriminative power depends on test quality, not problem count.

Deep Dive 2: SWE-bench

Why SWE-bench?

SWE-bench represents the cutting edge of code evaluation — project-level software engineering capability. It no longer tests “can it write a function” but rather “can it fix real bugs like a software engineer.” Between 2024 and 2025, SWE-bench (particularly the Verified subset) has become the most important arena for AI coding capability.

Dataset Composition

SWE-bench (Jimenez et al., 2024) collected 2,294 real GitHub issues and their corresponding pull requests from 12 popular Python open-source projects (Django, Flask, scikit-learn, SymPy, Matplotlib, etc.).

| Subset | Count | Characteristics |
| --- | --- | --- |
| SWE-bench (Full) | 2,294 | Complete dataset, spanning a wide range of difficulty |
| SWE-bench Lite | 300 | Functional bug-fix subset, excluding documentation/refactoring tasks |
| SWE-bench Verified | 500 | Human-verified high-quality subset, currently the most authoritative reporting standard |

SWE-bench Verified was created to address quality issues in the full set: some issue descriptions are unclear, test coverage is incomplete, or external dependencies are required. Human verification ensures that each instance is “fairly evaluable.”

Evaluation Process

SWE-bench’s evaluation process is far more complex than HumanEval’s. The flow diagram below shows how an agent handles a SWE-bench instance:

SWE-bench evaluation flow: 1. Issue description input (from a real GitHub issue) → 2. Search and locate (the agent autonomously explores the codebase) → 3. Code context understanding (grasp the logic of the relevant code) → 4. Patch generation (a fix in git diff format) → 5. Test verification (run the project's original test suite).

Note step 2, “Search and locate”: the strategy chosen at this step has a massive impact on the final result, and it is where different agent frameworks diverge most.
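Step 5 can be sketched in a few lines (simplified: the real harness builds per-instance Docker environments with per-repo test commands, while here we assume a pytest-based project; the FAIL_TO_PASS / PASS_TO_PASS names come from the released dataset):

```python
import subprocess

def resolved(repo_dir: str, model_patch: str,
             fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Apply the model's git diff, then require that (a) the tests the
    gold PR fixed now pass and (b) previously passing tests still pass."""
    apply = subprocess.run(["git", "apply", "-"], input=model_patch.encode(),
                           cwd=repo_dir)
    if apply.returncode != 0:
        return False  # a patch that does not apply counts as unresolved
    for test in fail_to_pass + pass_to_pass:
        run = subprocess.run(["python", "-m", "pytest", "-x", test],
                             cwd=repo_dir, capture_output=True)
        if run.returncode != 0:
            return False
    return True
```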

The Role of Agent Frameworks

A key characteristic of SWE-bench is that model scores are heavily dependent on the agent framework.

What is an agent framework (harness)? It is the “scaffolding” connecting the LLM to the code repository:

  • Defines which tools the model can use (file search, grep, bash commands, etc.)
  • Specifies the interaction format (how search results are passed, how patches are submitted)
  • Designs the strategy (search first or locate first, how many files to search, how to handle errors, etc.)

The same model (e.g., Claude 3.5 Sonnet) under different agent frameworks can differ in resolved rate by over 20 percentage points. This means SWE-bench scores actually evaluate the combined capability of “model + agent framework”, not purely the model’s coding ability.
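In pseudocode, a harness is little more than a tool-use loop around the model; the tool set, prompt protocol, and step budget are all framework design decisions. The sketch below is illustrative only: llm.next_action and the repo helper are hypothetical, not any specific framework’s API.

```python
def run_agent(llm, repo, issue: str, max_steps: int = 30) -> str:
    """Minimal agent loop: the LLM picks a tool, observes the result,
    and eventually submits a patch."""
    tools = {
        "grep": repo.grep,   # search the codebase for a string
        "open": repo.read,   # read a file
        "edit": repo.apply,  # edit the working tree
    }
    history = [f"Issue:\n{issue}"]
    for _ in range(max_steps):
        action = llm.next_action(history, tools=list(tools))  # hypothetical API
        if action.name == "submit":
            break
        observation = tools[action.name](action.argument)
        history.append(f"{action.name}({action.argument!r}) -> {observation}")
    return repo.diff()  # the candidate patch, in git diff format
```

Everything outside the model call (which tools exist, how observations are truncated, when to stop) is harness engineering, which is exactly why the same model scores differently under different frameworks.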

The Current SWE-bench Landscape

As of early 2026, the top performance on SWE-bench Verified has reached approximately 77% (Claude Opus 4.5 with a high-reasoning configuration). The pace of progress has been remarkable: when SWE-bench Verified launched in mid-2024, the best system achieved about 33% (Claude 3.5 Sonnet), so the state of the art has more than doubled in roughly a year and a half.

But this number requires careful interpretation:

  • 77% was achieved under the optimal agent framework + high reasoning configuration
  • Under a simpler agent framework, the same model’s score can drop below 50%
  • The Verified subset was human-curated, so its difficulty distribution differs from the distribution of bugs in real-world development

Known Controversies

Several noteworthy discussions exist within the SWE-bench community:

  1. Agent framework differences: Leaderboard scores include contributions from framework engineering, making it difficult to isolate pure model capability. Some claims of “model A surpasses model B” may simply reflect framework differences
  2. Data leakage risk: SWE-bench test instances come from public GitHub repositories and could theoretically be covered by model training data. The Verified subset partially mitigates this concern
  3. Metric limitations: Resolved rate is a binary metric (resolved/not resolved) and cannot measure “nearly correct” patches. A patch that’s off by one line and a completely wrong patch receive the same zero score
  4. Repository coverage bias: Only 12 Python repositories, all well-known open-source projects. Generalization to smaller projects, non-Python languages, and private codebases is unknown

Transition: The Intersection of Code and Agent Evaluation

SWE-bench already stands at the intersection of code evaluation and agent evaluation — it tests not just “writing code” but the end-to-end agent capability of “autonomously searching, understanding, deciding, and fixing.”

This trend has become increasingly pronounced since 2024: pure “function completion” has saturated, and new code evaluations increasingly resemble agent evaluations — requiring tool use, multi-step reasoning, and environment interaction.

The next article, Agent Benchmark Deep Dive, sets out from this intersection and dives into benchmarks designed specifically to evaluate agent capabilities, such as BFCL, GAIA, and WebArena, exploring evaluation methods for function calling, web navigation, and complex task planning.