SWE-bench Practical Guide
Updated 2026-04-16
The SWE-bench section in Code Benchmarks placed SWE-bench in the landscape of code evaluation: the jump from HumanEval-style function completion to repairing real GitHub issues at project scale was the paradigm shift of 2023 code evaluation. This article covers the practical side: how the Docker-based evaluation pipeline works, what the patch format requires, how agents integrate, and how to set up the environment. By the end, you should be able to run a SWE-bench evaluation, understand how each instance is judged, and diagnose common failure modes.
1 The Evaluation Pipeline
The evaluation harness runs a five-stage flow:
1.1 Input Preparation
Load the dataset from HuggingFace. Two namespaces exist: `SWE-bench/SWE-bench_Verified` is the new primary repository; `princeton-nlp/SWE-bench_Verified` is the legacy namespace (still accessible). Each instance carries these key fields:
- `problem_statement` — the issue text, a human description of a bug or feature request
- `base_commit` — git commit hash of the code snapshot
- `repo` — full repo name (e.g. `django/django`)
- `FAIL_TO_PASS` — tests that fail before the fix and should pass after
- `PASS_TO_PASS` — tests that already pass and must continue to pass (regression detection)
- `patch` — the gold patch (the actual merged PR patch), kept for reference
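To see these fields concretely, the dataset loads like any other HuggingFace dataset; a minimal sketch using the `datasets` library:

```python
from datasets import load_dataset

# Verified split from the current primary namespace
ds = load_dataset("SWE-bench/SWE-bench_Verified", split="test")

inst = ds[0]
print(inst["instance_id"], inst["repo"], inst["base_commit"])
print(inst["problem_statement"][:300])  # first lines of the issue text
print(inst["FAIL_TO_PASS"])             # tests the fix must make pass
```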
1.2 Generating the Patch
The model consumes `problem_statement` plus repository context and outputs a patch in unified diff format. The diff must be strictly valid, or the downstream apply step will fail.
1.3 Docker Container Build
Each instance has a dedicated test spec, used to build a Docker image that contains the Python version, library dependencies, and test harness for that commit. SWE-bench fully migrated to containerized evaluation in June 2024, definitively solving the “it works on my machine” reproducibility problem. Per-instance images range from hundreds of MB to several GB — full evaluations consume enormous disk space.
1.4 Three-Level Patch Apply
Inside the container, three apply strategies are tried in sequence, with increasing leniency:
- `git apply --verbose` — strict; the patch must match exactly
- `git apply --verbose --reject` — partial hunks allowed; failed hunks go to `.rej` files
- `patch --batch --fuzz=5 -p1 -i` — fuzzy match; tolerates up to 5 lines of context drift
Only when all three fail does the patch get judged as apply failure. This leniency cascade tolerates minor whitespace or line-number offsets in model patches.
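The cascade is easy to mirror for local debugging. An illustrative sketch (the harness runs these same three commands inside the container, not this Python; `patch.diff` is a placeholder filename):

```python
import subprocess

# The three strategies, from strict to lenient
APPLY_COMMANDS = [
    ["git", "apply", "--verbose", "patch.diff"],
    ["git", "apply", "--verbose", "--reject", "patch.diff"],
    ["patch", "--batch", "--fuzz=5", "-p1", "-i", "patch.diff"],
]

def apply_patch(repo_dir: str) -> bool:
    for cmd in APPLY_COMMANDS:
        result = subprocess.run(cmd, cwd=repo_dir, capture_output=True)
        if result.returncode == 0:
            return True   # applied at this leniency level
    return False          # all three failed -> judged an apply failure
```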
1.5 Test Execution and Judgment
Inside the container, `eval.sh` runs both the `FAIL_TO_PASS` and `PASS_TO_PASS` test groups. A Resolved verdict requires:
- all `FAIL_TO_PASS` tests pass (the original bug is fixed), AND
- all `PASS_TO_PASS` tests still pass (nothing existing was broken).

Any failure on either side means unresolved. The most common failure mode is "fixed the bug but broke an existing test" — a `PASS_TO_PASS` regression.
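The verdict itself is simple set logic; a simplified model (the real harness parses test-runner logs to build the status map):

```python
def verdict(
    test_passed: dict[str, bool],   # test id -> did it pass?
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> str:
    bug_fixed = all(test_passed.get(t, False) for t in fail_to_pass)
    no_regression = all(test_passed.get(t, False) for t in pass_to_pass)
    return "resolved" if bug_fixed and no_regression else "unresolved"
```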
Why Docker? It solves environment variance. Different instances need different Python versions, different numpy/scipy releases, different system libraries. Without containers, there’s no reproducible, fair evaluation.
2 Dataset Variants and Selection
Five mainstream variants today:
| Variant | Instances | Characteristics | Best Fit |
|---|---|---|---|
| Full | 2,294 (test) | 12 Python repos, original collection | Comprehensive eval, paper reporting |
| Lite | 300 | Curated subset of Full | Fast iteration, low cost |
| Verified | 500 | OpenAI-validated solvability, description clarity, test sanity | Most reliable comparison baseline |
| Multilingual | 300 | 9 programming languages | Cross-language code evaluation |
| Multimodal | 510 (test) | Issues with screenshots / UI bugs | Visual understanding + code repair |
Selection advice:
- Daily iteration — use Lite. 300 instances, hours to run, quick feedback on prompt / agent changes.
- Serious comparison — use Verified. Community-consensus benchmark, human-validated, most trustworthy SOTA reference.
- Paper reporting — report both Full and Verified. Full conveys scale, Verified conveys quality.
`instance_id` naming — `{repo_owner}__{repo_name}-{PR_number}` (note the double underscore). For example, `django__django-16379` refers to PR #16379 in the `django` repo.
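Since the PR number always follows the last hyphen and the double underscore separates owner from repo, splitting an ID apart is mechanical:

```python
instance_id = "django__django-16379"
repo_part, pr_number = instance_id.rsplit("-", 1)
owner, name = repo_part.split("__")
print(owner, name, pr_number)  # django django 16379
```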
3 Patch Generation and Agent Integration
3.1 Prediction JSON Format
Model outputs are collected into a JSONL file, one prediction object per line (pretty-printed below for readability):

```json
{
    "model_name_or_path": "my-model",
    "instance_id": "django__django-16379",
    "model_patch": "--- a/django/db/models/query.py\n+++ b/django/db/models/query.py\n@@ -1234,7 +1234,7 @@\n..."
}
```
All three fields are required. Note the third field is `model_patch`, not `prediction` (a frequent typo). The value must be a valid unified diff string.
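A minimal way to produce the file, assuming your agent hands back an `instance_id` → diff mapping (`write_predictions` and `my-model` are illustrative names, not part of the harness):

```python
import json

def write_predictions(results: dict[str, str], path: str = "preds.jsonl") -> None:
    """results maps instance_id -> the unified diff your model produced."""
    with open(path, "w") as f:
        for instance_id, diff in results.items():
            record = {
                "model_name_or_path": "my-model",
                "instance_id": instance_id,
                "model_patch": diff,
            }
            f.write(json.dumps(record) + "\n")  # exactly one JSON object per line
```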
3.2 SWE-agent: The Reference Agent
Released alongside SWE-bench, SWE-agent’s design shaped the whole space:
- ACI (Agent-Computer Interface) — tools tailored for LM use, not raw bash. Raw bash is noisy for LMs (`ls` can print hundreds of lines, error messages are verbose); the ACI defines compact `open` / `edit` / `scroll` / `search` primitives.
- Agent loop — a three-step cycle: Observe (inspect files, error info) → Think (LM reasoning) → Act (edit, search, run tests). A minimal sketch follows this list.
- Config-driven YAML — behavior, tool set, and prompt templates all live in YAML; the code just executes.
- Status today — SWE-agent has entered maintenance mode; active development moved to `mini-swe-agent`.
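A minimal sketch of that loop; every name here (`Action`, `agent_loop`, `think`) is hypothetical, shown only to make the cycle concrete. The real tool set and prompts live in SWE-agent's YAML configs:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    name: str       # an ACI primitive ("open", "edit", "search", ...) or "submit"
    argument: str

def agent_loop(
    think: Callable[[list[str]], Action],    # the LM: full history -> next action
    tools: dict[str, Callable[[str], str]],  # ACI primitives: argument -> observation
    issue_text: str,
    max_steps: int = 50,
) -> Optional[str]:
    history = [issue_text]
    for _ in range(max_steps):
        action = think(history)                             # Think
        if action.name == "submit":
            return action.argument                          # the final unified diff
        observation = tools[action.name](action.argument)   # Act
        history += [f"{action.name}({action.argument})", observation]  # Observe
    return None  # step budget exhausted without a patch
```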
3.3 mini-swe-agent: Simplicity Wins
mini-swe-agent is the SWE-agent team’s successor: the core agent class is about 100 lines of Python — an order of magnitude simpler than SWE-agent. It hits >74% on SWE-bench Verified with Gemini 3 Pro, evidence that “a carefully minimal design + a strong LM” beats “a complex agent framework + a weaker LM.” It is now the main development line.
3.4 Rolling Your Own Agent
SWE-bench evaluation and agent implementation are decoupled. Any framework (LangChain, AutoGen, your own) works — just emit prediction JSONL in the required format. The evaluator doesn’t care how the agent works internally; it only checks whether the final patch applies and passes tests.
4 Environment Setup and Common Pitfalls
4.1 Hardware Requirements
| Resource | Recommendation |
|---|---|
| Disk | 120GB+ (Docker images accumulate; Full evaluation needs 200GB+) |
| Memory | 16GB minimum, 32GB recommended |
| CPU | 8 cores recommended |
| Docker | Required, Linux containers |
4.2 ARM / Apple Silicon
Prebuilt images on the registry are x86_64 only. On ARM you need to build them locally:
```bash
# --namespace '' → build images locally instead of pulling from the registry
python -m swebench.harness.run_evaluation \
    --predictions_path preds.jsonl \
    --run_id my_run \
    --namespace ''
```
ARM support remains experimental; some instances’ dependencies may fail to build on ARM.
4.3 Five Common Pitfalls
- Disk fills up — Docker images can accumulate to hundreds of GB. Regularly run `docker system prune -a` to clean stopped containers and unused images.
- Invalid patch format — non-conformant unified diffs fail silently (no error, just judged unresolved). During development, validate patches locally with `git apply --check` before submission; see the helper sketch after this list.
- `PASS_TO_PASS` regression — the single most common failure: the bug gets fixed, but an existing test breaks. Have your agent run key `PASS_TO_PASS` tests before editing, then re-run them afterward to catch regressions.
- Instance creation paused — maintainers have paused the custom-instance creation feature (constructing new eval instances from new PRs). Only the official datasets are available for now.
- Evaluation timeout — some instances' test suites are slow (parts of sympy and matplotlib can take 10+ minutes). Configure evaluator timeouts generously or risk false timeouts.
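For the patch-format pitfall, a small local validator helps. A sketch assuming the repository is already checked out at the instance's `base_commit` (`patch_applies` is an illustrative helper, not part of the harness):

```python
import subprocess

def patch_applies(repo_dir: str, patch_file: str) -> bool:
    """Dry run: would `git apply` accept this patch without touching the tree?"""
    result = subprocess.run(
        ["git", "apply", "--check", patch_file],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(result.stderr.strip())  # show why the patch was rejected
    return result.returncode == 0
```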
4.4 Alternative: sb-cli
If you’d rather not maintain a local Docker environment, use the official sb-cli, a cloud evaluation service running on AWS. Submit your prediction JSONL and get results back in minutes to hours. A good fit for small teams, one-off evaluations, or shaky local Docker setups.
5 Summary
- Interpreting resolved rate — current Verified SOTA is >74% (mini-swe-agent + Gemini 3 Pro); Full SOTA is lower (instance quality is uneven, some bugs are hard even for humans to reproduce). Always name the specific variant when reporting scores.
- Complement to function-level benchmarks — HumanEval / MBPP measure “can it write code?”; SWE-bench measures “can it do software engineering?” — a completely different capability dimension (project-level code understanding, multi-file edits, regression control).
- Limitations — Full / Lite / Verified are Python-only; issue quality varies; some instances require domain knowledge (numerical libs, web frameworks). For cross-language evaluation, use Multilingual.
For the broader landscape of code evaluation, see Code Benchmarks. For a systematic view of agent capability levels (single tool call to multi-step agent), see Agent Benchmarks.