SWE-bench Practical Guide
Updated 2026-04-16
The SWE-bench section in Code Benchmarks placed SWE-bench in the landscape of code evaluation: the jump from HumanEval-style function completion to repairing real GitHub issues at project scale was the paradigm shift of 2023 code evaluation. This article covers the practical side: how the Docker-based evaluation pipeline works, what the patch format requires, how agents integrate, and how to set up the environment. By the end, you should be able to run a SWE-bench evaluation, understand how each instance is judged, and diagnose common failure modes.
1 The Evaluation Pipeline
The evaluation harness runs a five-stage flow:
1.1 Input Preparation
Load the dataset from HuggingFace. Two namespaces exist: `SWE-bench/SWE-bench_Verified` is the new primary repository; `princeton-nlp/SWE-bench_Verified` is the legacy namespace (still accessible). Each instance carries these key fields:
- `problem_statement` — the issue text, a human description of a bug or feature request
- `base_commit` — git commit hash of the code snapshot
- `repo` — full repo name (e.g. `django/django`)
- `FAIL_TO_PASS` — tests that fail before the fix and should pass after
- `PASS_TO_PASS` — tests that already pass and must continue to pass (regression detection)
- `patch` — the gold patch (the actual merged PR patch), kept for reference
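To see these fields concretely, the dataset loads like any other HuggingFace dataset; a minimal sketch using the `datasets` library:

```python
from datasets import load_dataset

# Verified split from the current primary namespace
ds = load_dataset("SWE-bench/SWE-bench_Verified", split="test")

inst = ds[0]
print(inst["instance_id"], inst["repo"], inst["base_commit"])
print(inst["problem_statement"][:300])  # first lines of the issue text
print(inst["FAIL_TO_PASS"])             # tests the fix must make pass
```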
1.2 Generating the Patch
The model consumes `problem_statement` plus repository context and outputs a patch in unified diff format. The diff must be strictly valid, or the downstream apply step will fail.
1.3 Docker Container Build
Each instance has a dedicated test spec, used to build a Docker image that contains the Python version, library dependencies, and test harness for that commit. SWE-bench fully migrated to containerized evaluation in June 2024, definitively solving the “it works on my machine” reproducibility problem. Per-instance images range from hundreds of MB to several GB — full evaluations consume enormous disk space.
1.4 Three-Level Patch Apply
Inside the container, three apply strategies are tried in sequence, with increasing leniency:
- `git apply --verbose` — strict; the patch must match exactly
- `git apply --verbose --reject` — partial hunks allowed; failed hunks go to `.rej` files
- `patch --batch --fuzz=5 -p1 -i` — fuzzy match; tolerates up to 5 lines of context drift
Only when all three fail does the patch get judged as apply failure. This leniency cascade tolerates minor whitespace or line-number offsets in model patches.
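The cascade is easy to mirror for local debugging. An illustrative sketch (the harness runs these same three commands inside the container, not this Python; `patch.diff` is a placeholder filename):

```python
import subprocess

# The three strategies, from strict to lenient
APPLY_COMMANDS = [
    ["git", "apply", "--verbose", "patch.diff"],
    ["git", "apply", "--verbose", "--reject", "patch.diff"],
    ["patch", "--batch", "--fuzz=5", "-p1", "-i", "patch.diff"],
]

def apply_patch(repo_dir: str) -> bool:
    for cmd in APPLY_COMMANDS:
        result = subprocess.run(cmd, cwd=repo_dir, capture_output=True)
        if result.returncode == 0:
            return True   # applied at this leniency level
    return False          # all three failed -> judged an apply failure
```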
1.5 Test Execution and Judgment
Inside the container, `eval.sh` runs both the `FAIL_TO_PASS` and `PASS_TO_PASS` test groups. A Resolved verdict requires:
- all `FAIL_TO_PASS` tests pass (the original bug is fixed), AND
- all `PASS_TO_PASS` tests still pass (nothing existing was broken).

Any failure on either side means unresolved. The most common failure mode is "fixed the bug but broke an existing test" — a `PASS_TO_PASS` regression.
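The verdict itself is simple set logic; a simplified model (the real harness parses test-runner logs to build the status map):

```python
def verdict(
    test_passed: dict[str, bool],   # test id -> did it pass?
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> str:
    bug_fixed = all(test_passed.get(t, False) for t in fail_to_pass)
    no_regression = all(test_passed.get(t, False) for t in pass_to_pass)
    return "resolved" if bug_fixed and no_regression else "unresolved"
```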
Why Docker? It solves environment variance. Different instances need different Python versions, different numpy/scipy releases, different system libraries. Without containers, there’s no reproducible, fair evaluation.
2 Dataset Variants and Selection
Five mainstream variants today:
| Variant | Instances | Characteristics | Best Fit |
|---|---|---|---|
| Full | 2,294 (test) | 12 Python repos, original collection | Comprehensive eval, paper reporting |
| Lite | 300 | Curated subset of Full | Fast iteration, low cost |
| Verified | 500 | OpenAI-validated solvability, description clarity, test sanity | Most reliable comparison baseline |
| Multilingual | 300 | 9 programming languages | Cross-language code evaluation |
| Multimodal | 510 (test) | Issues with screenshots / UI bugs | Visual understanding + code repair |
Selection advice:
- Daily iteration — use Lite. 300 instances, hours to run, quick feedback on prompt / agent changes.
- Serious comparison — use Verified. Community-consensus benchmark, human-validated, most trustworthy SOTA reference.
- Paper reporting — report both Full and Verified. Full conveys scale, Verified conveys quality.
`instance_id` naming — `{repo_owner}__{repo_name}-{PR_number}` (note the double underscore). For example, `django__django-16379` refers to PR #16379 in the `django` repo.
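Since the PR number always follows the last hyphen and the double underscore separates owner from repo, splitting an ID apart is mechanical:

```python
instance_id = "django__django-16379"
repo_part, pr_number = instance_id.rsplit("-", 1)
owner, name = repo_part.split("__")
print(owner, name, pr_number)  # django django 16379
```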
3 Patch Generation and Agent Integration
3.1 Prediction JSON Format
Model outputs are collected into a JSONL file, one prediction object per line (pretty-printed below for readability):

```json
{
    "model_name_or_path": "my-model",
    "instance_id": "django__django-16379",
    "model_patch": "--- a/django/db/models/query.py\n+++ b/django/db/models/query.py\n@@ -1234,7 +1234,7 @@\n..."
}
```
All three fields are required. Note the third field is `model_patch`, not `prediction` (a frequent typo). The value must be a valid unified diff string.
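A minimal way to produce the file, assuming your agent hands back an `instance_id` → diff mapping (`write_predictions` and `my-model` are illustrative names, not part of the harness):

```python
import json

def write_predictions(results: dict[str, str], path: str = "preds.jsonl") -> None:
    """results maps instance_id -> the unified diff your model produced."""
    with open(path, "w") as f:
        for instance_id, diff in results.items():
            record = {
                "model_name_or_path": "my-model",
                "instance_id": instance_id,
                "model_patch": diff,
            }
            f.write(json.dumps(record) + "\n")  # exactly one JSON object per line
```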
3.2 SWE-agent: The Reference Agent
Released alongside SWE-bench, SWE-agent’s design shaped the whole space:
- ACI (Agent-Computer Interface) — tools tailored for LM use, not raw bash. Raw bash is noisy for LMs (`ls` can print hundreds of lines, error messages are verbose); the ACI defines compact `open` / `edit` / `scroll` / `search` primitives.
- Agent loop — a three-step cycle: Observe (inspect files, error info) → Think (LM reasoning) → Act (edit, search, run tests). A minimal sketch follows this list.
- Config-driven YAML — behavior, tool set, and prompt templates all live in YAML; the code just executes.
- Status today — SWE-agent has entered maintenance mode; active development moved to `mini-swe-agent`.
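A minimal sketch of that loop; every name here (`Action`, `agent_loop`, `think`) is hypothetical, shown only to make the cycle concrete. The real tool set and prompts live in SWE-agent's YAML configs:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    name: str       # an ACI primitive ("open", "edit", "search", ...) or "submit"
    argument: str

def agent_loop(
    think: Callable[[list[str]], Action],    # the LM: full history -> next action
    tools: dict[str, Callable[[str], str]],  # ACI primitives: argument -> observation
    issue_text: str,
    max_steps: int = 50,
) -> Optional[str]:
    history = [issue_text]
    for _ in range(max_steps):
        action = think(history)                             # Think
        if action.name == "submit":
            return action.argument                          # the final unified diff
        observation = tools[action.name](action.argument)   # Act
        history += [f"{action.name}({action.argument})", observation]  # Observe
    return None  # step budget exhausted without a patch
```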
3.3 mini-swe-agent: Simplicity Wins
mini-swe-agent is the SWE-agent team’s successor: the core agent class is about 100 lines of Python — an order of magnitude simpler than SWE-agent. It hits >74% on SWE-bench Verified with Gemini 3 Pro, evidence that “a carefully minimal design + a strong LM” beats “a complex agent framework + a weaker LM.” It is now the main development line.
3.4 Rolling Your Own Agent
SWE-bench evaluation and agent implementation are decoupled. Any framework (LangChain, AutoGen, your own) works — just emit prediction JSONL in the required format. The evaluator doesn’t care how the agent works internally; it only checks whether the final patch applies and passes tests.
4 Environment Setup and Common Pitfalls
4.1 Hardware Requirements
| Resource | Recommendation |
|---|---|
| Disk | 120GB+ (Docker images accumulate; Full evaluation needs 200GB+) |
| Memory | 16GB minimum, 32GB recommended |
| CPU | 8 cores recommended |
| Docker | Required, Linux containers |
4.2 ARM / Apple Silicon
Prebuilt images on the registry are x86_64 only. On ARM you need to build them locally:
```bash
# --namespace '' → build images locally instead of pulling from the registry
python -m swebench.harness.run_evaluation \
    --predictions_path preds.jsonl \
    --run_id my_run \
    --namespace ''
```
ARM support remains experimental; some instances’ dependencies may fail to build on ARM.
4.3 Five Common Pitfalls
- Disk fills up — Docker images can accumulate to hundreds of GB. Regularly run `docker system prune -a` to clean stopped containers and unused images.
- Invalid patch format — non-conformant unified diffs fail silently (no error, just judged unresolved). During development, validate patches locally with `git apply --check` before submission; see the helper sketch after this list.
- `PASS_TO_PASS` regression — the single most common failure: the bug gets fixed, but an existing test breaks. Have your agent run key `PASS_TO_PASS` tests before editing, then re-run them afterward to catch regressions.
- Instance creation paused — maintainers have paused the custom-instance creation feature (constructing new eval instances from new PRs). Only the official datasets are available for now.
- Evaluation timeout — some instances' test suites are slow (parts of sympy and matplotlib can take 10+ minutes). Configure evaluator timeouts generously or risk false timeouts.
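For the patch-format pitfall, a small local validator helps. A sketch assuming the repository is already checked out at the instance's `base_commit` (`patch_applies` is an illustrative helper, not part of the harness):

```python
import subprocess

def patch_applies(repo_dir: str, patch_file: str) -> bool:
    """Dry run: would `git apply` accept this patch without touching the tree?"""
    result = subprocess.run(
        ["git", "apply", "--check", patch_file],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(result.stderr.strip())  # show why the patch was rejected
    return result.returncode == 0
```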
4.4 Alternative: sb-cli
If you’d rather not maintain a local Docker environment, use the official sb-cli, a cloud evaluation service running on AWS. Submit your prediction JSONL and get results back in minutes to hours. A good fit for small teams, one-off evaluations, or shaky local Docker setups.
5 Summary
- Interpreting resolved rate — current Verified SOTA is >74% (mini-swe-agent + Gemini 3 Pro); Full SOTA is lower (instance quality is uneven, some bugs are hard even for humans to reproduce). Always name the specific variant when reporting scores.
- Complement to function-level benchmarks — HumanEval / MBPP measure “can it write code?”; SWE-bench measures “can it do software engineering?” — a completely different capability dimension (project-level code understanding, multi-file edits, regression control).
- Limitations — Full / Lite / Verified are Python-only; issue quality varies; some instances require domain knowledge (numerical libs, web frameworks). For cross-language evaluation, use Multilingual.
For the broader landscape of code evaluation, see Code Benchmarks. For a systematic view of agent capability levels (single tool call to multi-step agent), see Agent Benchmarks.