
lm-eval-harness Practical Guide

Updated 2026-04-16

Benchmark Landscape §6 introduced lm-eval-harness as the core of the open-source evaluation ecosystem, covering 200+ tasks. That article explained the tool’s position in the landscape and why it matters. This article focuses on the practical side: how the Task YAML system works, how to wire in any of the 17+ model backends, and how to read the results. By the end, you should be able to run a complete evaluation, understand every field you see, and sidestep common pitfalls.

1 Three-Layer Architecture

lm-eval’s design cleanly separates into three layers with non-overlapping responsibilities:

1.1 Task Layer

YAML-driven task definitions. Each task is a YAML file, using Jinja2 templates to define prompt format and metric_list to specify evaluation metrics. The framework ships with 208+ built-in tasks, auto-registered from directories under lm_eval/tasks/. Tasks can be used standalone (e.g. hellaswag) or grouped (e.g. mmlu contains 57 subtasks).

1.2 Model Layer

A unified abstraction via the LM base class, exposing three core methods:

  • loglikelihood(context, continuation) — conditional log-probability of a continuation given a context. Used for yes/no judgments, true/false claims, and choice comparison.
  • loglikelihood_rolling(string) — unconditional perplexity over a full string. Used for pure language-modeling benchmarks like WikiText.
  • generate_until(context, stop_tokens) — free generation from a context until a stop token. Used for GSM8K, HumanEval, and other tasks needing a complete generated answer.

17+ backend adapters implement this interface: hf (HuggingFace transformers), vllm, openai-chat-completions, anthropic, openvino, gguf, sglang, nemo, and more. From the harness’s perspective, a model is simply a black box that answers these three methods.
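
To make the abstraction concrete, here is a sketch of what a new adapter has to provide — it assumes lm-eval v0.4+'s LM base class, register_model decorator, and Instance-style request objects; check the shipped hf adapter for the authoritative signatures:

from lm_eval.api.model import LM
from lm_eval.api.registry import register_model

@register_model("my-backend")          # exposes --model my-backend on the CLI
class MyBackend(LM):
    def loglikelihood(self, requests):
        # each request carries (context, continuation); return one (logprob, is_greedy) pair per request
        return [(self._score(ctx, cont), False) for ctx, cont in (r.args for r in requests)]

    def loglikelihood_rolling(self, requests):
        # each request carries (string,); return one full-string logprob per request
        return [self._score("", text) for (text,) in (r.args for r in requests)]

    def generate_until(self, requests):
        # each request carries (context, gen_kwargs); stop strings arrive in gen_kwargs["until"]
        return [self._generate(ctx, kwargs.get("until", [])) for ctx, kwargs in (r.args for r in requests)]

    def _score(self, context, continuation):
        raise NotImplementedError("call your inference engine here")

    def _generate(self, context, stop):
        raise NotImplementedError("call your inference engine here")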

1.3 Evaluation Loop

The orchestration layer that glues Task and Model together:

  1. Read the dataset named in the task YAML
  2. Render each sample into a prompt via the Jinja2 templates
  3. Sample few-shot examples from training_split and assemble them
  4. Pack into requests and dispatch to the Model backend
  5. Receive logprobs or generated text
  6. Run the filter pipeline (e.g. CoT extraction)
  7. Compute metrics, aggregate, and emit JSON

1.4 The Four output_type Modes

Task YAML uses output_type to tell the Evaluation Loop how to shape requests and score responses. Four modes:

output_type           | Purpose                                                            | Typical tasks
loglikelihood         | Given a context, compute the logprob of a specified continuation  | BoolQ (yes/no), true/false claims
multiple_choice       | Loglikelihood per choice, pick the highest                         | MMLU, HellaSwag, ARC
generate_until        | Free generation, judged via regex / exact match / code execution  | GSM8K, HumanEval, TriviaQA
loglikelihood_rolling | Perplexity over a full passage, no context                         | WikiText, Lambada
[Interactive diagram: lm-eval three-layer architecture. For a multiple_choice task (e.g. HellaSwag, MMLU), the request lifecycle runs: read Task YAML → render prompt with Jinja2 → assemble few-shot examples → build one request per choice → dispatch to the Model backend (hf / vllm / ...) → receive logprobs / text → pick the choice with the highest logprob → compute and aggregate metrics.]

The diagram above traces the full lifecycle of an evaluation request. The chosen output_type changes how the Evaluation Loop behaves — for example, multiple_choice issues one loglikelihood request per choice, while generate_until adds a filter stage after generation.
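
As a concrete illustration of the multiple_choice path (not harness code), this is roughly how per-choice logprobs become acc and acc_norm — the harness divides by the byte length of the choice string for the normalized variant:

def score_multiple_choice(choice_logprobs, choices, gold_idx):
    # acc: plain argmax over the summed logprob of each choice
    pred = max(range(len(choices)), key=lambda i: choice_logprobs[i])
    # acc_norm: argmax after dividing each logprob by the choice's byte length
    pred_norm = max(
        range(len(choices)),
        key=lambda i: choice_logprobs[i] / len(choices[i].encode("utf-8")),
    )
    return {"acc": float(pred == gold_idx), "acc_norm": float(pred_norm == gold_idx)}

# A long correct choice with a slightly lower total logprob can still win under acc_norm:
print(score_multiple_choice([-12.0, -11.5], ["went to the store to buy milk", "left"], 0))
# -> {'acc': 0.0, 'acc_norm': 1.0}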

2 Task YAML Deep Dive

Using HellaSwag as our running example, field by field:

task: hellaswag
dataset_path: Rowan/hellaswag
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "{{query}}"           # Jinja2 template rendering prompt from dataset fields
doc_to_target: "{{label}}"         # Correct answer (choice index for MC)
doc_to_choice: "choices"           # Name of the field holding the choice list
num_fewshot: 0
process_docs: !function utils.process_docs   # Python preprocessing function
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

2.1 Key Fields

  • doc_to_text / doc_to_target / doc_to_choice — Jinja2 templates with full access to dataset fields, supporting conditionals and loops. HellaSwag’s doc_to_text: "{{query}}" directly pulls the query field.
  • process_docs — the !function tag references a Python function for preprocessing. HellaSwag’s implementation cleans text (removes brackets and title markers, strips parenthesized content) to produce a cleaner prompt.
  • metric_list — supports acc, acc_norm (length-normalized), exact_match, bleu, rouge, perplexity, and others. Multiple metrics can coexist.
  • num_fewshot — number of few-shot examples sampled from training_split. 0 means zero-shot.
  • filter_list — post-processing pipeline for scenarios like CoT. A typical pipeline is regex (extract the answer) followed by take_first (select the first match).
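
For the generate_until + filter_list case, here is a sketch of what such a task can look like, loosely modeled on the shipped GSM8K config — the dataset fields, regex, and generation settings are illustrative rather than the exact upstream YAML:

task: my_gsm8k_cot
dataset_path: gsm8k
dataset_name: main
output_type: generate_until
training_split: train
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
generation_kwargs:
  until: ["Question:", "\n\n"]
  max_gen_toks: 256
filter_list:
  - name: "get-answer"
    filter:
      - function: "regex"
        regex_pattern: "(-?[0-9][0-9.,]*)"   # pull a number out of the CoT output
      - function: "take_first"               # keep only the first extracted match
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true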

2.2 Task Groups and Inheritance

  • Group YAML — the task field lists child tasks; aggregate_metric_list defines aggregation. For example, MMLU bundles 57 subject-specific subtasks into a single group.
  • include inheritance — a child task can use include: parent_task.yaml to inherit the parent’s config and only override the differences, reducing duplication.
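
A minimal sketch of both patterns — group and task names here are illustrative; see the shipped mmlu configs for the authoritative field set:

# group YAML: lists child tasks and how to aggregate their scores
group: my_suite
task:
  - my_suite_math
  - my_suite_history
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true

# child task YAML: inherits a shared template and overrides only the differences
include: _my_suite_template.yaml
task: my_suite_history
dataset_name: history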

2.3 External Task Registration

--include_path /my/tasks/ auto-scans a directory and registers every YAML inside. Custom tasks require no source changes — just write a YAML (plus a Python utils module if process_docs / custom metrics are needed).
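
For example (the task name is whatever your YAML's task field declares):

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3-8B \
  --tasks my_custom_task \
  --include_path /my/tasks/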

Beyond HellaSwag (multiple_choice), the shipped GSM8K (generate_until + filter_list) and WikiText (loglikelihood_rolling) configs are good reference points for the other two main output_type modes.

3 Wiring Up Model Backends

Five commonly used backends, with config details. All backends share CLI flags like --tasks, --batch_size, --num_fewshot, --limit; only --model and --model_args differ.

3.1 hf (HuggingFace Transformers) — Reference Baseline

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3-8B,dtype=float16 \
  --tasks hellaswag --batch_size auto
  • Default backend, and the numerical reference. Other backends’ accuracy is measured against hf.
  • dtype: float16 / bfloat16 / float32
  • parallelize=True: model parallelism (single-node, multi-GPU for large models)
  • accelerate launch -m lm_eval ...: data parallelism (same model on multiple GPUs, for throughput)

3.2 vllm — High Throughput

lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3-8B,tensor_parallel_size=2,gpu_memory_utilization=0.8 \
  --tasks mmlu --batch_size auto
  • Throughput is much higher than hf, suitable for large-scale evaluation (200k+ samples).
  • Note: vLLM’s numerical results may differ slightly from hf (floating-point precision and continuous batching). For paper-quality numbers, report results from the hf backend.
  • Use scripts/model_comparator.py to pinpoint cross-backend differences.

3.3 openai-chat-completions — API Models

lm_eval --model openai-chat-completions \
  --model_args model=gpt-4o \
  --tasks mmlu --apply_chat_template
  • Supports the OpenAI family, and Anthropic via --model anthropic-chat-completions.
  • --apply_chat_template: auto-wrap prompts in the chat format. Near-mandatory for chat / instruction-tuned models.
  • --system_instruction "You are...": inject a system prompt.
  • Watch rate limits and cost: prefer a smaller --batch_size and throttle parallel requests via num_concurrent in --model_args.

3.4 openvino — Intel Inference

lm_eval --model openvino \
  --model_args pretrained=OpenVINO/llama-3-8b-ov,dtype=int4 \
  --tasks hellaswag
  • Requires an Optimum-Intel-converted OpenVINO IR model (one way to produce one is sketched after this list).
  • Supports accuracy evaluation of INT4/INT8 quantized models — perfect for verifying that quantization did not hurt accuracy.
  • Very useful for quantization regression testing on Intel CPU / iGPU / NPU.
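
One way to produce such an IR model is Optimum-Intel's export CLI — a sketch; the --weight-format options and output layout are per optimum-intel, so verify against your installed version:

optimum-cli export openvino \
  --model meta-llama/Llama-3-8B \
  --weight-format int4 \
  llama-3-8b-ov-int4/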

3.5 gguf — llama.cpp Server

lm_eval --model gguf \
  --model_args base_url=http://localhost:8080 \
  --tasks hellaswag
  • Connects via a llama.cpp server. The backend is implemented in gguf.py, which also registers ggml as an alias (same implementation, either name works).
  • Key pitfall: you must explicitly configure a tokenizer, otherwise tokenizer reconstruction can hang indefinitely.

3.6 Multi-GPU Configuration

  • Model parallelism (when a model doesn’t fit on one GPU): parallelize=True for hf, tensor_parallel_size=N for vllm.
  • Data parallelism (throughput): accelerate launch --num_processes N -m lm_eval ....
  • These compose: for a large model, run tensor parallelism inside each replica and data parallelism across replicas, as sketched below.
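
A hedged example of composing both on the vllm backend (data_parallel_size is a model arg of the vLLM integration; verify exact names against your installed lm-eval and vLLM versions):

lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3-70B,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.8 \
  --tasks mmlu --batch_size auto
# 4-way tensor parallelism inside each of 2 replicas = 8 GPUs total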

4 Reading Results and Common Pitfalls

4.1 Result JSON Structure

{
  "results": {
    "hellaswag": {
      "acc": 0.7523,
      "acc,stderr": 0.0043,
      "acc_norm": 0.7891,
      "acc_norm,stderr": 0.0041
    }
  },
  "configs": { "hellaswag": { "...": "..." } },
  "versions": { "hellaswag": 1.0 },
  "n-shot": { "hellaswag": 0 },
  "n-samples": { "hellaswag": 10042 }
}
  • acc vs acc_norm — acc_norm normalizes each choice’s logprob by the choice’s length (the byte length of the choice string) before taking the argmax. On tasks where choices vary widely in length (HellaSwag), acc_norm is fairer.
  • ,stderr — standard error. Used to decide whether a score gap between two models is statistically meaningful (a 0.5% gap with stderr 0.4% is essentially noise).
  • --log_samples — saves per-sample inputs, outputs, and logprobs. Essential for error analysis.
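
For example, to keep both the summary JSON and per-sample logs from a run:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3-8B,dtype=float16 \
  --tasks hellaswag \
  --output_path results/ \
  --log_samples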

4.2 Common Pitfalls (7)

  1. Base install lacks torch — v0.4+ ships without torch / transformers. Install the backend extras (e.g. pip install lm-eval[hf]) or the backend’s own dependencies.
  2. vLLM vs HF numerical drift — floating-point precision and continuous batching cause 0.5–2% score differences. Two models both benchmarked on vllm are comparable to each other, but don’t compare them to hf-based numbers from a published paper.
  3. GGUF tokenizer hangs — not specifying a tokenizer can cause reconstruction to hang forever. Always pass tokenizer info in --model_args.
  4. SGLang OOM — --batch_size auto can OOM with SGLang. Set the batch size manually and add mem_fraction_static=0.7.
  5. MPS (Apple Silicon) precision — some ops diverge from CPU/CUDA. Cross-validate results with CPU before trusting them.
  6. CoT reasoning traces contaminate scoring — models emitting <think> blocks (like Qwen-QwQ) will feed the thinking into the scorer. Use think_end_token to strip the reasoning chain before scoring.
  7. Task version mismatch — when metadata.version changes, scores are no longer directly comparable (new prompt templates can shift results 1–3%). Always align on version when comparing across runs.

5 Summary

  • Custom task best practices — run with --limit 5 first to iterate on templates, use --log_samples to verify the rendered prompt and filter extractions, only then run full evaluation.
  • Python API — lm_eval.simple_evaluate() enables scripted evaluation pipelines, ideal for integrating into CI/CD for regression testing (a minimal sketch follows this list).
  • Integrations — --wandb_args project=... auto-uploads to Weights & Biases; --hf_hub_log_args repo_id=... pushes to the HuggingFace Hub for team sharing.
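
A minimal sketch of such a regression check — the model, task slice, threshold, and result key are illustrative; depending on the harness version the metric key may appear as a metric,filter pair such as "acc,none":

import lm_eval

# Run a small HellaSwag slice as a CI smoke test (limit keeps it fast).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3-8B,dtype=float16",
    tasks=["hellaswag"],
    num_fewshot=0,
    limit=100,
)

# Fail the pipeline if accuracy regresses below an agreed floor.
acc = results["results"]["hellaswag"]["acc"]  # may be keyed "acc,none" on newer versions
assert acc > 0.70, f"regression: hellaswag acc dropped to {acc:.3f}"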

For this framework’s position in the broader evaluation ecosystem, see Benchmark Landscape §6. For specific benchmarks and difficulty distribution, see Reasoning Benchmarks, Code Benchmarks.