Content on this site is AI-generated and may contain errors. If you find issues, please report them at GitHub Issues.

Impact of Optimization on Accuracy

Updated 2026-04-16

2-4x Speedup — But How Much Accuracy Do You Actually Lose?

Quantization, pruning, KV cache compression — these optimization techniques promise 2-4x inference speedups and halved memory usage for LLMs. But at what cost? How much accuracy do you actually lose? More importantly: how do you verify this yourself?

This isn’t a simple question. With the same INT4 quantization, a 7B model might lose only 3 percentage points on MMLU, but plummet 16% on HumanEval. The same quantization method might be barely noticeable on a 70B model but could impact practical usability on a 7B model.

This article focuses on evaluation methodology — we won’t cover the internals of quantization algorithms (that belongs in the quantization path), but instead answer three core questions:

  1. How large is the accuracy cost of different optimization techniques? The sensitivity differences across benchmarks are striking
  2. How do you measure it yourself? A detailed look at three toolchains: lm-evaluation-harness, OpenVINO, and llama.cpp
  3. Is perplexity enough? When it’s a good metric, and when it can mislead you

Scope note: This article belongs to the evaluation path and focuses on “how to measure accuracy loss.” For the internals of quantization algorithms themselves (GPTQ, AWQ, GGUF quantization types, etc.), see the dedicated articles in the quantization path.

The Full Landscape of Optimization Techniques and Accuracy Costs

Before diving into evaluation methods, let’s establish a panoramic view — the typical accuracy cost ranges for mainstream optimization techniques:

| Optimization Type | Typical Methods | Typical Accuracy Loss | Sensitivity Notes |
|---|---|---|---|
| Weight-only quantization | GPTQ, AWQ, bitsandbytes NF4 | INT8: <1%; INT4: 1-5% | Code/math tasks are most sensitive |
| Weight + activation quantization | SmoothQuant, ZeroQuant | INT8: 0.5-2%; W4A8: 2-8% | Activation quantization adds extra loss |
| GGUF quantization | llama.cpp Q4_K_M, Q3_K_M, etc. | Q4: 1-3%; Q2: 8-20% | Degradation accelerates sharply below Q4 |
| FP8 quantization | FP8 E4M3/E5M2 | <0.5% | Nearly lossless, but requires hardware support |
| KV cache quantization | KV cache INT8/INT4 | 0.1-2% | Greater impact in long-context scenarios |
| Sparsification | SparseGPT, Wanda | 10-50% sparse: 1-5% | Non-linear degradation at high sparsity rates |

Key insight: These are typical ranges, not fixed values. Actual degradation depends on model architecture, scale, calibration data, and the specific task you care about. This is precisely why you need to measure empirically.

Two crucial patterns:

  1. Larger models are more quantization-resilient: 70B models have far more parameter redundancy than 7B models, so quantization noise has less impact on overall output
  2. Task type determines sensitivity: Code generation (requires precise syntax) > math reasoning (requires precise computation chains) > knowledge QA (more tolerant of errors)

The Unevenness of Degradation

The “typical accuracy loss” mentioned above is an average — but in reality, degradation varies enormously across different benchmarks. This is the most commonly overlooked pitfall when choosing evaluation metrics.

The interactive chart below lets you explore this unevenness yourself:

[Interactive chart: Quantization Degradation Explorer. Select a model scale and view mode to explore how different quantization methods affect each benchmark. Values are representative trends drawn from literature and community evaluations; exact scores vary with model version and evaluation conditions.]

7B model, per-benchmark quantization degradation (scores, with relative change from FP16):

| Benchmark | FP16 | INT8 | INT4 | FP8 |
|---|---|---|---|---|
| MMLU (knowledge) | 64 | 63.5 (-0.8%) | 61 (-4.7%) | 63.7 (-0.5%) |
| MMLU-Pro (knowledge) | 35 | 34.5 (-1.4%) | 32 (-8.6%) | 34.8 (-0.6%) |
| GSM8K (math) | 52 | 51 (-1.9%) | 47 (-9.6%) | 51.5 (-1.0%) |
| MATH (math) | 18 | 17.5 (-2.8%) | 15 (-16.7%) | 17.7 (-1.7%) |
| HumanEval (code) | 62 | 59 (-4.8%) | 52 (-16.1%) | 61 (-1.6%) |

Key findings:

  • Sensitivity ordering: code > math > knowledge; code generation is the most sensitive to quantization
  • Larger models tolerate quantization better: a 70B model loses far less accuracy under INT4 than a 7B model (redundant parameters provide a buffer)
  • In the 7B view, the largest degradation (16.7%) occurs at INT4 on MATH

Why Is Code the Most Sensitive?

Code generation is particularly sensitive to quantization because its tolerance for errors is extremely small:

  • Syntactic rigidity: A missing bracket or wrong indentation makes the code completely non-executable. In knowledge QA, “roughly correct” still earns partial credit; in code, “roughly correct” equals zero
  • Precise token selection: Code generation is highly dependent on subtle differences in token probability distributions — the small probability shifts introduced by quantization can cause the model to select the wrong critical token (e.g., == becoming =)
  • Long-range dependencies: A function’s correctness depends on the precision of all preceding variable declarations and import statements. Quantization errors accumulate over long sequences
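The "== becoming =" failure mode can be seen in a toy sketch: quantization only needs to nudge two near-tied logits past each other for greedy decoding to emit the wrong token. The logit values below are invented for illustration, not taken from a real model:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: take the highest-logit token."""
    return max(logits, key=logits.get)

# Toy logits for two candidate tokens at a critical position.
logits_fp16 = {"==": 2.10, "=": 2.00}

# Quantization perturbs weights, which perturbs logits. Here we mimic
# the accumulated rounding error with a small hand-picked shift.
logits_int4 = {"==": 2.02, "=": 2.08}

print(softmax(list(logits_fp16.values())))  # probabilities are near 50/50
print(greedy_pick(logits_fp16))  # '=='
print(greedy_pick(logits_int4))  # '=' -- the flipped token breaks the code
```

The probabilities barely move, which is why perplexity hardly changes, yet the greedy choice at this one critical position flips and the generated program no longer compiles.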

Why Are Larger Models More Quantization-Resilient?

This pattern shows up clearly when you compare 7B and 70B results under the same quantization level:

  • Parameter redundancy: Larger models have more “backup channels,” making it easier for quantization errors in individual weights to be compensated by other channels
  • Smoother loss surface: Weight distributions in larger models tend to be smoother with a lower proportion of outliers, making them more quantization-friendly
  • Practical implication: If your task is extremely accuracy-sensitive (e.g., medical code generation), prefer a large model + moderate quantization (e.g., INT8) over a small model + no quantization

Deep Dive 1: lm-evaluation-harness Hands-On Workflow

lm-evaluation-harness (commonly called lm-eval) is currently the most mainstream LLM evaluation framework, maintained by EleutherAI, covering 60+ standard benchmarks. It’s the go-to tool for comparing accuracy before and after quantization.

5-Step Hands-On Workflow

Step 1: Installation

pip install lm-eval
# Or install from source (to get the latest benchmarks)
pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git

Step 2: Evaluate the FP16 Baseline

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
  --tasks mmlu,gsm8k,humaneval \
  --device cuda:0 \
  --batch_size auto

Step 3: Evaluate the Quantized Version

# GPTQ quantized model (using a community-quantized version as an example)
lm_eval --model hf \
  --model_args pretrained=<your-gptq-model-path> \
  --tasks mmlu,gsm8k,humaneval \
  --device cuda:0 \
  --batch_size auto

# GGUF quantized model
lm_eval --model hf \
  --model_args pretrained=/path/to/gguf_folder,gguf_file=model-Q4_K_M.gguf,tokenizer=meta-llama/Llama-3.1-8B-Instruct \
  --tasks mmlu,gsm8k,humaneval

Note: GGUF models require a separately specified tokenizer path; otherwise, lm-eval will attempt to reconstruct the vocabulary from the GGUF file — this can take hours and produce inaccurate results.

Step 4: Compare Results

# lm-eval outputs in JSON format; compare manually or use a script
# Key fields: results -> task_name -> metric_name
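A minimal comparison script over two parsed result files might look like the sketch below. The metric key names (such as `acc,none`) are common examples only; the exact keys vary by task and lm-eval version, so check your own JSON output:

```python
import json

def compare_results(baseline, quantized,
                    metrics=("acc,none", "exact_match,strict-match", "pass@1")):
    """Compare two parsed lm-eval result dicts task by task.

    Returns (task, metric, baseline_score, quantized_score, relative_change_pct)
    tuples for every metric present in both runs.
    """
    rows = []
    for task, base_scores in baseline["results"].items():
        quant_scores = quantized["results"].get(task, {})
        for metric in metrics:
            if metric in base_scores and metric in quant_scores:
                b, q = base_scores[metric], quant_scores[metric]
                change_pct = (q - b) / b * 100 if b else 0.0
                rows.append((task, metric, b, q, change_pct))
    return rows

# Usage with two hypothetical result files:
# baseline = json.load(open("results_fp16.json"))
# quantized = json.load(open("results_int4.json"))
baseline = {"results": {"mmlu": {"acc,none": 0.65}}}
quantized = {"results": {"mmlu": {"acc,none": 0.62}}}
for task, metric, b, q, d in compare_results(baseline, quantized):
    print(f"{task:12s} {metric:24s} {b:.3f} -> {q:.3f} ({d:+.1f}%)")
```

Reporting the relative change (not just the raw delta) makes runs on differently scaled metrics comparable at a glance.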

Step 5: Multiple-Run Validation

For critical decisions (such as production deployment), it’s recommended to run at least 3 times and average — some benchmarks (especially HumanEval’s pass@k) have high variance.

Backend Integration

lm-eval supports multiple inference backends:

| Backend | Command Argument | Use Case |
|---|---|---|
| HuggingFace Transformers | --model hf | Default, broadest support |
| vLLM | --model vllm | High-throughput evaluation, supports tensor parallelism |
| SGLang | --model sglang | FP8/INT4/AWQ/GPTQ quantization |
| OpenVINO | --model openvino | Intel platform optimization |
| API | --model openai | Commercial API models |

Common Pitfalls

  1. Prompt format sensitivity: The same benchmark can yield 5-10% score differences under different prompt templates. Ensure the baseline and quantized versions use the exact same prompt configuration
  2. Few-shot count: The standard for MMLU is 5-shot, GSM8K is typically 8-shot. Changing the few-shot count will significantly affect results
  3. Temperature setting: HumanEval’s pass@k is extremely sensitive to temperature. The optimal temperature for pass@1 is 0.2 (Codex paper), and 0.8 for pass@100
  4. Batch size impact: Some quantization methods are sensitive to batch size — ensure comparison experiments use the same batch size
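On pitfall 3: HumanEval's pass@k is typically computed with the unbiased estimator from the Codex paper, 1 - C(n-c, k)/C(n, k), where n is the number of samples generated per problem and c the number that pass the unit tests. Implementing it yourself makes the variance easy to study:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: samples that passed the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples, 20 passing: pass@1 is simply the pass rate, 0.1.
print(pass_at_k(200, 20, 1))
```

Generating n much larger than k (e.g. n=200 for pass@1) is what keeps the estimate stable; evaluating with n=k gives the same expectation but far higher run-to-run variance.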

Deep Dive 2: OpenVINO Accuracy Evaluation Toolchain

For Intel hardware users, OpenVINO provides a complete accuracy evaluation and optimization toolchain. Its unique advantage is accuracy-aware quantization — actively constraining accuracy loss during the quantization process.

Optimum Intel Conversion and Evaluation

Optimum Intel is a library co-developed by HuggingFace and Intel, providing seamless conversion from HuggingFace models to OpenVINO format:

from optimum.intel import OVModelForCausalLM

# Export to OpenVINO IR format
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    export=True
)
model.save_pretrained("./llama-3.1-8b-ov")

# INT8 quantized export
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    export=True,
    load_in_8bit=True
)

NNCF Accuracy-Aware Quantization

Accuracy-aware quantization in NNCF (Neural Network Compression Framework) is the core differentiating feature of the OpenVINO toolchain.

Important: NNCF does not provide any built-in accuracy metrics; validation_fn is a required parameter, and you must define how "accuracy" is measured yourself. This function takes (model, validation_dataset) as arguments and returns a numeric value, where higher means better:

import nncf
from nncf import DropType

# You must define your own accuracy measurement function
def validate_fn(model, validation_dataset):
    correct = 0
    total = 0
    for data in validation_dataset:
        # Run inference with the model, compute accuracy (or any metric you care about)
        output = model(data["input"])
        correct += (output == data["label"])
        total += 1
    return correct / total  # Higher is better

# Accuracy-aware quantization: constrain accuracy loss during the quantization process
quantized_model = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset=calibration_data,
    validation_dataset=validation_data,
    validation_fn=validate_fn,  # Required, no default
    max_drop=0.01,              # Maximum allowed accuracy drop (default: 0.01)
    drop_type=DropType.ABSOLUTE,  # ABSOLUTE: absolute drop; RELATIVE: proportional drop
)

The return value of validation_fn is entirely up to you — it can be task accuracy, F1 score, BLEU, or even the negation of perplexity (since perplexity is lower-is-better, but NNCF expects higher-is-better). drop_type controls how max_drop is interpreted: ABSOLUTE means absolute value drop (e.g., original 0.85, quantized must not fall below 0.84); RELATIVE means proportional drop (e.g., must not drop more than 1% of the original value).
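As a sketch of the negated-perplexity option, the wrapper below turns a perplexity measurement into a higher-is-better validation_fn. `collect_logprobs` is a hypothetical helper you would implement yourself to run the model over the validation set and return per-token log-probabilities:

```python
import math

def make_ppl_validation_fn(collect_logprobs):
    """Build an NNCF-style validation_fn from a perplexity measurement.

    NNCF expects higher-is-better, so the returned function yields the
    *negative* perplexity. `collect_logprobs(model, dataset)` is a
    user-supplied (hypothetical) helper returning natural-log token
    probabilities for the whole validation set.
    """
    def validation_fn(model, validation_dataset):
        logprobs = collect_logprobs(model, validation_dataset)
        nll = -sum(logprobs) / len(logprobs)   # mean negative log-likelihood
        return -math.exp(nll)                  # negated perplexity
    return validation_fn

# Sanity check with a stub that pretends every token got probability 1/e:
fn = make_ppl_validation_fn(lambda model, ds: [-1.0, -1.0])
print(fn(None, None))  # -e ~= -2.718
```

With this wrapper, max_drop=0.1 and DropType.ABSOLUTE would mean "perplexity may rise by at most 0.1" during quantization.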

Unlike pure post-training quantization (such as GPTQ), NNCF continuously calls your validation_fn during quantization to check accuracy — if any layer’s accuracy drops beyond the threshold after quantization, it automatically falls back that layer to higher precision (e.g., keeping it at FP16).

benchmark_app Performance Evaluation

Quantization affects not only accuracy but also inference speed. OpenVINO’s benchmark_app lets you measure both simultaneously:

# Measure FP16 latency
benchmark_app -m llama-3.1-8b-fp16.xml -d GPU -niter 100

# Measure INT8 latency
benchmark_app -m llama-3.1-8b-int8.xml -d GPU -niter 100

This allows you to build a complete accuracy-speed trade-off curve: How much faster is INT8 compared to FP16? How much accuracy is lost? Is this trade-off worth it for your scenario?

Deep Dive 3: llama.cpp Accuracy Evaluation

llama.cpp has built-in perplexity computation, making it the fastest way to evaluate GGUF quantization quality.

Built-in Perplexity Test

# Compute perplexity on WikiText-2
./llama-perplexity -m model-Q4_K_M.gguf \
  -f wiki.test.raw \
  --ctx-size 512 \
  --chunks 100

Example output:

Final estimate: PPL = 6.3842 +/- 0.0312
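The reported number is the exponential of the mean per-token negative log-likelihood over the evaluated chunks (the +/- value is an uncertainty estimate on that mean). A minimal reimplementation of the metric itself, from per-token log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_logprobs` holds the natural-log probability the model
    assigned to each observed token.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# If the model gives every token probability 1/e, per-token NLL is
# exactly 1, so perplexity is e ~= 2.718.
print(perplexity([-1.0, -1.0, -1.0]))
```

A perplexity of 6.38 therefore means the model is, on average, "as uncertain" as a uniform choice over about 6.4 tokens at each position.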

GGUF Variant Comparison

A common workflow is comparing different GGUF quantization variants of the same model:

| Quantization Variant | Typical Size (8B) | Perplexity | Use Case |
|---|---|---|---|
| F16 | ~16 GB | 6.24 (baseline) | Accuracy reference |
| Q8_0 | ~8.5 GB | ~6.25 (+0.2%) | Nearly lossless, recommended first choice |
| Q6_K | ~6.6 GB | ~6.26 (+0.3%) | High quality, suitable for most scenarios |
| Q5_K_M | ~5.7 GB | ~6.30 (+1.0%) | Good balance point |
| Q4_K_M | ~4.9 GB | ~6.38 (+2.2%) | Most commonly used; check code tasks |
| Q3_K_M | ~3.9 GB | ~6.55 (+5.0%) | Use when memory-constrained; check specific tasks |
| Q2_K | ~3.2 GB | ~7.20 (+15.4%) | Only for extremely resource-constrained scenarios |

The data above are representative values for Llama 3.1 8B on WikiText-2, based on aggregated community GGUF evaluation results (TheBloke, bartowski, and other HuggingFace repositories). Specific values may vary depending on model version and evaluation conditions.

Impact of KV Cache Quantization

llama.cpp supports KV cache quantization (--cache-type-k q8_0 --cache-type-v q8_0), which introduces a small additional accuracy loss:

  • KV cache INT8: Perplexity increase of approximately 0.1-0.3%, virtually no impact on most tasks
  • KV cache INT4: Perplexity increase of approximately 0.5-1.5%, with more noticeable degradation in long-context scenarios

The accuracy impact of KV cache quantization has an additive relationship with weight quantization — if weights are already Q4_K_M, adding KV cache INT4 on top may push accuracy degradation beyond acceptable limits.

Perplexity vs. Task Accuracy

Perplexity is the easiest accuracy metric to obtain — it runs fast, requires no labeled data, and needs no complex evaluation framework. But it has a critical limitation: the relationship between perplexity and task-specific accuracy is not linear.

[Interactive chart: Perplexity vs. task accuracy change for Llama 3.1 8B GGUF, plotting the perplexity change rate against the HumanEval accuracy change rate across quantization levels, from FP16 (ppl 6.24) through Q8_0, Q6_K, Q5_K_M, Q4_K_M, and Q3_K_M down to Q2_K (ppl 7.20). The two curves diverge below Q4_K_M.]

Key findings:

  • At Q5_K_M and above, perplexity and task accuracy move largely in sync; perplexity is a reliable quality signal
  • Below Q4_K_M, task accuracy (especially on code) drops far faster than perplexity would predict; you cannot judge by perplexity alone
  • Perplexity is a quick screening tool, not a substitute for task-specific evaluation

Why Do They Diverge?

Perplexity measures the model’s overall ability to predict the next token — it’s the average prediction loss across all tokens. But different tasks care about different tokens:

  • Knowledge QA (MMLU): The answer is typically a single option letter (A/B/C/D), and only very few tokens are critical. The large volume of “background” tokens in perplexity dilutes the signal from key tokens
  • Math reasoning (GSM8K): Requires generating a complete reasoning chain, where a single error in an intermediate step leads to an incorrect final answer. Perplexity doesn’t specially weight these critical steps
  • Code generation (HumanEval): As mentioned earlier, syntactic precision requirements are extremely high. Perplexity doesn’t penalize “almost correct but syntactically wrong” code any more than other errors

Practical Recommendations

  1. Perplexity is good for quick screening: If perplexity increases by more than 3%, task accuracy has very likely degraded noticeably — you can rule it out immediately
  2. Perplexity is not suitable for fine-grained decisions: The perplexity difference between Q4_K_M and Q5_K_M might be only 1%, but on code tasks they could differ by 2-3 percentage points
  3. Critical scenarios require actual benchmarking: If your application involves code generation or math reasoning, perplexity isn’t enough — you need full evaluation on your target benchmarks
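Recommendation 1 can be encoded as a one-line gate for batch-screening quantization variants. The 3% threshold below is the rule of thumb from above, not a universal constant; tune it for your task:

```python
def ppl_screen(ppl_baseline, ppl_quant, max_increase_pct=3.0):
    """Quick screening: reject a quantization variant whose perplexity
    rises more than `max_increase_pct` percent over the baseline.

    Returns (passes, relative_increase_pct).
    """
    increase = (ppl_quant - ppl_baseline) / ppl_baseline * 100
    return increase <= max_increase_pct, increase

# Using the Q4_K_M vs F16 numbers from the GGUF table above:
ok, inc = ppl_screen(6.24, 6.38)
print(ok, round(inc, 2))  # passes at +2.24%
```

Variants that pass this gate still need the full benchmark run (step 2 of the combined workflow); the gate only saves you from benchmarking variants that are obviously broken.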

Toolchain Comparison Summary

The three toolchains each have their positioning; the choice depends on your scenario:

Accuracy Evaluation Toolchain Comparison

Positioning and applicable scenarios of the three mainstream toolchains:

| | lm-evaluation-harness | OpenVINO (Optimum Intel + NNCF) | llama.cpp perplexity |
|---|---|---|---|
| Use case | General model evaluation; before/after quantization comparison | Evaluation for Intel hardware deployment; accuracy-aware quantization | Quick GGUF quantization quality checks |
| Supported formats | HuggingFace, vLLM, SGLang, GGUF (limited support) | OpenVINO IR (converted from HuggingFace) | GGUF only |
| Metrics | All benchmark scores (accuracy, F1, pass@k, etc.) | Benchmark scores plus throughput/latency (benchmark_app) | Perplexity (WikiText-2, etc.) |
| Benchmark coverage | Broadest: 60+ mainstream benchmarks, hundreds of subtasks | Mainstream benchmarks via lm-eval-harness integration | Perplexity only; no task-specific benchmarks |
| Hardware platforms | Primarily NVIDIA GPUs; CPU works but is slow | Intel CPU / iGPU / Arc GPU (its core strength) | CPU / GPU / Apple Silicon (cross-platform) |

Decision guide:

  • Need a comprehensive evaluation across multiple benchmarks → lm-evaluation-harness
  • Deploying on Intel platforms and need joint accuracy + performance evaluation → OpenVINO
  • Need a quick check of GGUF quantization quality → llama.cpp perplexity

Best Practices for Combined Use

In real-world projects, these three tools are often not either/or, but used in phases:

  1. Quick screening phase: Use llama.cpp perplexity to quickly eliminate obviously unqualified quantization variants (immediately discard any with perplexity increase >5%)
  2. Detailed evaluation phase: Use lm-eval-harness for full evaluation on target benchmarks (select 2-3 candidate variants)
  3. Deployment validation phase: If targeting Intel platforms, use the OpenVINO toolchain for joint accuracy + performance evaluation

From "How Much Accuracy Is Lost" to "Which Model Should I Choose"

This article addressed the question of “how to measure accuracy after optimization.” But in real-world model selection, accuracy is only one dimension of the decision — you also need to consider inference speed, memory usage, deployment cost, and the signals from various leaderboards.

The next article, Leaderboards and Model Selection, will systematically analyze the design differences and applicable scenarios of major leaderboards (Open LLM Leaderboard, Chatbot Arena, LiveBench), and build a practical model selection decision framework.