Impact of Optimization on Accuracy
Updated 2026-04-16
Opening: 2-4x Speedup — But How Much Accuracy Do You Actually Lose?
Quantization, pruning, KV cache compression — these optimization techniques promise 2-4x inference speedups and halved memory usage for LLMs. But at what cost? How much accuracy do you actually lose? More importantly: how do you verify this yourself?
This isn’t a simple question. Under the same INT4 quantization, a 7B model might lose only 3 percentage points on MMLU but drop by 16% on HumanEval. The same quantization method might be barely noticeable on a 70B model yet compromise practical usability on a 7B model.
This article focuses on evaluation methodology — we won’t cover the internals of quantization algorithms (that belongs in the quantization path), but instead answer three core questions:
- How large is the accuracy cost of different optimization techniques? The sensitivity differences across benchmarks are striking
- How do you measure it yourself? A detailed look at three toolchains: lm-evaluation-harness, OpenVINO, and llama.cpp
- Is perplexity enough? When it’s a good metric, and when it can mislead you
Scope note: This article belongs to the evaluation path and focuses on “how to measure accuracy loss.” For the internals of quantization algorithms themselves (GPTQ, AWQ, GGUF quantization types, etc.), see the dedicated articles in the quantization path.
The Full Landscape of Optimization Techniques and Accuracy Costs
Before diving into evaluation methods, let’s establish a panoramic view — the typical accuracy cost ranges for mainstream optimization techniques:
| Optimization Type | Typical Methods | Typical Accuracy Loss | Sensitivity Notes |
|---|---|---|---|
| Weight-only Quantization | GPTQ, AWQ, bitsandbytes NF4 | INT8: <1%; INT4: 1-5% | Code/math tasks are most sensitive |
| Weight + Activation Quantization | SmoothQuant, ZeroQuant | INT8: 0.5-2%; W4A8: 2-8% | Activation quantization adds extra loss |
| GGUF Quantization | llama.cpp Q4_K_M, Q3_K_M, etc. | Q4: 1-3%; Q2: 8-20% | Degradation accelerates sharply below Q4 |
| FP8 Quantization | FP8 E4M3/E5M2 | <0.5% | Nearly lossless, but requires hardware support |
| KV Cache Quantization | KV cache INT8/INT4 | 0.1-2% | Greater impact in long-context scenarios |
| Sparsification | SparseGPT, Wanda | 10-50% sparse: 1-5% | Non-linear degradation at high sparsity rates |
Key insight: These are typical ranges, not fixed values. Actual degradation depends on model architecture, scale, calibration data, and the specific task you care about. This is precisely why you need to measure empirically.
Two crucial patterns:
- Larger models are more quantization-resilient: 70B models have far more parameter redundancy than 7B models, so quantization noise has less impact on overall output
- Task type determines sensitivity: Code generation (requires precise syntax) > math reasoning (requires precise computation chains) > knowledge QA (more tolerant of errors)
The Unevenness of Degradation
The “typical accuracy loss” mentioned above is an average — but in reality, degradation varies enormously across different benchmarks. This is the most commonly overlooked pitfall when choosing evaluation metrics.
The interactive chart below lets you explore this unevenness yourself:
[Interactive chart: Quantization Accuracy Degradation Explorer. Select a model scale and view mode to explore how different quantization methods affect each benchmark. The data are representative trend values based on literature and community evaluations; exact scores vary by model version and evaluation conditions.]
Key findings:
- Sensitivity ranking: code > math > knowledge; code generation is the most sensitive to quantization
- Larger models tolerate quantization better: a 70B model loses far less accuracy under INT4 than a 7B model (parameter redundancy provides a buffer)
- In the 7B view, the largest degradation (16.7%) appears at INT4 × MATH
Why Is Code the Most Sensitive?
Code generation is particularly sensitive to quantization because its tolerance for errors is extremely small:
- Syntactic rigidity: A missing bracket or wrong indentation makes the code completely non-executable. In knowledge QA, “roughly correct” still earns partial credit; in code, “roughly correct” equals zero
- Precise token selection: Code generation depends heavily on subtle differences in token probability distributions — the small probability shifts introduced by quantization can cause the model to pick the wrong critical token (e.g., == becoming =)
- Long-range dependencies: A function’s correctness depends on the precision of all preceding variable declarations and import statements. Quantization errors accumulate over long sequences
Why Are Larger Models More Quantization-Resilient?
This pattern is very clear in the data (switch to 70B vs. 7B comparison to see it):
- Parameter redundancy: Larger models have more “backup channels,” making it easier for quantization errors in individual weights to be compensated by other channels
- Smoother loss surface: Weight distributions in larger models tend to be smoother with a lower proportion of outliers, making them more quantization-friendly
- Practical implication: If your task is extremely accuracy-sensitive (e.g., medical code generation), prefer a large model + moderate quantization (e.g., INT8) over a small model + no quantization
Deep Dive 1: lm-evaluation-harness Hands-On Workflow
lm-evaluation-harness (commonly called lm-eval) is currently the most mainstream LLM evaluation framework, maintained by EleutherAI, covering 60+ standard benchmarks. It’s the go-to tool for comparing accuracy before and after quantization.
5-Step Hands-On Workflow
Step 1: Installation
pip install lm-eval
# Or install from source (to get the latest benchmarks)
pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git
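To confirm the installation and see which benchmarks are available, recent versions of lm-eval let you list the registered tasks:
# List all registered benchmarks; useful for finding exact task names
lm_eval --tasks list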
Step 2: Evaluate the FP16 Baseline
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
--tasks mmlu,gsm8k,humaneval \
--device cuda:0 \
--batch_size auto
Step 3: Evaluate the Quantized Version
# GPTQ quantized model (using a community-quantized version as an example)
lm_eval --model hf \
--model_args pretrained=<your-gptq-model-path> \
--tasks mmlu,gsm8k,humaneval \
--device cuda:0 \
--batch_size auto
# GGUF quantized model
lm_eval --model hf \
--model_args pretrained=/path/to/gguf_folder,gguf_file=model-Q4_K_M.gguf,tokenizer=meta-llama/Llama-3.1-8B-Instruct \
--tasks mmlu,gsm8k,humaneval
Note: GGUF models require a separately specified tokenizer path; otherwise, lm-eval will attempt to reconstruct the vocabulary from the GGUF file — this can take hours and produce inaccurate results.
Step 4: Compare Results
# lm-eval outputs in JSON format; compare manually or use a script
# Key fields: results -> task_name -> metric_name
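A minimal comparison sketch, assuming both runs were saved with --output_path and follow the results -> task -> metric layout described above (the file names are placeholders for your own output files):
import json

# Placeholder file names: point these at your own lm-eval output files
with open("baseline_fp16.json") as f:
    baseline = json.load(f)["results"]
with open("quantized_int4.json") as f:
    quantized = json.load(f)["results"]

# Print per-task, per-metric deltas between the FP16 baseline and the quantized run
for task, metrics in baseline.items():
    for metric, base_val in metrics.items():
        quant_val = quantized.get(task, {}).get(metric)
        if isinstance(base_val, (int, float)) and isinstance(quant_val, (int, float)):
            delta = quant_val - base_val
            print(f"{task:>12}  {metric:>20}  {base_val:.4f} -> {quant_val:.4f}  ({delta:+.4f})")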
Step 5: Multiple-Run Validation
For critical decisions (such as production deployment), it’s recommended to run at least 3 times and average — some benchmarks (especially HumanEval’s pass@k) have high variance.
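A sketch of such a repeated run, assuming your lm-eval version exposes a --seed argument (check lm_eval --help; the output path layout is illustrative):
# Run the same evaluation three times with different seeds, then average the scores offline
for seed in 0 1 2; do
  lm_eval --model hf \
    --model_args pretrained=<your-quantized-model-path> \
    --tasks humaneval \
    --seed $seed \
    --output_path results/run_${seed}
done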
Backend Integration
lm-eval supports multiple inference backends:
| Backend | Command Argument | Use Case |
|---|---|---|
| HuggingFace Transformers | --model hf | Default, broadest support |
| vLLM | --model vllm | High-throughput evaluation, supports tensor parallelism |
| SGLang | --model sglang | FP8/INT4/AWQ/GPTQ quantization |
| OpenVINO | --model openvino | Intel platform optimization |
| API | --model openai | Commercial API models |
Common Pitfalls
- Prompt format sensitivity: The same benchmark can yield 5-10% score differences under different prompt templates. Ensure the baseline and quantized versions use the exact same prompt configuration
- Few-shot count: The standard for MMLU is 5-shot, GSM8K is typically 8-shot. Changing the few-shot count will significantly affect results
- Temperature setting: HumanEval’s pass@k is extremely sensitive to temperature. The optimal temperature for pass@1 is 0.2 (Codex paper), and 0.8 for pass@100
- Batch size impact: Some quantization methods are sensitive to batch size — ensure comparison experiments use the same batch size
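One way to guard against these pitfalls is to pin the settings explicitly on both runs. The flags below exist in current lm-eval releases, but verify them against your installed version:
# Pin few-shot count and generation settings so the baseline and quantized runs match exactly
lm_eval --model hf \
  --model_args pretrained=<model-path> \
  --tasks gsm8k \
  --num_fewshot 8 \
  --gen_kwargs temperature=0.0,do_sample=False \
  --batch_size 8 \
  --device cuda:0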
Deep Dive 2: OpenVINO Accuracy Evaluation Toolchain
For Intel hardware users, OpenVINO provides a complete accuracy evaluation and optimization toolchain. Its unique advantage is accuracy-aware quantization — actively constraining accuracy loss during the quantization process.
Optimum Intel Conversion and Evaluation
Optimum Intel is a library co-developed by HuggingFace and Intel, providing seamless conversion from HuggingFace models to OpenVINO format:
from optimum.intel import OVModelForCausalLM
# Export to OpenVINO IR format
model = OVModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
export=True
)
model.save_pretrained("./llama-3.1-8b-ov")
# INT8 quantized export
model = OVModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
export=True,
load_in_8bit=True
)
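The exported directory can then be fed back into lm-eval via its OpenVINO backend (see the backend table above) to compare accuracy against the original HuggingFace checkpoint. The model_args keys shown here are an assumption; check the lm-eval documentation for your version:
# Evaluate the exported OpenVINO model with lm-eval's openvino backend
lm_eval --model openvino \
  --model_args pretrained=./llama-3.1-8b-ov \
  --tasks mmlu,gsm8k \
  --batch_size 1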
NNCF Accuracy-Aware Quantization
The accuracy-aware quantization in NNCF (Neural Network Compression Framework) is the core differentiating feature of the OpenVINO toolchain.
Important: NNCF does not provide any built-in accuracy metrics — validation_fn is a required parameter, and you must define how “accuracy” is measured yourself. This function takes (model, validation_dataset) as arguments and returns a numeric value (higher values indicate better model performance):
import nncf
from nncf import DropType
# You must define your own accuracy measurement function
def validate_fn(model, validation_dataset):
    correct = 0
    total = 0
    for data in validation_dataset:
        # Run inference with the model, compute accuracy (or any metric you care about)
        output = model(data["input"])
        correct += (output == data["label"])
        total += 1
    return correct / total  # Higher is better

# Accuracy-aware quantization: constrain accuracy loss during the quantization process
quantized_model = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset=calibration_data,
    validation_dataset=validation_data,
    validation_fn=validate_fn,    # Required, no default
    max_drop=0.01,                # Maximum allowed accuracy drop (default: 0.01)
    drop_type=DropType.ABSOLUTE,  # ABSOLUTE: absolute drop; RELATIVE: proportional drop
)
The return value of validation_fn is entirely up to you — it can be task accuracy, F1 score, BLEU, or even the negation of perplexity (since perplexity is lower-is-better, but NNCF expects higher-is-better). drop_type controls how max_drop is interpreted: ABSOLUTE means absolute value drop (e.g., original 0.85, quantized must not fall below 0.84); RELATIVE means proportional drop (e.g., must not drop more than 1% of the original value).
Unlike pure post-training quantization (such as GPTQ), NNCF continuously calls your validation_fn during quantization to check accuracy — if any layer’s accuracy drops beyond the threshold after quantization, it automatically falls back that layer to higher precision (e.g., keeping it at FP16).
benchmark_app Performance Evaluation
Quantization affects not only accuracy but also inference speed. OpenVINO’s benchmark_app lets you measure both simultaneously:
# Measure FP16 latency
benchmark_app -m llama-3.1-8b-fp16.xml -d GPU -niter 100
# Measure INT8 latency
benchmark_app -m llama-3.1-8b-int8.xml -d GPU -niter 100
This allows you to build a complete accuracy-speed trade-off curve: How much faster is INT8 compared to FP16? How much accuracy is lost? Is this trade-off worth it for your scenario?
Deep Dive 3: llama.cpp Accuracy Evaluation
llama.cpp has built-in perplexity computation, making it the fastest way to evaluate GGUF quantization quality.
Built-in Perplexity Test
# Compute perplexity on WikiText-2
./llama-perplexity -m model-Q4_K_M.gguf \
-f wiki.test.raw \
--ctx-size 512 \
--chunks 100
Example output:
Final estimate: PPL = 6.3842 +/- 0.0312
GGUF Variant Comparison
A common workflow is comparing different GGUF quantization variants of the same model:
| Quantization Variant | Typical Size (8B) | Perplexity | Use Case |
|---|---|---|---|
| F16 | ~16 GB | 6.24 (baseline) | Accuracy reference |
| Q8_0 | ~8.5 GB | ~6.25 (+0.2%) | Nearly lossless, recommended first choice |
| Q6_K | ~6.6 GB | ~6.26 (+0.3%) | High quality, suitable for most scenarios |
| Q5_K_M | ~5.7 GB | ~6.30 (+1.0%) | Good balance point |
| Q4_K_M | ~4.9 GB | ~6.38 (+2.2%) | Most commonly used; check code tasks |
| Q3_K_M | ~3.9 GB | ~6.55 (+5.0%) | Use when memory-constrained; check specific tasks |
| Q2_K | ~3.2 GB | ~7.20 (+15.4%) | Only for extremely resource-constrained scenarios |
The data above are representative values for Llama 3.1 8B on WikiText-2, based on aggregated community GGUF evaluation results (TheBloke, bartowski, and other HuggingFace repositories). Specific values may vary depending on model version and evaluation conditions.
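A simple way to reproduce such a table for your own model is to loop llama-perplexity over the variants you have on disk (file names are placeholders for your own quantized files):
# Compare perplexity across GGUF variants of the same model
for q in Q8_0 Q6_K Q5_K_M Q4_K_M Q3_K_M Q2_K; do
  echo "=== ${q} ==="
  ./llama-perplexity -m model-${q}.gguf \
    -f wiki.test.raw \
    --ctx-size 512 \
    --chunks 100
done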
Impact of KV Cache Quantization
llama.cpp supports KV cache quantization (--cache-type-k q8_0 --cache-type-v q8_0), which introduces a small additional accuracy loss:
- KV cache INT8: Perplexity increase of approximately 0.1-0.3%, virtually no impact on most tasks
- KV cache INT4: Perplexity increase of approximately 0.5-1.5%, with more noticeable degradation in long-context scenarios
The accuracy impact of KV cache quantization has an additive relationship with weight quantization — if weights are already Q4_K_M, adding KV cache INT4 on top may push accuracy degradation beyond acceptable limits.
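To quantify this on your own setup, rerun the perplexity test with KV cache quantization enabled and compare it against the weight-only run above (depending on your llama.cpp build, a quantized V cache may additionally require flash attention to be enabled):
# Same perplexity test as before, but with an INT8-quantized KV cache on top of Q4_K_M weights
./llama-perplexity -m model-Q4_K_M.gguf \
  -f wiki.test.raw \
  --ctx-size 512 \
  --chunks 100 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0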
Perplexity vs. Task Accuracy
Perplexity is the easiest accuracy metric to obtain — it runs fast, requires no labeled data, and needs no complex evaluation framework. But it has a critical limitation: the relationship between perplexity and task-specific accuracy is not linear.
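As a reminder of what the number actually measures: perplexity is the exponentiated average negative log-likelihood of a held-out text,
PPL = exp( -(1/N) · Σ_i log p(x_i | x_1 … x_{i-1}) )
Every one of the N tokens contributes equally to this average, which is exactly why it can diverge from task accuracy, where only a handful of tokens decide success or failure.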
[Interactive chart: Perplexity vs. task accuracy for Llama 3.1 8B GGUF, showing how quantization level affects perplexity and benchmark scores.]
Key findings:
- At Q5_K_M and above: perplexity and task accuracy track each other closely, so perplexity is a reliable quality indicator
- Below Q4_K_M: task accuracy (especially code) falls far faster than perplexity would suggest; do not judge by perplexity alone
- Perplexity is a quick screening tool, not a replacement for task-specific evaluation
Why Do They Diverge?
Perplexity measures the model’s overall ability to predict the next token — it’s the average prediction loss across all tokens. But different tasks care about different tokens:
- Knowledge QA (MMLU): The answer is typically a single option letter (A/B/C/D), and only very few tokens are critical. The large volume of “background” tokens in perplexity dilutes the signal from key tokens
- Math reasoning (GSM8K): Requires generating a complete reasoning chain, where a single error in an intermediate step leads to an incorrect final answer. Perplexity doesn’t specially weight these critical steps
- Code generation (HumanEval): As mentioned earlier, syntactic precision requirements are extremely high. Perplexity doesn’t penalize “almost correct but syntactically wrong” code any more than other errors
Practical Recommendations
- Perplexity is good for quick screening: If perplexity increases by more than 3%, task accuracy has very likely degraded noticeably — you can rule it out immediately
- Perplexity is not suitable for fine-grained decisions: The perplexity difference between Q4_K_M and Q5_K_M might be only 1%, but on code tasks they could differ by 2-3 percentage points
- Critical scenarios require actual benchmarking: If your application involves code generation or math reasoning, perplexity isn’t enough — you need full evaluation on your target benchmarks
Toolchain Comparison Summary
The three toolchains each have their positioning; the choice depends on your scenario:
| Toolchain | Positioning | Best Suited For |
|---|---|---|
| lm-evaluation-harness | Standard benchmark evaluation: 60+ tasks, multiple inference backends | Full before/after comparisons on MMLU, GSM8K, HumanEval, etc. |
| OpenVINO (Optimum Intel + NNCF) | Accuracy-aware quantization plus joint accuracy and performance measurement | Deployment on Intel hardware |
| llama.cpp perplexity | Built-in, fast perplexity computation for GGUF models | Quick screening of GGUF quantization variants |
Best Practices for Combined Use
In real-world projects, these three tools are often not either/or, but used in phases:
- Quick screening phase: Use llama.cpp perplexity to quickly eliminate obviously unqualified quantization variants (immediately discard any with perplexity increase >5%)
- Detailed evaluation phase: Use lm-eval-harness for full evaluation on target benchmarks (select 2-3 candidate variants)
- Deployment validation phase: If targeting Intel platforms, use the OpenVINO toolchain for joint accuracy + performance evaluation
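A rough command-level sketch of this phased workflow, reusing the commands from earlier sections (paths and thresholds are illustrative):
# Phase 1 (quick screening): discard any variant whose perplexity increase exceeds ~5%
./llama-perplexity -m model-Q4_K_M.gguf -f wiki.test.raw --ctx-size 512 --chunks 100
./llama-perplexity -m model-Q3_K_M.gguf -f wiki.test.raw --ctx-size 512 --chunks 100
# Phase 2 (detailed evaluation): full runs for the surviving candidates on your target benchmarks
lm_eval --model hf \
  --model_args pretrained=<candidate-model-path> \
  --tasks mmlu,gsm8k,humaneval \
  --batch_size auto
# Phase 3 (Intel targets only): joint accuracy + latency validation with OpenVINO
benchmark_app -m candidate-int8.xml -d GPU -niter 100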
Transition: From “How Much Accuracy Is Lost” to “Which Model Should I Choose”
This article addressed the question of “how to measure accuracy after optimization.” But in real-world model selection, accuracy is only one dimension of the decision — you also need to consider inference speed, memory usage, deployment cost, and the signals from various leaderboards.
The next article, Leaderboards and Model Selection, will systematically analyze the design differences and applicable scenarios of major leaderboards (Open LLM Leaderboard, Chatbot Arena, LiveBench), and build a practical model selection decision framework.