Impact of Optimization on Accuracy
Updated 2026-04-16
Opening: 2-4x Speedup — But How Much Accuracy Do You Actually Lose?
Quantization, pruning, KV cache compression — these optimization techniques promise 2-4x inference speedups and halved memory usage for LLMs. But at what cost? How much accuracy do you actually lose? More importantly: how do you verify this yourself?
This isn’t a simple question. Under the same INT4 quantization, a 7B model might lose only 3 percentage points on MMLU but drop by 16% on HumanEval. The same quantization method might be barely noticeable on a 70B model yet compromise practical usability on a 7B model.
This article focuses on evaluation methodology — we won’t cover the internals of quantization algorithms (that belongs in the quantization path), but instead answer three core questions:
- How large is the accuracy cost of different optimization techniques? The sensitivity differences across benchmarks are striking
- How do you measure it yourself? A detailed look at three toolchains: lm-evaluation-harness, OpenVINO, and llama.cpp
- Is perplexity enough? When it’s a good metric, and when it can mislead you
Scope note: This article belongs to the evaluation path and focuses on “how to measure accuracy loss.” For the internals of quantization algorithms themselves (GPTQ, AWQ, GGUF quantization types, etc.), see the dedicated articles in the quantization path.
The Full Landscape of Optimization Techniques and Accuracy Costs
Before diving into evaluation methods, let’s establish a panoramic view — the typical accuracy cost ranges for mainstream optimization techniques:
| Optimization Type | Typical Methods | Typical Accuracy Loss | Sensitivity Notes |
|---|---|---|---|
| Weight-only Quantization | GPTQ, AWQ, bitsandbytes NF4 | INT8: <1%; INT4: 1-5% | Code/math tasks are most sensitive |
| Weight + Activation Quantization | SmoothQuant, ZeroQuant | INT8: 0.5-2%; W4A8: 2-8% | Activation quantization adds extra loss |
| GGUF Quantization | llama.cpp Q4_K_M, Q3_K_M, etc. | Q4: 1-3%; Q2: 8-20% | Degradation accelerates sharply below Q4 |
| FP8 Quantization | FP8 E4M3/E5M2 | <0.5% | Nearly lossless, but requires hardware support |
| KV Cache Quantization | KV cache INT8/INT4 | 0.1-2% | Greater impact in long-context scenarios |
| Sparsification | SparseGPT, Wanda | 10-50% sparse: 1-5% | Non-linear degradation at high sparsity rates |
Key insight: These are typical ranges, not fixed values. Actual degradation depends on model architecture, scale, calibration data, and the specific task you care about. This is precisely why you need to measure empirically.
Two crucial patterns:
- Larger models are more quantization-resilient: 70B models have far more parameter redundancy than 7B models, so quantization noise has less impact on overall output
- Task type determines sensitivity: Code generation (requires precise syntax) > math reasoning (requires precise computation chains) > knowledge QA (more tolerant of errors)
The Unevenness of Degradation
The “typical accuracy loss” mentioned above is an average — but in reality, degradation varies enormously across different benchmarks. This is the most commonly overlooked pitfall when choosing evaluation metrics.
The interactive chart below lets you explore this unevenness yourself:
[Interactive chart: Quantization Accuracy Degradation Explorer. Select a model scale and view mode to explore how different quantization methods affect each benchmark. The data are representative trend values based on literature and community evaluations; exact scores vary by model version and evaluation conditions.]
Key findings:
- Sensitivity ranking: code > math > knowledge; code generation is the most sensitive to quantization
- Larger models tolerate quantization better: a 70B model loses far less accuracy under INT4 than a 7B model (parameter redundancy provides a buffer)
- In the 7B view, the largest degradation (16.7%) appears at INT4 × MATH
Why Is Code the Most Sensitive?
Code generation is particularly sensitive to quantization because its tolerance for errors is extremely small:
- Syntactic rigidity: A missing bracket or wrong indentation makes the code completely non-executable. In knowledge QA, “roughly correct” still earns partial credit; in code, “roughly correct” equals zero
- Precise token selection: Code generation depends heavily on subtle differences in token probability distributions — the small probability shifts introduced by quantization can cause the model to pick the wrong critical token (e.g., == becoming =)
- Long-range dependencies: A function’s correctness depends on the precision of all preceding variable declarations and import statements. Quantization errors accumulate over long sequences
Why Are Larger Models More Quantization-Resilient?
This pattern is very clear in the data (switch to 70B vs. 7B comparison to see it):
- Parameter redundancy: Larger models have more “backup channels,” making it easier for quantization errors in individual weights to be compensated by other channels
- Smoother loss surface: Weight distributions in larger models tend to be smoother with a lower proportion of outliers, making them more quantization-friendly
- Practical implication: If your task is extremely accuracy-sensitive (e.g., medical code generation), prefer a large model + moderate quantization (e.g., INT8) over a small model + no quantization
Deep Dive 1: lm-evaluation-harness Hands-On Workflow
lm-evaluation-harness (commonly called lm-eval) is currently the most mainstream LLM evaluation framework, maintained by EleutherAI, covering 60+ standard benchmarks. It’s the go-to tool for comparing accuracy before and after quantization.
5-Step Hands-On Workflow
Step 1: Installation
pip install lm-eval
# Or install from source (to get the latest benchmarks)
pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git
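To confirm the installation and see which benchmarks are available, recent versions of lm-eval let you list the registered tasks:
# List all registered benchmarks; useful for finding exact task names
lm_eval --tasks list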
Step 2: Evaluate the FP16 Baseline
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
--tasks mmlu,gsm8k,humaneval \
--device cuda:0 \
--batch_size auto
Step 3: Evaluate the Quantized Version
# GPTQ quantized model (using a community-quantized version as an example)
lm_eval --model hf \
--model_args pretrained=<your-gptq-model-path> \
--tasks mmlu,gsm8k,humaneval \
--device cuda:0 \
--batch_size auto
# GGUF quantized model
lm_eval --model hf \
--model_args pretrained=/path/to/gguf_folder,gguf_file=model-Q4_K_M.gguf,tokenizer=meta-llama/Llama-3.1-8B-Instruct \
--tasks mmlu,gsm8k,humaneval
Note: GGUF models require a separately specified tokenizer path; otherwise, lm-eval will attempt to reconstruct the vocabulary from the GGUF file — this can take hours and produce inaccurate results.
Step 4: Compare Results
# lm-eval outputs in JSON format; compare manually or use a script
# Key fields: results -> task_name -> metric_name
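A minimal comparison sketch, assuming both runs were saved with --output_path and follow the results -> task -> metric layout described above (the file names are placeholders for your own output files):
import json

# Placeholder file names: point these at your own lm-eval output files
with open("baseline_fp16.json") as f:
    baseline = json.load(f)["results"]
with open("quantized_int4.json") as f:
    quantized = json.load(f)["results"]

# Print per-task, per-metric deltas between the FP16 baseline and the quantized run
for task, metrics in baseline.items():
    for metric, base_val in metrics.items():
        quant_val = quantized.get(task, {}).get(metric)
        if isinstance(base_val, (int, float)) and isinstance(quant_val, (int, float)):
            delta = quant_val - base_val
            print(f"{task:>12}  {metric:>20}  {base_val:.4f} -> {quant_val:.4f}  ({delta:+.4f})")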
Step 5: Multiple-Run Validation
For critical decisions (such as production deployment), it’s recommended to run at least 3 times and average — some benchmarks (especially HumanEval’s pass@k) have high variance.
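A sketch of such a repeated run, assuming your lm-eval version exposes a --seed argument (check lm_eval --help; the output path layout is illustrative):
# Run the same evaluation three times with different seeds, then average the scores offline
for seed in 0 1 2; do
  lm_eval --model hf \
    --model_args pretrained=<your-quantized-model-path> \
    --tasks humaneval \
    --seed $seed \
    --output_path results/run_${seed}
done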
Backend Integration
lm-eval supports multiple inference backends:
| Backend | Command Argument | Use Case |
|---|---|---|
| HuggingFace Transformers | --model hf | Default, broadest support |
| vLLM | --model vllm | High-throughput evaluation, supports tensor parallelism |
| SGLang | --model sglang | FP8/INT4/AWQ/GPTQ quantization |
| OpenVINO | --model openvino | Intel platform optimization |
| API | --model openai | Commercial API models |
Common Pitfalls
- Prompt format sensitivity: The same benchmark can yield 5-10% score differences under different prompt templates. Ensure the baseline and quantized versions use the exact same prompt configuration
- Few-shot count: The standard for MMLU is 5-shot, GSM8K is typically 8-shot. Changing the few-shot count will significantly affect results
- Temperature setting: HumanEval’s pass@k is extremely sensitive to temperature. The optimal temperature for pass@1 is 0.2 (Codex paper), and 0.8 for pass@100
- Batch size impact: Some quantization methods are sensitive to batch size — ensure comparison experiments use the same batch size
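One way to guard against these pitfalls is to pin the settings explicitly on both runs. The flags below exist in current lm-eval releases, but verify them against your installed version:
# Pin few-shot count and generation settings so the baseline and quantized runs match exactly
lm_eval --model hf \
  --model_args pretrained=<model-path> \
  --tasks gsm8k \
  --num_fewshot 8 \
  --gen_kwargs temperature=0.0,do_sample=False \
  --batch_size 8 \
  --device cuda:0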
Deep Dive 2: OpenVINO Accuracy Evaluation Toolchain
For Intel hardware users, OpenVINO provides a complete accuracy evaluation and optimization toolchain. Its unique advantage is accuracy-aware quantization — actively constraining accuracy loss during the quantization process.
Optimum Intel Conversion and Evaluation
Optimum Intel is a library co-developed by HuggingFace and Intel, providing seamless conversion from HuggingFace models to OpenVINO format:
from optimum.intel import OVModelForCausalLM
# Export to OpenVINO IR format
model = OVModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
export=True
)
model.save_pretrained("./llama-3.1-8b-ov")
# INT8 quantized export
model = OVModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
export=True,
load_in_8bit=True
)
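The exported directory can then be fed back into lm-eval via its OpenVINO backend (see the backend table above) to compare accuracy against the original HuggingFace checkpoint. The model_args keys shown here are an assumption; check the lm-eval documentation for your version:
# Evaluate the exported OpenVINO model with lm-eval's openvino backend
lm_eval --model openvino \
  --model_args pretrained=./llama-3.1-8b-ov \
  --tasks mmlu,gsm8k \
  --batch_size 1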
NNCF Accuracy-Aware Quantization
The accuracy-aware quantization in NNCF (Neural Network Compression Framework) is the core differentiating feature of the OpenVINO toolchain.
Important: NNCF does not provide any built-in accuracy metrics — validation_fn is a required parameter, and you must define how “accuracy” is measured yourself. This function takes (model, validation_dataset) as arguments and returns a numeric value (higher values indicate better model performance):
import nncf
from nncf import DropType
# You must define your own accuracy measurement function
def validate_fn(model, validation_dataset):
    correct = 0
    total = 0
    for data in validation_dataset:
        # Run inference with the model, compute accuracy (or any metric you care about)
        output = model(data["input"])
        correct += (output == data["label"])
        total += 1
    return correct / total  # Higher is better

# Accuracy-aware quantization: constrain accuracy loss during the quantization process
quantized_model = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset=calibration_data,
    validation_dataset=validation_data,
    validation_fn=validate_fn,    # Required, no default
    max_drop=0.01,                # Maximum allowed accuracy drop (default: 0.01)
    drop_type=DropType.ABSOLUTE,  # ABSOLUTE: absolute drop; RELATIVE: proportional drop
)
The return value of validation_fn is entirely up to you — it can be task accuracy, F1 score, BLEU, or even the negation of perplexity (since perplexity is lower-is-better, but NNCF expects higher-is-better). drop_type controls how max_drop is interpreted: ABSOLUTE means absolute value drop (e.g., original 0.85, quantized must not fall below 0.84); RELATIVE means proportional drop (e.g., must not drop more than 1% of the original value).
Unlike pure post-training quantization (such as GPTQ), NNCF continuously calls your validation_fn during quantization to check accuracy — if any layer’s accuracy drops beyond the threshold after quantization, it automatically falls back that layer to higher precision (e.g., keeping it at FP16).
benchmark_app Performance Evaluation
Quantization affects not only accuracy but also inference speed. OpenVINO’s benchmark_app lets you measure both simultaneously:
# Measure FP16 latency
benchmark_app -m llama-3.1-8b-fp16.xml -d GPU -niter 100
# Measure INT8 latency
benchmark_app -m llama-3.1-8b-int8.xml -d GPU -niter 100
This allows you to build a complete accuracy-speed trade-off curve: How much faster is INT8 compared to FP16? How much accuracy is lost? Is this trade-off worth it for your scenario?
Deep Dive 3: llama.cpp Accuracy Evaluation
llama.cpp has built-in perplexity computation, making it the fastest way to evaluate GGUF quantization quality.
Built-in Perplexity Test
# Compute perplexity on WikiText-2
./llama-perplexity -m model-Q4_K_M.gguf \
-f wiki.test.raw \
--ctx-size 512 \
--chunks 100
Example output:
Final estimate: PPL = 6.3842 +/- 0.0312
GGUF Variant Comparison
A common workflow is comparing different GGUF quantization variants of the same model:
| Quantization Variant | Typical Size (8B) | Perplexity | Use Case |
|---|---|---|---|
| F16 | ~16 GB | 6.24 (baseline) | Accuracy reference |
| Q8_0 | ~8.5 GB | ~6.25 (+0.2%) | Nearly lossless, recommended first choice |
| Q6_K | ~6.6 GB | ~6.26 (+0.3%) | High quality, suitable for most scenarios |
| Q5_K_M | ~5.7 GB | ~6.30 (+1.0%) | Good balance point |
| Q4_K_M | ~4.9 GB | ~6.38 (+2.2%) | Most commonly used; check code tasks |
| Q3_K_M | ~3.9 GB | ~6.55 (+5.0%) | Use when memory-constrained; check specific tasks |
| Q2_K | ~3.2 GB | ~7.20 (+15.4%) | Only for extremely resource-constrained scenarios |
The data above are representative values for Llama 3.1 8B on WikiText-2, based on aggregated community GGUF evaluation results (TheBloke, bartowski, and other HuggingFace repositories). Specific values may vary depending on model version and evaluation conditions.
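A simple way to reproduce such a table for your own model is to loop llama-perplexity over the variants you have on disk (file names are placeholders for your own quantized files):
# Compare perplexity across GGUF variants of the same model
for q in Q8_0 Q6_K Q5_K_M Q4_K_M Q3_K_M Q2_K; do
  echo "=== ${q} ==="
  ./llama-perplexity -m model-${q}.gguf \
    -f wiki.test.raw \
    --ctx-size 512 \
    --chunks 100
done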
Impact of KV Cache Quantization
llama.cpp supports KV cache quantization (--cache-type-k q8_0 --cache-type-v q8_0), which introduces a small additional accuracy loss:
- KV cache INT8: Perplexity increase of approximately 0.1-0.3%, virtually no impact on most tasks
- KV cache INT4: Perplexity increase of approximately 0.5-1.5%, with more noticeable degradation in long-context scenarios
The accuracy impact of KV cache quantization has an additive relationship with weight quantization — if weights are already Q4_K_M, adding KV cache INT4 on top may push accuracy degradation beyond acceptable limits.
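To quantify this on your own setup, rerun the perplexity test with KV cache quantization enabled and compare it against the weight-only run above (depending on your llama.cpp build, a quantized V cache may additionally require flash attention to be enabled):
# Same perplexity test as before, but with an INT8-quantized KV cache on top of Q4_K_M weights
./llama-perplexity -m model-Q4_K_M.gguf \
  -f wiki.test.raw \
  --ctx-size 512 \
  --chunks 100 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0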
Perplexity vs. Task Accuracy
Perplexity is the easiest accuracy metric to obtain — it runs fast, requires no labeled data, and needs no complex evaluation framework. But it has a critical limitation: the relationship between perplexity and task-specific accuracy is not linear.
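As a reminder of what the number actually measures: perplexity is the exponentiated average negative log-likelihood of a held-out text,
PPL = exp( -(1/N) · Σ_i log p(x_i | x_1 … x_{i-1}) )
Every one of the N tokens contributes equally to this average, which is exactly why it can diverge from task accuracy, where only a handful of tokens decide success or failure.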
[Interactive chart: Perplexity vs. task accuracy for Llama 3.1 8B GGUF, showing how quantization level affects perplexity and benchmark scores.]
Key findings:
- At Q5_K_M and above: perplexity and task accuracy track each other closely, so perplexity is a reliable quality indicator
- Below Q4_K_M: task accuracy (especially code) falls far faster than perplexity would suggest; do not judge by perplexity alone
- Perplexity is a quick screening tool, not a replacement for task-specific evaluation
Why Do They Diverge?
Perplexity measures the model’s overall ability to predict the next token — it’s the average prediction loss across all tokens. But different tasks care about different tokens:
- Knowledge QA (MMLU): The answer is typically a single option letter (A/B/C/D), and only very few tokens are critical. The large volume of “background” tokens in perplexity dilutes the signal from key tokens
- Math reasoning (GSM8K): Requires generating a complete reasoning chain, where a single error in an intermediate step leads to an incorrect final answer. Perplexity doesn’t specially weight these critical steps
- Code generation (HumanEval): As mentioned earlier, syntactic precision requirements are extremely high. Perplexity doesn’t penalize “almost correct but syntactically wrong” code any more than other errors
Practical Recommendations
- Perplexity is good for quick screening: If perplexity increases by more than 3%, task accuracy has very likely degraded noticeably — you can rule it out immediately
- Perplexity is not suitable for fine-grained decisions: The perplexity difference between Q4_K_M and Q5_K_M might be only 1%, but on code tasks they could differ by 2-3 percentage points
- Critical scenarios require actual benchmarking: If your application involves code generation or math reasoning, perplexity isn’t enough — you need full evaluation on your target benchmarks
Toolchain Comparison Summary
The three toolchains each have their positioning; the choice depends on your scenario:
| Toolchain | Positioning | Best Suited For |
|---|---|---|
| lm-evaluation-harness | Standard benchmark evaluation: 60+ tasks, multiple inference backends | Full before/after comparisons on MMLU, GSM8K, HumanEval, etc. |
| OpenVINO (Optimum Intel + NNCF) | Accuracy-aware quantization plus joint accuracy and performance measurement | Deployment on Intel hardware |
| llama.cpp perplexity | Built-in, fast perplexity computation for GGUF models | Quick screening of GGUF quantization variants |
Best Practices for Combined Use
In real-world projects, these three tools are often not either/or, but used in phases:
- Quick screening phase: Use llama.cpp perplexity to quickly eliminate obviously unqualified quantization variants (immediately discard any with perplexity increase >5%)
- Detailed evaluation phase: Use lm-eval-harness for full evaluation on target benchmarks (select 2-3 candidate variants)
- Deployment validation phase: If targeting Intel platforms, use the OpenVINO toolchain for joint accuracy + performance evaluation
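A rough command-level sketch of this phased workflow, reusing the commands from earlier sections (paths and thresholds are illustrative):
# Phase 1 (quick screening): discard any variant whose perplexity increase exceeds ~5%
./llama-perplexity -m model-Q4_K_M.gguf -f wiki.test.raw --ctx-size 512 --chunks 100
./llama-perplexity -m model-Q3_K_M.gguf -f wiki.test.raw --ctx-size 512 --chunks 100
# Phase 2 (detailed evaluation): full runs for the surviving candidates on your target benchmarks
lm_eval --model hf \
  --model_args pretrained=<candidate-model-path> \
  --tasks mmlu,gsm8k,humaneval \
  --batch_size auto
# Phase 3 (Intel targets only): joint accuracy + latency validation with OpenVINO
benchmark_app -m candidate-int8.xml -d GPU -niter 100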
Transition: From “How Much Accuracy Is Lost” to “Which Model Should I Choose”
This article addressed the question of “how to measure accuracy after optimization.” But in real-world model selection, accuracy is only one dimension of the decision — you also need to consider inference speed, memory usage, deployment cost, and the signals from various leaderboards.
The next article, Leaderboards and Model Selection, will systematically analyze the design differences and applicable scenarios of major leaderboards (Open LLM Leaderboard, Chatbot Arena, LiveBench), and build a practical model selection decision framework.