
Inference-Time Quantization: KV Cache and Activation Quantization

Updated 2026-04-06

Introduction

Weight quantization addresses model storage and loading, but during actual inference, memory bottlenecks often come from dynamically generated intermediate data: KV cache and activation tensors. For long-context inference, the KV cache can consume tens of GB of GPU memory; for high-throughput scenarios, activation memory bandwidth becomes the critical limiting factor.

Inference-time quantization focuses on compressing these runtime data structures, complementing weight quantization to form a complete end-to-end quantization solution.


1. The KV Cache Memory Bottleneck

In autoregressive generation, each token’s key and value vectors are cached. The memory footprint formula:

$$\text{KV size} = 2 \times L \times n_h \times d_h \times S \times \text{bytes}$$

where $L$ is the number of layers, $n_h$ the number of KV heads, $d_h$ the head dimension, $S$ the sequence length, bytes the storage per element, and the leading 2 accounts for storing both keys and values.

Taking Llama 3 70B (80 layers, 64 attention heads, d_head = 128) with a 128K context as an example: if every head kept its own K/V (standard multi-head attention), the FP16 KV cache would be ≈ 320 GB, more than double the model weights (~140 GB). Llama 3 70B actually uses grouped-query attention with 8 KV heads, which still leaves ≈ 40 GB of KV cache per sequence.
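As a sanity check on these numbers, here is a minimal calculator for the formula above; the layer and head counts are assumptions taken from the published Llama 3 configuration:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    # KV size = 2 (K and V) x layers x heads x head_dim x seq_len x bytes/element
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

GIB = 1024 ** 3
SEQ = 128 * 1024  # 128K-token context

# Llama 3 70B: 80 layers, head_dim 128 (assumed from the public config)
print(kv_cache_bytes(80, 64, 128, SEQ, 2) / GIB)  # ~320 GiB if all 64 heads cached K/V (MHA)
print(kv_cache_bytes(80, 8, 128, SEQ, 2) / GIB)   # ~40 GiB with grouped-query attention (8 KV heads)
print(kv_cache_bytes(80, 8, 128, SEQ, 1) / GIB)   # ~20 GiB with an INT8 or FP8 KV cache
```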

[Figure] KV cache memory vs precision (FP16 / INT8 / FP8 / INT4) at a 2,048-token sequence, selectable for Llama 3 8B, Llama 3 70B, and Qwen3 72B; e.g. roughly 1.1 GB in FP16 down to 268 MB in INT4 for the smallest model. KV = 2 × layers × heads × d_head × seq_len × bytes.

2. KV Cache Quantization Methods

Per-Token vs Per-Channel

  • Per-token: computes an independent scale for each token’s K/V vector, adapting to its dynamic range
  • Per-channel: one scale per head-dimension channel, with lower overhead but limited accuracy (both granularities are sketched below)
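A minimal sketch of the two granularities on a toy key cache, using symmetric INT8 in NumPy; the random data is only a stand-in for K/V tensors captured from a real model:

```python
import numpy as np

def quantize_int8(x, axis):
    # Symmetric INT8 quantization: one scale per slice along `axis`
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)  # avoid division by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Toy key cache for one head: (seq_len, head_dim)
K = np.random.randn(4096, 128).astype(np.float32)

q_tok, s_tok = quantize_int8(K, axis=1)   # per-token: one scale per row (token)
q_ch,  s_ch  = quantize_int8(K, axis=0)   # per-channel: one scale per column (head dim)

for name, q, s in [("per-token", q_tok, s_tok), ("per-channel", q_ch, s_ch)]:
    err = np.abs(dequantize(q, s) - K).mean()
    print(f"{name}: {s.size} scales, mean abs error = {err:.4f}")
```

Which granularity wins depends on where the heavy tails sit; in real models the Key outliers cluster in specific channels, which is exactly what the methods below exploit.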

Key vs Value Asymmetry

Keys participate in the $QK^T$ dot product, where quantization errors are amplified; Values participate in the weighted sum $\text{softmax}(QK^T/\sqrt{d}) \cdot V$, where errors are averaged out. Therefore, Keys require higher precision.

[Figure] Original attention scores S = QK^T / √d computed with full-precision Q and K, serving as the baseline for the quantization comparison.
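One way to probe the Key/Value asymmetry described above is to quantize only K or only V and compare the resulting attention-output error. The sketch below simulates per-token INT4 on random tensors; on K/V dumped from a real model the gap is far more pronounced:

```python
import numpy as np

def int4_roundtrip(x):
    # Simulated symmetric INT4 quantization, one scale per token (row)
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    return np.clip(np.round(x / scale), -7, 7) * scale

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 128)) for _ in range(3))

ref = attention(Q, K, V)
err_k = np.abs(attention(Q, int4_roundtrip(K), V) - ref).mean()  # error enters before the softmax
err_v = np.abs(attention(Q, K, int4_roundtrip(V)) - ref).mean()  # error enters after the softmax
print(f"quantize K only: {err_k:.4f}   quantize V only: {err_v:.4f}")
```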

KIVI and KVQuant

  • KIVI (ICML 2024): per-channel quantization for Keys and per-token quantization for Values, down to 2 bits, exploiting the distributional differences between Keys and Values
  • KVQuant: Non-uniform quantization + dense-and-sparse hybrid scheme, retaining a small number of outliers in FP16

Reported results: with Llama 2 70B at a 100K-token context, the KV cache shrinks from 32 GB to 8 GB with a perplexity increase of less than 0.1.
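A minimal sketch of the dense-and-sparse idea behind KVQuant (not the paper’s actual implementation): pull the largest-magnitude entries into a small FP16 sparse buffer and quantize the remaining dense part to 4 bits.

```python
import numpy as np

def dense_and_sparse_quant(x, bits=4, outlier_frac=0.01):
    # Keep the top `outlier_frac` magnitudes in FP16, quantize the rest to `bits`
    flat = np.abs(x).ravel()
    k = max(1, int(outlier_frac * flat.size))
    threshold = np.partition(flat, -k)[-k]
    outlier_mask = np.abs(x) >= threshold

    dense = np.where(outlier_mask, 0.0, x)        # outliers removed before choosing the scale
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(dense).max() / qmax, 1e-8)
    q_dense = np.clip(np.round(dense / scale), -qmax, qmax)

    sparse_vals = x[outlier_mask].astype(np.float16)  # stored separately (indices omitted here)
    return q_dense, scale, outlier_mask, sparse_vals

def reconstruct(q_dense, scale, outlier_mask, sparse_vals):
    x = q_dense * scale
    x[outlier_mask] = sparse_vals.astype(np.float32)
    return x

K = np.random.standard_normal((4096, 128)).astype(np.float32)
K[:, 5] *= 30.0  # inject a heavy outlier channel
rec = reconstruct(*dense_and_sparse_quant(K, bits=4, outlier_frac=0.01))
print("mean abs error:", np.abs(rec - K).mean())
```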


3. Challenges of Activation Quantization

Activation tensors at each layer also need quantization during inference to reduce memory bandwidth. However, activations contain systematic outliers: a few channels have values 100-1000x larger than others.

Causes: Non-uniform distribution after LayerNorm; specific tokens trigger outliers in deeper layers; outlier channels are consistent across tokens.

[Figure] Activation outlier impact on quantization: an 8-token × 16-channel activation matrix with two outlier channels, quantized per-tensor vs per-channel. With per-tensor quantization the outliers inflate the scale and the normal values lose precision.

Solutions:

  • Mixed-precision: Keep outlier channels in FP16, quantize the rest to INT8
  • Per-channel scaling: Quantize each channel independently
  • Smoothing: SmoothQuant migrates quantization difficulty from activations into the weights via an offline per-channel rescaling (sketched below)
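A minimal sketch of the smoothing step under SmoothQuant’s formulation (α = 0.5 is the commonly cited default); the shapes and data here are synthetic:

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    # Migrate quantization difficulty: X @ W == (X / s) @ (s[:, None] * W)
    act_max = np.abs(X).max(axis=0)   # per-input-channel activation range
    w_max   = np.abs(W).max(axis=1)   # per-input-channel weight range
    s = (act_max ** alpha) / (w_max ** (1 - alpha) + 1e-8)
    s = np.maximum(s, 1e-8)
    return X / s, W * s[:, None]

X = np.random.standard_normal((512, 4096)).astype(np.float32)
X[:, 100] *= 300.0  # one systematic outlier channel
W = np.random.standard_normal((4096, 4096)).astype(np.float32) * 0.02

X_s, W_s = smooth(X, W)
print("output unchanged:", np.allclose(X @ W, X_s @ W_s, atol=1e-2))
print("activation range before/after:", np.abs(X).max(), np.abs(X_s).max())
```

After smoothing, the activation outliers are much milder and the layer input quantizes to INT8 with far less error, at the cost of a slightly harder-to-quantize weight matrix.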

4. FP8: Hardware-Friendly Low-Precision Floating Point

Format   Exponent   Mantissa   Dynamic Range   Use Case
E4M3     4 bits     3 bits     ±448            Forward activations (higher precision)
E5M2     5 bits     2 bits     ±57,344         Gradients (wider range)

H100 FP8 Tensor Core throughput is 2x that of FP16. FP8 vs INT8: FP8 preserves dynamic range better for non-uniform distributions, while INT8 offers higher precision for uniform distributions.
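The range/precision trade-off is easy to verify with PyTorch’s native FP8 dtypes (available since roughly PyTorch 2.1; treat this as a sketch under that assumption):

```python
import torch

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max = {info.max}, smallest normal = {info.tiny}")

# Round-trip error on activation-like data:
# E4M3 keeps more mantissa bits, E5M2 keeps more exponent range.
x = torch.randn(1 << 16) * 4.0
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    err = (x.to(dtype).to(torch.float32) - x).abs().mean()
    print(f"{dtype}: mean abs round-trip error = {err.item():.4f}")
```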


5. End-to-End Quantized Deployment

[Figure] End-to-end inference quantization stack: Input tokens → Embedding (FP16) → Attention QKV projection (INT4 weights, dequantized to FP16) → attention scores / softmax (FP16) → KV cache (INT8/FP8) → attention output (FP16) → FFN (INT4 weights, dequantized to FP16) → LayerNorm (FP32) → output logits (FP16). Key points:

  • Weights are stored in INT4 and dequantized to FP16 at inference time
  • An INT8/FP8 KV cache reduces long-context memory
  • Softmax and LayerNorm require high precision (FP16/FP32)

Typical Configurations:

  • Memory-first: W4A16 + KV INT8 (GPTQ/AWQ weight + per-token KV quant)
  • Speed-first: W8A8 + KV FP8 (SmoothQuant + H100 FP8 Tensor Core)
  • Balanced: W4A16 + KV INT4 (AWQ weight + KIVI KV cache)

Toolchain:

  • llama.cpp: --cache-type-k q4_0 enables KV cache quantization
  • vLLM: FP8 KV cache (--kv-cache-dtype fp8_e4m3); see the sketch after this list
  • TensorRT-LLM: End-to-end INT4 weight + FP8 activation
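As a concrete example of the vLLM option above, a sketch of enabling the FP8 KV cache from the Python API; the model name is a placeholder, and argument names can shift between vLLM versions:

```python
from vllm import LLM, SamplingParams

# Placeholder model; any HF model supported by vLLM works here.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_cache_dtype="fp8_e4m3",   # store K/V in FP8 instead of the model dtype
    max_model_len=32768,         # long contexts are where the savings matter
)

out = llm.generate(
    ["Explain KV cache quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```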

Summary

The core challenges of inference-time quantization are that the KV cache is the primary memory bottleneck for long contexts, and that activation outliers require mixed precision or FP8. Quantization is not a single decision; it is a joint optimization across three dimensions: weights, activations, and the KV cache.

Further Reading