Inference-Time Quantization: KV Cache and Activation Quantization

Introduction

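Weight quantization addresses model storage and loading, but during actual inference the memory bottleneck often comes from dynamically generated intermediate data: the KV cache and activation tensors. For long-context inference, the KV cache can occupy tens of GB of GPU memory; in high-throughput settings, the memory bandwidth consumed by activations becomes the key limiting factor.

Inference-time quantization focuses on compressing this runtime data. It complements weight quantization, and together the two form a complete end-to-end quantization scheme.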

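In autoregressive generation, the key and value vectors of every token are cached. The memory footprint is:

$$\text{KV size} = 2 \times L \times n_h \times d_h \times S \times \text{bytes}$$

Here the factor 2 accounts for keys and values, $L$ is the number of layers, $n_h$ the number of heads whose keys and values are cached, $d_h$ the head dimension, $S$ the sequence length in tokens, and bytes the storage size of one element.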

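Take Llama 3 70B dimensions (80 layers, 64 attention heads, d_head = 128) with a 128K context. If all 64 heads cached their own keys and values in FP16, the KV cache would be about 320 GiB per sequence, far more than the model weights (~140 GB). Llama 3 70B actually uses grouped-query attention with 8 KV heads, so its cache is about 40 GiB per sequence, which is still a large share of GPU memory and grows linearly with batch size.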
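The formula is easy to check numerically. The following is a minimal sketch; the function name `kv_cache_bytes` and its parameters are illustrative and not taken from any library, and sizes are reported in GiB (2^30 bytes).

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache keys and values for `batch_size` sequences."""
    # Factor 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


GIB = 1024 ** 3
seq_len = 128 * 1024  # 128K tokens

# 80 layers, head_dim 128, FP16 (2 bytes per element).
all_heads = kv_cache_bytes(80, 64, 128, seq_len)  # every head caches its own K/V
gqa_heads = kv_cache_bytes(80, 8, 128, seq_len)   # 8 shared KV heads (grouped-query attention)

print(f"64 KV heads: {all_heads / GIB:.0f} GiB")  # 320 GiB
print(f" 8 KV heads: {gqa_heads / GIB:.0f} GiB")  # 40 GiB
```

Setting `bytes_per_elem` to 1 or 0.5 gives the footprint for 8-bit or 4-bit storage, ignoring the small overhead of scales and zero-points.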

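Keys enter the $QK^T$ dot product, and the resulting scores pass through the exponential in softmax, so key quantization error is amplified. Values enter the weighted sum $\text{softmax} \cdot V$, where errors are averaged out. Keys therefore need higher precision than values.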


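KIVI (ICML 2024): asymmetric 2-bit quantization with keys quantized per-channel and values quantized per-token, exploiting the different distributions of keys and values
KVQuant: non-uniform quantization plus a dense-and-sparse hybrid scheme that keeps a small number of outliers in FP16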
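The difference between per-channel and per-token grouping can be illustrated with a small NumPy sketch. This is not the KIVI or KVQuant implementation: the helper names, the 4-bit width, and the synthetic key tensor with two large-valued channels are all assumptions chosen for illustration.

```python
import numpy as np


def quantize_asym(x, bits, axis):
    """Uniform asymmetric quantization; statistics are reduced over `axis`."""
    qmax = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    q = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo


def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo


def mse(a, b):
    return float(np.mean((a - b) ** 2))


rng = np.random.default_rng(0)
keys = rng.normal(size=(256, 64)).astype(np.float32)  # (tokens, channels)
keys[:, [3, 17]] += 30.0  # synthetic channels with consistently large values

# Per-channel: one (scale, zero-point) per channel, reduced over tokens (axis=0).
k_per_channel = dequantize(*quantize_asym(keys, bits=4, axis=0))
# Per-token: one (scale, zero-point) per token, reduced over channels (axis=1).
k_per_token = dequantize(*quantize_asym(keys, bits=4, axis=1))

print("key MSE, per-channel:", mse(keys, k_per_channel))
print("key MSE, per-token:  ", mse(keys, k_per_token))
```

With per-token grouping, the large channels stretch every token's range and the ordinary channels lose resolution; per-channel grouping isolates them. Values without such channel structure can be grouped per-token by calling `quantize_asym(values, bits, axis=1)`.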

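Experimental result: for Llama 2 70B with a 100K context, the KV cache shrinks from 32 GB to 8 GB (a 4x reduction, i.e. roughly 4 bits per element instead of 16), with a perplexity increase of less than 0.1.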

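The activation tensors of each layer also need to be quantized at inference time to reduce memory bandwidth. Activations, however, contain systematic outliers: a few channels have values 100 to 1000 times larger than the rest.

Causes: the distribution after LayerNorm is uneven; specific tokens trigger outliers in deep layers; the outlier channels are the same across tokens.

Solutions: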
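The effect of such outlier channels on a single per-tensor INT8 scale can be reproduced on synthetic data. The sketch below is illustrative only: the function names, the threshold of 6.0, and the choice to keep outlier channels in full precision are assumptions, not a specific library's behavior.

```python
import numpy as np


def int8_absmax_roundtrip(x):
    """Symmetric per-tensor INT8: quantize with one absmax scale, then dequantize."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale


def mixed_precision_roundtrip(x, threshold=6.0):
    """Keep channels containing |x| > threshold unquantized; INT8 for the rest."""
    outlier = (np.abs(x) > threshold).any(axis=0)  # boolean mask over channels
    out = x.copy()
    if (~outlier).any():
        out[:, ~outlier] = int8_absmax_roundtrip(x[:, ~outlier])
    return out, outlier


rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 512)).astype(np.float32)  # (tokens, channels)
acts[:, [7, 300]] *= 100.0  # the same two channels are outliers for every token

plain = int8_absmax_roundtrip(acts)
mixed, outlier_mask = mixed_precision_roundtrip(acts)

print("outlier channels:", np.flatnonzero(outlier_mask))
print("MSE, per-tensor INT8:", float(np.mean((acts - plain) ** 2)))
print("MSE, mixed precision:", float(np.mean((acts - mixed) ** 2)))
```

Because the outlier channels set the absmax scale, most ordinary values round to zero under plain per-tensor INT8. Separating those channels restores a fine-grained scale for everything else.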

Format	Exponent	Mantissa	Range (max finite value)	Use
E4M3	4 bit	3 bit	±448	Forward activations (higher precision)
E5M2	5 bit	2 bit	±57,344	Gradients (wider range)
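The two range limits in the table follow from the bit layouts. The sketch below derives them, assuming the common FP8 convention in which E5M2 reserves its top exponent code for infinities and NaN (IEEE style), while E4M3 reserves only the single all-ones pattern for NaN. The function name is hypothetical.

```python
def fp8_max_finite(exp_bits, man_bits, ieee_like):
    """Largest finite magnitude of a small float format."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (E5M2).
        max_exp_code = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        # Top exponent code still holds normal values; only the
        # all-ones mantissa at that exponent is NaN (E4M3).
        max_exp_code = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    return 2.0 ** (max_exp_code - bias) * (1 + max_mantissa / 2 ** man_bits)


print(fp8_max_finite(4, 3, ieee_like=False))  # 448.0   = 2**8  * 1.75
print(fp8_max_finite(5, 2, ieee_like=True))   # 57344.0 = 2**15 * 1.75
```

The extra exponent bit in E5M2 buys range at the cost of one mantissa bit, which is why it is the usual choice for gradients, while E4M3 is preferred for forward activations.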

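On H100, peak FP8 Tensor Core throughput is 2x that of FP16. FP8 vs INT8: FP8 preserves dynamic range and so suits non-uniform distributions better, while INT8 is more precise when values are uniformly distributed.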
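The FP8 versus INT8 trade-off can be checked with a small simulation. This is a rough sketch under stated assumptions: E4M3 rounding is emulated in NumPy rather than run on FP8 hardware, both formats use a single per-tensor absmax scale, and the two synthetic distributions are chosen for illustration.

```python
import numpy as np


def int8_roundtrip(x):
    """Symmetric per-tensor INT8 with an absmax scale."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale


def e4m3_roundtrip(x):
    """Emulated E4M3: map absmax to 448, then round to 3 mantissa bits."""
    scale = np.abs(x).max() / 448.0
    y = x / scale
    # Binade of each value, clamped at the minimum normal exponent (-6);
    # smaller values fall on the subnormal grid with spacing 2**-9.
    exp = np.floor(np.log2(np.maximum(np.abs(y), 2.0 ** -6)))
    step = 2.0 ** (exp - 3)  # 8 representable values per binade
    y = np.clip(np.round(y / step) * step, -448.0, 448.0)
    return y * scale


def mse(a, b):
    return float(np.mean((a - b) ** 2))


rng = np.random.default_rng(0)
uniform = rng.uniform(-1.0, 1.0, size=100_000)
heavy = rng.normal(size=100_000)
heavy[::1000] *= 100.0  # rare large values stretch the range

for name, data in [("uniform", uniform), ("heavy-tailed", heavy)]:
    print(f"{name:>12}: INT8 MSE = {mse(data, int8_roundtrip(data)):.2e}, "
          f"E4M3 MSE = {mse(data, e4m3_roundtrip(data)):.2e}")
```

INT8 spaces its levels evenly, which is ideal when values fill the range evenly. E4M3 spends its levels logarithmically, so small values stay resolvable even when a few large ones set the scale.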

Typical configuration:

Toolchain:

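The core challenges of inference-time quantization: the KV cache is the main memory bottleneck for long contexts, and activation outliers call for mixed precision or FP8. Quantization is not a single decision but a joint optimization over three dimensions: weights, activations, and the KV cache.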