
Quantization-Aware Training (QAT)

Updated 2026-04-06

The Core Problem: Why Is Post-Training Quantization Not Enough?

In Quantization Fundamentals, we covered the basic principle of PTQ (Post-Training Quantization): applying the quantization mapping directly to the weights after training is complete. This approach is simple and efficient, but it suffers catastrophic accuracy degradation in ultra-low-bit scenarios (≤4 bit):

  • Quantization noise accumulation: Quantization error at each layer accumulates and amplifies through deep networks
  • Activation distribution mismatch: The model is trained assuming FP32 precision; after quantization, activation distributions shift
  • Vanishing gradients: The round() operation has a derivative of zero almost everywhere, so accuracy lost to quantization cannot be recovered by ordinary gradient-based fine-tuning (see the snippet below)
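
A quick PyTorch check makes the gradient problem concrete (the tensor values below are arbitrary, chosen only for illustration):

import torch

x = torch.tensor([0.7, 1.3, 2.4], requires_grad=True)
y = torch.round(x).sum()   # a hard rounding step, as in quantization
y.backward()
print(x.grad)              # tensor([0., 0., 0.]) -- round() blocks all gradient signal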

Quantization-Aware Training (QAT) addresses this by simulating quantization noise during training, allowing the model to learn to adapt to low-precision representations and maintain accuracy even at ultra-low bit widths.

QAT Core Mechanism: Fake Quantization + Straight-Through Estimator

The QAT training pipeline introduces two key techniques:

  1. Fake Quantization: Insert quantize-dequantize operations during the forward pass to simulate the precision loss at inference time
  2. Straight-Through Estimator (STE): During backpropagation, let gradients “pass through” the non-differentiable round() operation
[Interactive diagram: the QAT pipeline. Step 1, "FP32 Master Weight": W = [0.347, -0.892, ...] is initialized in full 32-bit precision at the start of training and kept in FP32 throughout.]

Mathematical Formulation

Standard quantization function (non-differentiable):

$$Q(x) = \text{round}\left(\frac{x}{s}\right) \cdot s$$

STE approximate gradient (Bengio et al. 2013):

$$\frac{\partial Q(x)}{\partial x} \approx \begin{cases} 1 & \text{if } |x| < \text{clip\_range} \\ 0 & \text{otherwise} \end{cases}$$

During training, an FP32 master weight $W$ is maintained. The forward pass uses $Q(W)$, while backpropagation updates $W$ directly:

$$W_{t+1} = W_t - \eta \cdot \frac{\partial L}{\partial Q(W)} \quad \left(\text{gradient approximated as } \frac{\partial L}{\partial W}\right)$$
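
This forward/backward split is usually implemented with the "detach trick". Below is a minimal PyTorch sketch, assuming a symmetric per-tensor scale; the function name fake_quantize and the scale rule are illustrative, not taken from any specific library:

import torch

def fake_quantize(w, bits=4):
    """Quantize-dequantize in the forward pass; identity (STE) gradient in the backward pass."""
    qmax = 2 ** (bits - 1) - 1
    s = w.abs().max() / qmax                     # symmetric per-tensor scale (illustrative choice)
    w_q = torch.clamp(torch.round(w / s), -qmax - 1, qmax) * s
    # STE: the returned value equals w_q, but autograd sees "w + constant", so dL/dw = dL/dw_q
    return w + (w_q - w).detach()

Because the loss is computed on the quantized value while the gradient flows unchanged to the FP32 master weight, the model gradually settles into weights that remain accurate after rounding.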

LoRA-QAT vs QLoRA: Two Approaches to Quantization + Fine-Tuning

In the LLM era, combining QAT with LoRA (Low-Rank Adaptation) has produced two distinct technical approaches:

Approach 1: QLoRA (Quantize First, Then Fine-Tune)

QLoRA (Dettmers et al. 2023) applies PTQ to quantize the pretrained model, then fine-tunes with FP16 LoRA adapters:

# QLoRA pipeline (simplified pseudocode)
base_model = quantize_to_4bit(pretrained_model)  # PTQ-quantize the frozen base model
lora_adapter = LoRA(rank=16, dtype=fp16)         # trainable FP16 low-rank adapter

for batch in dataloader:
    out = base_model(batch) + lora_adapter(batch)  # frozen 4-bit weights + FP16 adapter path
    loss = loss_fn(out, batch.labels)
    loss.backward()                                # gradients reach only the LoRA parameters
    optimizer.step()

Advantages: Memory efficient (4-bit base + small adapter), no need to retrain the base model
Disadvantages: Base weights are already quantized to 4-bit and cannot adapt to quantization noise through training

Approach 2: LoRA-QAT (Quantization-Aware During Training)

LQ-LoRA (Guo et al. 2023) applies Fake Quantization to the base weights during LoRA fine-tuning:

# LoRA-QAT pipeline (simplified pseudocode)
base_model = pretrained_model                       # keep FP32 master weights
lora_adapter = LoRA(rank=16, dtype=fp16)

for batch in dataloader:
    W_q = fake_quantize(base_model.weight, bits=4)  # fake quantization (quantize-dequantize)
    out = matmul(W_q, batch) + lora_adapter(batch)
    loss = loss_fn(out, batch.labels)
    loss.backward()                                 # STE routes gradients to base weights and adapter
    optimizer.step()

Advantages: Base weights become aware of quantization noise during training, yielding higher final accuracy
Disadvantages: Requires inserting Fake Quant nodes during fine-tuning, slightly higher training cost

Extreme Case: BitNet’s 1.58-bit QAT

BitNet (Wang et al. 2023) and its 1.58-bit follow-up use QAT to quantize Transformer weights to ternary values {-1, 0, +1}, achieving extreme compression:

Computational Advantages of Ternary Quantization

[Interactive diagram: FP16 vs BitNet ternary matrix multiplication on a 2×2 example (A × B with B = [[0.6, 0.3], [-0.4, 1.1]]). Each FP16 output element needs multiply-accumulates (e.g., 0.8 × 0.6 + (-1.2) × (-0.4) = 0.96), while the ternary version with A ∈ {-1, 0, +1} needs no multiplications at all: +1 weights add the activation, -1 weights subtract it, and 0 weights are skipped. Key advantage: ternary weights require only adders and a sign bit, no floating-point multipliers.]
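
The diagram's point can be reproduced in a few lines. Here is a toy NumPy sketch of the multiplication-free inner loop (illustrative only; production kernels pack ternary weights into 2-bit codes and vectorize the accumulation):

import numpy as np

def ternary_matmul(W_t, X):
    """Matmul with ternary weights {-1, 0, +1}: additions, subtractions, and skips only."""
    out = np.zeros((W_t.shape[0], X.shape[1]), dtype=X.dtype)
    for i in range(W_t.shape[0]):
        for k in range(W_t.shape[1]):
            if W_t[i, k] == 1:
                out[i] += X[k]      # +1 weight: accumulate the activation row
            elif W_t[i, k] == -1:
                out[i] -= X[k]      # -1 weight: subtract it
            # weight == 0: skip entirely, no work at all
    return out

W_t = np.array([[1, -1], [1, 1]], dtype=np.float32)
X = np.array([[0.6, 0.3], [-0.4, 1.1]], dtype=np.float32)
print(ternary_matmul(W_t, X))       # matches W_t @ X without a single multiplication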

The core of BitNet is absmean quantization:

$$W_{\text{ternary}} = \text{sign}(W) \cdot \mathbb{1}_{|W| > \alpha \cdot \text{mean}(|W|)}$$

where $\alpha \approx 0.5$ is a threshold hyperparameter controlling sparsity.
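
In code, this rule is a one-liner. A minimal PyTorch sketch (the helper name ternary_quantize is illustrative, not from BitNet's released code):

import torch

def ternary_quantize(W, alpha=0.5):
    """Absmean ternarization: sign(W) where |W| > alpha * mean(|W|), zero elsewhere."""
    threshold = alpha * W.abs().mean()
    return torch.sign(W) * (W.abs() > threshold).to(W.dtype)

A larger alpha zeroes out more weights (higher sparsity, more skipped additions at inference); a smaller alpha keeps more ±1 entries.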

Training Strategy

  1. Train ternary weights directly (not PTQ): Initialize as FP32, quantize to {-1, 0, +1} before each forward pass (a combined sketch follows this list)
  2. Use STE gradients: $\frac{\partial\, \text{sign}(x)}{\partial x} \approx \mathbb{1}_{|x| \leq 1}$
  3. Keep activations at 8-bit: 1.58-bit weights + 8-bit activations balance accuracy and efficiency
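
Putting steps 1 and 2 together, one QAT step might look like the sketch below (assumptions: a toy regression-style loss, no activation quantization, and the simple identity STE rather than the clipped variant from step 2):

import torch

def ternary_quantize(W, alpha=0.5):
    threshold = alpha * W.abs().mean()
    return torch.sign(W) * (W.abs() > threshold).to(W.dtype)

W = torch.randn(256, 256, requires_grad=True)          # FP32 master weight
optimizer = torch.optim.AdamW([W], lr=1e-3)

for _ in range(10):                                     # dummy training loop
    x = torch.randn(32, 256)
    W_t = W + (ternary_quantize(W) - W).detach()        # ternary forward, identity (STE) backward
    loss = (x @ W_t.T).pow(2).mean()                    # placeholder loss, illustration only
    optimizer.zero_grad()
    loss.backward()                                     # gradient updates the FP32 master weight
    optimizer.step()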

Experimental results (LLaMA architecture):

  • 3B model: BitNet vs FP16 perplexity gap <3%
  • Memory & energy: Inference energy reduced by 71%, throughput improved 2.7×

QAT vs PTQ Accuracy Boundary

[Chart: "QAT vs PTQ Accuracy Boundary: Perplexity Growth Curve", perplexity growth (%) versus bit width for PTQ and QAT. The two curves cross at roughly 3 bits; in the ultra-low-bit region (≤3 bit), PTQ accuracy collapses rapidly while QAT degrades gracefully, so quantization-aware training is required.]

Experimental data from Jacob et al. (2018), comparing PTQ and QAT on ResNet-50/ImageNet:

| Bit Width | PTQ Top-1 Acc | QAT Top-1 Acc | Gap |
|-----------|---------------|---------------|-----|
| 8-bit | 76.1% | 76.5% | +0.4% |
| 4-bit | 68.3% | 74.8% | +6.5% |
| 3-bit | 42.1% | 71.2% | +29.1% |
| 2-bit | 12.5% | 65.4% | +52.9% |

Key Takeaways:

  • ≥4 bit: PTQ is sufficient; QAT provides diminishing returns
  • ≤3 bit: PTQ accuracy collapses; QAT becomes essential
  • Engineering trade-off: QAT requires retraining, but is the only viable approach at ultra-low bit widths

Further Reading