Quantization-Aware Training (QAT)
Updated 2026-04-06
The Core Problem: Why Is Post-Training Quantization Not Enough?
In Quantization Fundamentals, we learned the basic principle of PTQ (Post-Training Quantization): applying a quantization mapping directly to the weights after training is complete. This approach is simple and efficient, but it suffers from severe accuracy degradation at ultra-low bit widths (≤4 bit):
- Quantization noise accumulation: Quantization error at each layer accumulates and amplifies through deep networks
- Activation distribution mismatch: The model is trained assuming FP32 precision; after quantization, activation distributions shift
- Vanishing gradients: The round() operation has a derivative of zero almost everywhere, so directly fine-tuning the quantized weights yields no gradient signal to recover accuracy (a quick check is sketched below)
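The third point can be verified directly with autograd; a minimal PyTorch sketch (illustrative only):

```python
import torch

# round() is piecewise constant, so its derivative is 0 almost everywhere:
# no gradient signal reaches the underlying weights.
w = torch.tensor([0.3, 1.7, -2.4], requires_grad=True)
torch.round(w).sum().backward()
print(w.grad)  # tensor([0., 0., 0.])
```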
Quantization-Aware Training (QAT) addresses this by simulating quantization noise during training, allowing the model to learn to adapt to low-precision representations and maintain accuracy even at ultra-low bit widths.
QAT Core Mechanism: Fake Quantization + Straight-Through Estimator
The QAT training pipeline introduces two key techniques:
- Fake Quantization: Insert quantize-dequantize operations during the forward pass to simulate the precision loss at inference time
- Straight-Through Estimator (STE): During backpropagation, let gradients “pass through” the non-differentiable round() operation
Mathematical Formulation
Standard quantization function (non-differentiable):

$$\hat{w} = s \cdot \mathrm{round}\!\left(\frac{w}{s}\right), \qquad s = \frac{\max|w|}{2^{b-1} - 1}$$

STE approximate gradient (Bengio et al. 2013): treat the rounding as the identity in the backward pass,

$$\frac{\partial \hat{w}}{\partial w} \approx 1 \quad\Longrightarrow\quad \frac{\partial \mathcal{L}}{\partial w} \approx \frac{\partial \mathcal{L}}{\partial \hat{w}}$$

During training, an FP32 master weight $w$ is maintained. The forward pass uses the fake-quantized weight $\hat{w}$, while backpropagation gradients directly update $w$:

$$w \leftarrow w - \eta \, \frac{\partial \mathcal{L}}{\partial \hat{w}}$$
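Fake Quantization and the STE fit together in a few lines. A minimal PyTorch sketch follows; the `fake_quantize` helper is illustrative (not a library API) and matches the name used in the LoRA-QAT example later. The `w + (w_q - w).detach()` trick makes the forward pass see the quantized value while the backward pass sees an identity function:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric uniform fake quantization with an STE backward pass."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for signed 4-bit
    scale = w.abs().max() / qmax               # per-tensor scale s
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # STE: forward returns the quantized value, backward treats the op as identity
    return w + (w_q - w).detach()

# The FP32 master weight keeps receiving gradients as if no rounding had occurred.
W = torch.randn(16, 16, requires_grad=True)
x = torch.randn(8, 16)
out = x @ fake_quantize(W, bits=4).T
out.sum().backward()
print(W.grad.abs().mean())  # non-zero: gradients "pass through" round()
```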
LoRA-QAT vs QLoRA: Two Approaches to Quantization + Fine-Tuning
In the LLM era, combining QAT with LoRA (Low-Rank Adaptation) has produced two distinct technical approaches:
Approach 1: QLoRA (Quantize First, Then Fine-Tune)
QLoRA (Dettmers et al. 2023) applies PTQ to quantize the pretrained model, then fine-tunes with FP16 LoRA adapters:
# QLoRA pipeline (simplified example)
base_model = quantize_to_4bit(pretrained_model)    # PTQ-quantize the frozen base model
lora_adapter = LoRA(rank=16, dtype=fp16)           # FP16 low-rank adapter
for batch in dataloader:
    out = base_model(batch) + lora_adapter(batch)  # quantized weights + FP16 adapter
    loss = loss_fn(out, batch.labels)              # compute the task loss
    loss.backward()                                # only LoRA parameters are updated
Advantages: Memory efficient (4-bit base + small adapter), no need to retrain the base model
Disadvantages: Base weights are already quantized to 4-bit and cannot adapt to quantization noise through training
Approach 2: LoRA-QAT (Quantization-Aware During Training)
LQ-LoRA (Guo et al. 2023) applies Fake Quantization to base weights during LoRA fine-tuning:
# LoRA-QAT pipeline (simplified example)
base_model = pretrained_model                          # keep base weights in FP32
lora_adapter = LoRA(rank=16, dtype=fp16)
for batch in dataloader:
    W_q = fake_quantize(base_model.weight, bits=4)     # fake quantization of base weights
    out = matmul(W_q, batch) + lora_adapter(batch)
    loss = loss_fn(out, batch.labels)                  # compute the task loss
    loss.backward()                                    # STE updates base weights and adapter
Advantages: Base weights become aware of quantization noise during training, yielding higher final accuracy
Disadvantages: Requires inserting Fake Quant nodes during fine-tuning, slightly higher training cost
Extreme Case: BitNet’s 1.58-bit QAT
BitNet b1.58 (Ma et al. 2024) uses QAT to quantize Transformer weights to the ternary values {-1, 0, +1} (≈1.58 bits per weight, since log₂3 ≈ 1.58), achieving extreme compression:
Computational Advantages of Ternary Quantization
The core of BitNet b1.58 is absmean quantization:

$$\widetilde{W} = \mathrm{RoundClip}\!\left(\frac{W}{\gamma + \epsilon},\, -1,\, 1\right), \qquad \gamma = \frac{1}{nm}\sum_{i,j} |W_{ij}|$$

where $\gamma$ is the mean absolute value of the weight matrix (the per-matrix scaling factor) and $\epsilon$ is a small constant for numerical stability. Entries with magnitude below roughly $\gamma/2$ round to zero, which introduces sparsity; because every remaining entry is $-1$ or $+1$, matrix multiplication reduces to additions and subtractions with no floating-point multiplies.
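A minimal PyTorch sketch of absmean ternarization (the `absmean_quantize` helper name is illustrative):

```python
import torch

def absmean_quantize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Ternarize weights to {-1, 0, +1} using the mean absolute value as scale."""
    gamma = w.abs().mean()                                   # per-matrix scale γ
    w_ternary = torch.clamp(torch.round(w / (gamma + eps)), -1, 1)
    return w_ternary * gamma                                 # dequantized view used during training
```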
Training Strategy
- Train ternary weights directly (not PTQ): Initialize as FP32, quantize to {-1, 0, +1} before each forward pass
- Use STE gradients: treat the quantizer as the identity during backpropagation, so the latent FP32 weights keep receiving updates (see the training-step sketch after this list)
- Keep activations at 8-bit: 1.58-bit weights + 8-bit activations balance accuracy and efficiency
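Putting these points together, a minimal training-step sketch (PyTorch; reuses the `absmean_quantize` helper above; the weight shape, learning rate, and synthetic data are stand-ins):

```python
import torch

W = torch.randn(256, 10, requires_grad=True)        # latent FP32 weight
optimizer = torch.optim.AdamW([W], lr=1e-3)

for step in range(100):
    x = torch.randn(32, 256)                         # stand-in activations
    target = torch.randint(0, 10, (32,))             # stand-in labels
    W_t = absmean_quantize(W)                        # ternary {-1, 0, +1} · γ
    W_t = W + (W_t - W).detach()                     # STE: backward treats the quantizer as identity
    logits = x @ W_t
    loss = torch.nn.functional.cross_entropy(logits, target)
    loss.backward()                                  # gradients update the latent FP32 W
    optimizer.step()
    optimizer.zero_grad()
```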
Experimental results (LLaMA architecture):
- 3B model: BitNet vs FP16 perplexity gap <3%
- Memory & energy: Inference energy reduced by 71%, throughput improved 2.7×
QAT vs PTQ Accuracy Boundary
Experimental data from Jacob et al. (2018), comparing PTQ and QAT on ResNet-50 / ImageNet:
| Bit Width | PTQ Top-1 Acc | QAT Top-1 Acc | Gap (QAT − PTQ) |
|---|---|---|---|
| 8-bit | 76.1% | 76.5% | +0.4% |
| 4-bit | 68.3% | 74.8% | +6.5% |
| 3-bit | 42.1% | 71.2% | +29.1% |
| 2-bit | 12.5% | 65.4% | +52.9% |
Key Takeaways:
- 8-bit: PTQ is essentially lossless; QAT provides diminishing returns
- 4-bit: the PTQ gap becomes significant (+6.5 points above); QAT recovers most of the loss
- ≤3 bit: PTQ accuracy collapses; QAT becomes essential
- Engineering trade-off: QAT requires retraining, but is the only viable approach at ultra-low bit widths
Further Reading
- Quantization Fundamentals — Basic principles of PTQ
- INT8 Training Techniques — Full INT8 quantization during training
- Mixed-Precision Training — Gradient accumulation strategies combining FP16 and FP32