
PTQ Weight Quantization: From GPTQ to AWQ


Updated 2026-04-06

Post-Training Quantization (PTQ) quantizes model weights directly after training is complete, without any retraining. Compared with QAT, its training cost is zero: a small amount of calibration data and a few minutes of processing suffice, which makes PTQ the preferred quantization approach for LLM deployment. This article analyzes the algorithmic principles and use cases of four mainstream PTQ methods.

The Limitations of Round-to-Nearest

The simplest quantization approach is Round-to-Nearest (RTN): directly rounding each weight to the nearest quantized value. RTN performs reasonably well at 8-bit — LLaMA2-7B’s perplexity only increases from FP16’s 5.47 to approximately 5.52. But at 4-bit, perplexity spikes to above 7.2.
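To make the baseline concrete, here is a minimal sketch of per-group RTN quantization in PyTorch (illustrative only, not any specific library's implementation; the asymmetric scheme and the group size of 128 are this example's assumptions):

```python
import torch

def rtn_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 128):
    """Round-to-nearest with per-group asymmetric scaling.
    w: (out_features, in_features); in_features must be divisible by group_size."""
    out_features, in_features = w.shape
    w_g = w.reshape(out_features, -1, group_size)   # split each row into groups
    w_min = w_g.amin(dim=-1, keepdim=True)
    w_max = w_g.amax(dim=-1, keepdim=True)
    qmax = 2 ** n_bits - 1
    scale = (w_max - w_min).clamp(min=1e-8) / qmax  # one scale per group
    zero = (-w_min / scale).round()
    q = (w_g / scale + zero).round().clamp(0, qmax) # the round-to-nearest step
    # return the dequantized weights so the error is easy to inspect
    return ((q - zero) * scale).reshape(out_features, in_features)
```

Each weight is rounded in isolation; nothing in this procedure knows how the resulting error interacts with the rest of the layer, which is exactly the gap the methods below address.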

The core problem: RTN processes each weight independently, ignoring correlations between weights. When quantizing one weight introduces an error, the impact of that error on model output depends on the relationship between that weight and other weights (the Hessian matrix). RTN completely ignores this structural information.

GPTQ: Hessian-Guided Error Compensation

GPTQ (Frantar et al., 2022) is based on a key insight: the error from quantizing one weight can be compensated by adjusting weights that have not yet been quantized.

The core of the algorithm is an efficient approximation of Optimal Brain Quantization (OBQ): it processes the weight matrix column by column, using the inverse Hessian $H^{-1}$ to optimally distribute the current column's quantization error across the not-yet-quantized columns. The update formula:

$$\delta_F = -\frac{w_q - \text{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q}$$

where $w_q$ is the weight being quantized, $\text{quant}(w_q)$ its quantized value, and $F$ the set of not-yet-quantized weights.

Key optimization: Lazy Batch Updates process every 128 columns in bulk, reducing memory access. Calibration requires forward propagation of 128-512 text samples to estimate the Hessian matrix.
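The loop below sketches this procedure in simplified form (a precomputed Hessian is assumed; GPTQ's lazy batching, Cholesky-based inversion, and grouping are omitted, so treat it as a readable approximation rather than the reference implementation):

```python
import torch

def gptq_quantize(W: torch.Tensor, H: torch.Tensor, quant, damp: float = 0.01):
    """Simplified GPTQ: quantize column q, then spread the scaled error
    over the not-yet-quantized columns via the inverse Hessian.
    W: (rows, cols) weights; H: (cols, cols) Hessian from calibration;
    quant: element-wise quantize-dequantize function (e.g. 4-bit RTN)."""
    W = W.clone()
    d = H.shape[0]
    Hinv = torch.linalg.inv(H + damp * H.diag().mean() * torch.eye(d))  # damped
    for q in range(W.shape[1]):
        w = W[:, q]
        w_q = quant(w)                        # quantize the current column
        err = (w - w_q) / Hinv[q, q]          # error scaled by [H^-1]_qq
        # compensation step: shift remaining columns along row q of H^-1
        W[:, q + 1:] -= err.unsqueeze(1) * Hinv[q, q + 1:].unsqueeze(0)
        W[:, q] = w_q
    return W
```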

[Figure: GPTQ quantizes the weight matrix W (FP16) column by column; the inverse Hessian $H^{-1}$ guides the quantization order and distributes each column's error to the columns that follow.]

Practical results: LLaMA2-7B at INT4-g128 achieves a perplexity of only 5.67, close to FP16’s 5.47. Quantizing a 7B model takes approximately 10 minutes on a single GPU.

AWQ: Activation-Aware Weight Quantization

AWQ (Lin et al., 2023) approaches the problem from a different angle: not all weights are equally important. Empirically, roughly 1% of the channels in LLM activations are salient, with activation magnitudes far larger than those of the other channels.

AWQ's solution is per-channel scaling: multiply the weights of salient channels by a scale factor $s > 1$, making their effective quantization granularity finer. The corresponding activations are divided by $s$ to maintain mathematical equivalence:

$$Y = XW = (X \cdot \text{diag}(s)^{-1}) \cdot (\text{diag}(s) \cdot W)$$
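A quick numerical check of this identity (toy shapes; the salient-channel index and the value of $s$ are made up for illustration):

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 16)     # activations: (tokens, in_features)
W = torch.randn(16, 32)    # weights: (in_features, out_features)

s = torch.ones(16)
s[2] = 4.0                 # pretend channel 2 is salient and gets a larger scale

Y_ref = X @ W
Y_awq = (X / s) @ (W * s.unsqueeze(1))   # X·diag(s)^-1 times diag(s)·W
print(torch.allclose(Y_ref, Y_awq, atol=1e-5))  # True: the output is unchanged
```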

[Interactive figure: an activation matrix with one salient channel whose values (~5) dwarf the others (~0.1). Scaling that channel's weights up before quantization cuts its quantization error from 0.47 to 0.05 while $Y = XW$ stays mathematically equivalent; AWQ is "preventive" (reshape the distribution before quantizing) where GPTQ is "remedial" (compensate the error afterwards).]

AWQ’s key advantages: no backpropagation required (quantization is fast at ~5 min), outputs standard INT4 format (compatible with vLLM/TRT-LLM), and excellent quality (LLaMA2-7B INT4-g128 perplexity ~5.60).

The fundamental difference from GPTQ: GPTQ modifies unquantized weights during the quantization process to compensate for errors (“after-the-fact remedy”), while AWQ adjusts weight distributions before quantization to make them easier to quantize (“preventive preparation”). The two can be combined.

SmoothQuant: Smoothing Activation Outliers

The previous methods focus on weight-only quantization. SmoothQuant (Xiao et al., 2022) addresses W8A8 — simultaneously quantizing both weights and activations to leverage INT8 Tensor Cores for acceleration.

The difficulty of activation quantization lies in outlier channels: some channels have a value range 100x or more larger than other channels, making per-tensor INT8 quantization unable to preserve the precision of both large and small values simultaneously.
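A tiny numeric illustration of the failure mode (hypothetical per-channel maxima; symmetric per-tensor INT8 assumed):

```python
import torch

# One outlier channel forces a huge per-tensor scale, so the small
# channels are left with only one or two quantization levels.
x = torch.tensor([0.7, 0.6, 52.1, 0.9])    # per-channel activation maxima
scale = x.abs().max() / 127                 # symmetric per-tensor INT8 scale
x_dq = (x / scale).round().clamp(-127, 127) * scale
print(x_dq)  # tensor([0.8205, 0.4102, 52.1000, 0.8205]); small channels distorted
```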

SmoothQuant’s core idea — transfer quantization difficulty from activations to weights:

$$Y = X \cdot W = (X \cdot \text{diag}(s)^{-1}) \cdot (\text{diag}(s) \cdot W) = X' \cdot W'$$

The smoothing factor is $s_j = \max(|X_j|)^\alpha / \max(|W_j|)^{1-\alpha}$, where $\alpha = 0.5$ is the empirically optimal migration strength. After the transformation, the per-channel dynamic ranges of $X'$ become similar, and while the values of $W'$ grow slightly, they remain manageable; both can then be quantized with per-tensor INT8.
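A minimal sketch of the transformation under these definitions (offline, the scales are folded into the preceding LayerNorm or linear layer; here they are applied directly for clarity, and the function name is this example's own):

```python
import torch

def smooth(X: torch.Tensor, W: torch.Tensor, alpha: float = 0.5):
    """Migrate quantization difficulty from activations to weights.
    X: (tokens, in_features); W: (in_features, out_features)."""
    s = X.abs().amax(dim=0).pow(alpha) / W.abs().amax(dim=1).pow(1 - alpha)
    s = s.clamp(min=1e-5)                  # guard against dead channels
    X_s = X / s                            # X' = X · diag(s)^-1
    W_s = W * s.unsqueeze(1)               # W' = diag(s) · W
    return X_s, W_s                        # X_s @ W_s equals X @ W
```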

[Figure: an FP16 activation matrix whose channel 2 has a dynamic range (~50) roughly 50× larger than the other channels (~1); with per-tensor INT8, the scale is dominated by this outlier and precision on the normal channels collapses.]

SmoothQuant is most valuable in data center scenarios: W8A8 directly utilizes GPU INT8 Tensor Cores (A100/H100), achieving 1.5-2x throughput improvement compared to FP16.

Method Comparison and Selection Guide

| Method | Calib Data | Quant Time | Bit Width | PPL (LLaMA2-7B) | Framework | Use Case |
|--------|------------|------------|-----------|-----------------|-----------|----------|
| RTN | 0 | <1 min | W4/W8 | ~7.2 | Universal | Quick baseline |
| GPTQ | 128 | ~10 min | W4/W3/W2 | ~5.67 | ExLlama/vLLM | 4-bit deploy |
| AWQ | 128 | ~5 min | W4 | ~5.60 | vLLM/TRT-LLM | Efficient 4-bit deploy |
| SmoothQuant | 512 | ~15 min | W8A8 | ~5.73 | TRT-LLM | High throughput |

Selection recommendations:

  • Consumer GPU deployment (RTX 3090/4090): Choose AWQ or GPTQ with INT4-g128, use vLLM or ExLlamaV2 (see the serving sketch after this list)
  • Data center high throughput (A100/H100): Choose SmoothQuant W8A8, use TensorRT-LLM
  • Quick evaluation: Start with RTN INT8 for validation, then switch to more refined methods
  • Ultra-low bit (2-3 bit): GPTQ supports it but with significant quality loss — evaluate whether QAT is needed
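For example, serving an AWQ-quantized checkpoint with vLLM takes only a couple of lines (the model id is a placeholder; flag names may vary across vLLM versions, so check its documentation):

```python
from vllm import LLM, SamplingParams

# Placeholder model id; any AWQ INT4 checkpoint works the same way.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
outputs = llm.generate(["Explain PTQ in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```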

Summary

PTQ weight quantization has developed a mature toolchain. GPTQ achieves high-quality column-by-column quantization through Hessian error compensation, AWQ uses activation-aware per-channel scaling to protect critical channels, and SmoothQuant makes W8A8 possible through mathematical transformations. Each has its own focus, and actual deployment should be based on hardware conditions and accuracy requirements.