MQA and GQA
Updated 2026-04-06
Introduction: Why MHA’s KV Cache Is the Bottleneck
In the previous article, we learned how Multi-Head Attention (MHA) works: each head has its own independent $W_i^Q$, $W_i^K$, $W_i^V$ projection matrices, computing attention in parallel across different subspaces.
MHA performs excellently during training — all tokens can be processed in parallel. But during inference (autoregressive generation), a serious efficiency bottleneck emerges: the KV Cache.
What Is KV Cache
In autoregressive generation, each new token needs to compute attention with all preceding tokens. To avoid redundant computation, we cache the Key and Value vectors of previous tokens — this is the KV Cache.
For standard MHA, the KV Cache size per layer is:

$$\text{KV Cache per layer} = 2 \times h \times s \times d_k$$

Where $h$ is the number of heads, $s$ is the sequence length, $d_k$ is the per-head dimension, and the factor of 2 accounts for both K and V caches.
Taking LLaMA-2 70B's dimensions as an example ($h = 64$, $d_k = 128$, 80 layers) and assuming standard MHA, when generating a sequence of length $s = 4096$, the KV Cache per request is:

$$2 \times 80 \times 64 \times 4096 \times 128 \times 2 \text{ bytes} \approx 10.7 \text{ GB}$$

At a batch size of just a dozen or so requests, the total KV Cache rivals the GPU memory occupied by the model parameters themselves. When serving multiple users simultaneously (batch serving), the KV Cache quickly exhausts GPU memory, becoming the core bottleneck for throughput.
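As a sanity check, the per-request footprint can be computed in a few lines of Python (a sketch; the helper name `kv_cache_bytes` is ours, not from any library):

```python
def kv_cache_bytes(layers, kv_heads, seq_len, d_k, bytes_per_elem=2):
    # 2x for K and V, cached at every layer for every KV head
    return 2 * layers * kv_heads * seq_len * d_k * bytes_per_elem

# 70B-scale MHA configuration: 80 layers, 64 KV heads, d_k=128, s=4096, FP16
print(kv_cache_bytes(80, 64, 4096, 128) / 1e9)  # ≈ 10.7 GB per request
```

Swapping in 8 KV heads (the GQA configuration discussed below) divides the result by 8, with everything else unchanged.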
Key observation: The KV Cache size is proportional to the number of heads $h$. If we can reduce the number of KV heads that need to be cached, we can directly shrink the KV Cache.
MQA: The Most Aggressive KV Reduction
Multi-Query Attention (MQA) was proposed by Noam Shazeer in 2019 (“Fast Transformer Decoding: One Write-Head is All You Need”) and represents the most aggressive KV reduction approach.
Core Idea
The modification in MQA is simple: all query heads share a single set of Key and Value.
- Each head still has its own independent $W_i^Q$ (Query projections remain distinct)
- But all heads share one $W^K$ and one $W^V$
Mathematical formulation:

$$\text{head}_i = \text{Attention}(X W_i^Q,\; X W^K,\; X W^V)$$

Note that $W^K$ and $W^V$ no longer have the subscript $i$ — they are shared across all heads.
KV Cache Reduction
Since there is only one set of KV, the cache size shrinks from:

$$2 \times h \times s \times d_k \;\longrightarrow\; 2 \times s \times d_k$$

A reduction of $h$ times! For a model with $h = 64$, this means the KV Cache shrinks to $1/64$ of the original.
The Cost
MQA’s reduction is extreme and comes with notable drawbacks:
- Quality degradation: All heads are forced to compute attention in the same KV subspace, losing MHA’s ability for “different heads to attend to different patterns”
- Training instability: Training an MQA model from scratch can be less stable and slower to converge
- The paper reports “only minor quality degradation,” but in practice, quality loss can be more noticeable on downstream tasks, especially those requiring fine-grained reasoning
GQA: The Grouped-Query Compromise
Grouped-Query Attention (GQA) was proposed by Ainslie et al. in 2023, offering an elegant compromise between MHA and MQA.
Core Idea
Divide the $h$ query heads into $g$ groups, with each group sharing one pair of KV heads.
- When $g = h$ (one head per group) → degenerates to standard MHA
- When $g = 1$ (all heads in one group) → degenerates to MQA
- When $1 < g < h$ → GQA, the compromise approach
Mathematical formulation (where the $i$-th query head belongs to group $j = \lceil i / (h/g) \rceil$):

$$\text{head}_i = \text{Attention}(X W_i^Q,\; X W_j^K,\; X W_j^V)$$
KV Cache Reduction
$$2 \times h \times s \times d_k \;\longrightarrow\; 2 \times g \times s \times d_k$$

A reduction of $h/g$ times. For example, with $h = 64$ and $g = 8$, the KV Cache shrinks to $1/8$ of the original.
Key Innovation: Uptraining
Another important contribution of the GQA paper is proposing a method to uptrain from an existing MHA checkpoint to GQA:
- Initialize the shared KV head by taking the mean of the multiple KV head weights within each group from the original MHA model
- Only about 5% of the original pretraining compute is needed to complete the conversion
- The converted model’s quality approaches the original MHA, while inference speed approaches MQA
This means there is no need to train from scratch — existing MHA models can be efficiently converted to GQA models.
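The mean-pooling initialization can be sketched as follows (assuming PyTorch and a weight layout where the $h$ per-head projection matrices are stacked along the output dimension; `mean_pool_kv` is a hypothetical helper name, not from the paper's codebase):

```python
import torch

def mean_pool_kv(W, h, g, d_k):
    """Initialize g shared KV projections from h MHA heads by averaging
    each group of h // g consecutive per-head weight matrices."""
    d_model = W.shape[1]
    per_head = W.view(h, d_k, d_model)            # split into per-head blocks
    grouped = per_head.view(g, h // g, d_k, d_model)
    return grouped.mean(dim=1).reshape(g * d_k, d_model)

# Toy example: 8 MHA KV heads pooled down to 2 GQA KV heads
W_k = torch.randn(8 * 4, 16)                      # h=8, d_k=4, d_model=16
W_k_gqa = mean_pool_kv(W_k, h=8, g=2, d_k=4)
print(W_k_gqa.shape)                              # torch.Size([8, 16])
```

After this initialization, the model is uptrained briefly so the query heads adapt to their now-shared KV subspaces.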
(Figure: the original MHA model has 8 independent KV heads, each with its own weight matrix.)
Structural Comparison: MHA vs MQA vs GQA
The head-to-KV mappings for the three attention mechanisms (using $h = 4$ as an example):
- MHA: 4 Query heads each map to 1 independent KV head (4 KV heads total)
- GQA (): 4 Query heads are divided into 2 groups, each sharing 1 KV head (2 KV heads total)
- MQA: All 4 Query heads share 1 KV head (1 KV head total)
KV Cache Memory Analysis: Concrete Numbers
Let us compute the KV Cache memory footprint using real model parameters. Assuming sequence length $s = 4096$ and FP16 (2 bytes per element):
| Model | Layers | $h$ | $d_k$ | KV heads | KV Cache / request |
|---|---|---|---|---|---|
| Hypothetical MHA-70B | 80 | 64 | 128 | 64 (MHA) | 10.7 GB |
| LLaMA-2 70B (GQA) | 80 | 64 | 128 | 8 | 1.3 GB |
| Hypothetical MQA-70B | 80 | 64 | 128 | 1 (MQA) | 0.17 GB |
Computation formula:

$$\text{KV Cache} = 2 \times L \times g \times s \times d_k \times \text{bytes per element}$$

where $L$ is the number of layers and $g$ is the number of KV heads. Taking LLaMA-2 70B's GQA configuration as an example:

$$2 \times 80 \times 8 \times 4096 \times 128 \times 2 \text{ bytes} \approx 1.3 \text{ GB}$$

Comparison: From MHA's 10.7 GB down to GQA's 1.3 GB, a reduction of approximately 8x ($64 / 8 = 8$), and with MQA it could be reduced 64x to just 0.17 GB.
Impact on Batch Serving
The impact of KV Cache reduction on batch inference is even more significant. Assuming the GPU has 40 GB of remaining memory available for KV Cache:
| Approach | KV Cache / request | Max concurrent requests |
|---|---|---|
| MHA | 10.7 GB | ~3 |
| GQA (8 groups) | 1.3 GB | ~30 |
| MQA | 0.17 GB | ~235 |
GQA increases concurrency capacity by approximately 10x — this has a decisive impact on cost and latency for LLM serving.
*Based on LLaMA-2 70B parameters ($L = 80$, $h = 64$, $d_k = 128$, GQA kv_heads $= 8$), FP16.*
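The concurrency numbers above follow directly from dividing the memory budget by the per-request cache size (a sketch using the article's rounded per-request figures):

```python
def max_concurrent(budget_gb, per_request_gb):
    # How many requests' KV Caches fit in the remaining GPU memory
    return int(budget_gb // per_request_gb)

for name, per_req in [("MHA", 10.7), ("GQA", 1.3), ("MQA", 0.17)]:
    print(name, max_concurrent(40, per_req))  # MHA 3, GQA 30, MQA 235
```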
Quality vs Performance Trade-off
Reducing the number of KV heads is fundamentally a form of information compression: forcing multiple query heads to find different attention patterns within the same KV subspace.
Why GQA’s Quality Loss Is Small
- Redundancy: Research has found that adjacent heads’ KV projections in MHA are often highly similar — many heads learn redundant KV representations
- Query diversity preserved: GQA retains the independence of all query heads; only the KV space is shared. Query projections can still learn different attention patterns within the shared KV space
- Uptraining effectiveness: Initializing from an MHA checkpoint via mean pooling + a small amount of continued training can efficiently recover quality
The GQA paper reports that with approximately 5% of the original pretraining compute for uptraining, the GQA model performs close to the original MHA model on most benchmarks, while achieving inference speeds close to MQA.
Sources of Speed Improvement
The speed improvements from KV Cache reduction come primarily from two aspects:
- Memory bandwidth: Autoregressive decoding is a memory-bandwidth-bound operation. A smaller KV Cache means less data needs to be loaded per generation step, directly improving generation speed
- Memory capacity: A smaller KV Cache allows larger batch sizes, improving GPU utilization and overall throughput
Real-World Adoption
GQA has become the standard configuration in current mainstream large language models:
| Model | Query Heads | KV Heads | Group Ratio (h/g) | Attention Type |
|---|---|---|---|---|
| LLaMA-2 7B | 32 | 32 | 1:1 | MHA |
| LLaMA-2 13B | 40 | 40 | 1:1 | MHA |
| LLaMA-2 70B | 64 | 8 | 8:1 | GQA |
| LLaMA-3 8B | 32 | 8 | 4:1 | GQA |
| LLaMA-3 70B | 64 | 8 | 8:1 | GQA |
| Mistral 7B | 32 | 8 | 4:1 | GQA |
| Gemini 1.0 Pro | — | 1 | — | MQA |
Notable trends:
- LLaMA-2 series: Only the largest 70B model uses GQA, while the smaller 7B and 13B still use standard MHA. This indicates that at the time, the KV Cache bottleneck was primarily a concern for large models
- LLaMA-3 series: All sizes (including 8B) adopt GQA, reflecting that GQA has been proven effective across all scales
- Mistral 7B: Uses GQA (4:1) even at the 7B scale, combined with sliding window attention to further optimize inference efficiency
- Gemini 1.0 Pro: Uses the more aggressive MQA approach, with all query heads sharing a single KV head
- Industry consensus: GQA has become the default choice for new models, with 8 KV heads being a common configuration
PyTorch Implementation Notes
Implementing GQA requires only minor modifications to standard MHA:
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# GQA: g KV heads, h query heads, each group of h//g queries shares one KV
class GroupedQueryAttention(nn.Module):
    def __init__(self, H, h, g, d_k):
        super().__init__()
        self.h = h                        # number of query heads
        self.g = g                        # number of KV heads (number of groups)
        self.d_k = d_k
        self.W_q = nn.Linear(H, h * d_k)  # h query heads
        self.W_k = nn.Linear(H, g * d_k)  # g KV heads
        self.W_v = nn.Linear(H, g * d_k)  # g KV heads
        self.W_o = nn.Linear(h * d_k, H)

    def forward(self, x):
        B, S, _ = x.shape
        Q = self.W_q(x).view(B, S, self.h, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, S, self.g, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, S, self.g, self.d_k).transpose(1, 2)

        # Key step: expand KV heads to match the number of query heads
        # Each KV head is repeated h//g times
        repeats = self.h // self.g
        K = K.repeat_interleave(repeats, dim=1)  # (B, h, S, d_k)
        V = V.repeat_interleave(repeats, dim=1)  # (B, h, S, d_k)

        # Standard attention computation
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        weights = F.softmax(scores, dim=-1)
        output = weights @ V
        output = output.transpose(1, 2).contiguous().view(B, S, -1)
        return self.W_o(output)
```
The key difference is that W_k and W_v have output dimension $g \times d_k$ (instead of $h \times d_k$), and then repeat_interleave is used to replicate each KV head $h/g$ times to match the number of query heads. Note that this replication does not increase the KV Cache size — only $g$ KV heads are actually cached.
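A small standalone check of the expansion step (assuming PyTorch): with $h = 8$ query heads and $g = 2$ KV heads, repeat_interleave maps query heads 0-3 to KV head 0 and heads 4-7 to KV head 1:

```python
import torch

B, S, g, h, d_k = 1, 3, 2, 8, 4
K = torch.randn(B, g, S, d_k)                     # only g KV heads are cached
K_expanded = K.repeat_interleave(h // g, dim=1)   # (B, h, S, d_k) for attention

assert K_expanded.shape == (B, h, S, d_k)
assert torch.equal(K_expanded[:, 0], K[:, 0])     # query head 0 -> KV head 0
assert torch.equal(K_expanded[:, 3], K[:, 0])     # query head 3 -> KV head 0
assert torch.equal(K_expanded[:, 4], K[:, 1])     # query head 4 -> KV head 1
```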
Summary
| Concept | Description |
|---|---|
| MHA bottleneck | KV Cache grows linearly with the number of heads, limiting inference efficiency and concurrency |
| MQA | All query heads share one KV pair, KV Cache reduced by $h$ times, but with noticeable quality loss |
| GQA | $h$ query heads divided into $g$ groups, each group sharing one pair of KV heads — a compromise approach |
| KV Cache reduction | MHA: $2 h s d_k$ → GQA: $2 g s d_k$ → MQA: $2 s d_k$ |
| Uptraining | Convert from an MHA checkpoint using ~5% of compute |
| Industry trend | GQA has become the standard configuration in LLaMA-3, Mistral, and other mainstream models |
Core intuition: A significant amount of information redundancy exists among KV heads in MHA. GQA allows a group of query heads to share the same KV head pair, reducing the KV Cache by several times with virtually no quality loss, thereby dramatically improving inference efficiency and serving concurrency. It is like a team meeting — not everyone needs to bring their own complete set of meeting materials; a few people can share one copy. What is saved is desk space (GPU memory), without affecting discussion quality (model capability).