Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
Updated 2026-04-13
The previous three articles introduced MoE, SSM, and Hybrid architectures as separate concepts. This article uses a real production model, Qwen3-Coder-Next 80B, to show how all three work together in a single architecture.
Qwen3-Coder-Next is a 79.7B-parameter Hybrid MoE model that activates only ~3B parameters per token (3.8% utilization). It fuses GatedDeltaNet (a delta-rule-based linear attention), standard full attention, and a 512-expert MoE across 48 layers, making it one of the most complex hybrid architectures to date.
1. Architecture Overview
1.1 Key Parameters
| Parameter | Value | Notes |
|---|---|---|
| Total layers | 48 | block_count=48 |
| Layer types | 36 recurrent + 12 full attention | 3:1 alternating |
| Hidden dim | 2048 | Much smaller than typical LLMs (LLaMA-70B uses 8192) |
| Total params | 79.7B | |
| Active params/token | ~3B | Only 3.8% activated |
| MoE config | 512 experts, top-10 | + 1 shared expert |
| Expert FFN dim | 512 | Small but numerous |
| Context length | 262,144 | 256K tokens |
1.2 The 3:1 Hybrid Layer Pattern
The 48 layers follow a fixed alternating pattern: every 4 layers form a cycle with 3 GatedDeltaNet (linear attention) layers followed by 1 full attention layer:
Layer 0: GatedDeltaNet      ← recurrent
Layer 1: GatedDeltaNet
Layer 2: GatedDeltaNet
Layer 3: Full Attention     ← every 4th layer
Layer 4: GatedDeltaNet
...
Layer 47: Full Attention    ← last layer is also full attention
Compared to Jamba's 7:1 (Mamba:Attention) ratio, Qwen3-Coder-Next gives full attention a higher share (25% vs 12.5%), likely optimized for coding tasks that require precise token retrieval.
Key design: All 48 layers share the same MoE FFN structure regardless of attention type. The layer type difference is only in the attention component.
1.3 GGUF Tensor View: What Does a Layer Look Like?
In llama.cpp's GGUF format, each layer consists of a set of named tensors. Below are two representative blocks from the actual GGUF file, one GatedDeltaNet layer and one Full Attention layer, giving a concrete view of each layer type's "physical composition."
Block 0 (GatedDeltaNet layer):
Tensor Shape Quant Role
───────────────────────────────────────────────────────────────────
# ── Attention/SSM ──
blk.0.attn_norm.weight [2048] F32 Pre-attention RMSNorm
blk.0.attn_qkv.weight [2048, 8192] Q4_K Combined Q+K+V projection
blk.0.attn_gate.weight [2048, 4096] Q4_K Output gate vector z
blk.0.ssm_ba.weight [2048, 64] Q4_K Combined Ξ²+Ξ± projection
blk.0.ssm_conv1d.weight [4, 8192] F32 1D causal conv kernel
blk.0.ssm_dt [32] F32 Time-step bias (dt_bias)
blk.0.ssm_a [32] F32 State decay coeff (-exp(A))
blk.0.ssm_norm.weight [128] F32 SSM output RMSNorm
blk.0.ssm_out.weight [4096, 2048] Q4_K Output projection
# ── MoE (same structure in all layers) ──
blk.0.post_attention_norm.weight [2048] F32 Pre-MoE RMSNorm
blk.0.ffn_gate_inp.weight [2048, 512] F32 Router (scores 512 experts)
blk.0.ffn_gate_exps.weight [2048, 512, 512] Q4_K 512 Expert gate projections
blk.0.ffn_up_exps.weight [2048, 512, 512] Q4_K 512 Expert up projections
blk.0.ffn_down_exps.weight [512, 2048, 512] Q6_K 512 Expert down projections
blk.0.ffn_gate_shexp.weight [2048, 512] Q4_K Shared Expert gate
blk.0.ffn_up_shexp.weight [2048, 512] Q4_K Shared Expert up
blk.0.ffn_down_shexp.weight [512, 2048] Q6_K Shared Expert down
blk.0.ffn_gate_inp_shexp.weight [2048, 1] F16 Shared Expert sigmoid gate
Block 3 (Full Attention layer):
Tensor Shape Quant Role
───────────────────────────────────────────────────────────────────
# ── Attention (entirely different from Block 0) ──
blk.3.attn_norm.weight [2048] F32 Pre-attention RMSNorm
blk.3.attn_q.weight [2048, 8192] Q4_K Q projection (includes gate)
blk.3.attn_k.weight [2048, 512] Q4_K K projection (2 heads Γ 256)
blk.3.attn_v.weight [2048, 512] Q6_K V projection (2 heads Γ 256)
blk.3.attn_q_norm.weight [256] F32 Q RMSNorm
blk.3.attn_k_norm.weight [256] F32 K RMSNorm
blk.3.attn_output.weight [4096, 2048] Q4_K Attention output projection
# ── MoE (identical to Block 0, omitted) ──
blk.3.ffn_* ... (same as above)
Key differences between the two layer types:
| Difference | GatedDeltaNet (Block 0) | Full Attention (Block 3) |
|---|---|---|
| QKV projection | Combined attn_qkv [2048→8192] | Separate attn_q/k/v |
| Gating | attn_gate (output gate z) | Sigmoid gate built into Q |
| Unique tensors | ssm_* (6 SSM-specific tensors) | attn_q/k_norm (QK normalization) |
| Output proj | ssm_out [4096→2048] | attn_output [4096→2048] |
| F32 precision | ssm_a, ssm_dt, ssm_conv1d all F32 | attn_v uses Q6_K |
Note that the MoE tensors (ffn_* prefix) are identical across both layer types; this is the concrete manifestation of the design principle from section 1.2: "layer type difference is only in the attention component."
2. GatedDeltaNet: Linear Attention Layers
The 36 recurrent layers use GatedDeltaNet rather than standard Mamba: a linear attention mechanism based on the delta rule, with a matrix state instead of a vector.
2.1 From Mamba to Delta Rule
Recall Mamba's state update from the SSM article:

$$h_t = \bar{A}_t \odot h_{t-1} + \bar{B}_t x_t$$

The state is a vector ($h_t \in \mathbb{R}^N$) with additive updates.
DeltaNet (Yang et al., 2024) promotes the state to a matrix $H_t \in \mathbb{R}^{d_v \times d_k}$ with delta rule updates:

$$H_t = H_{t-1}\left(I - \beta_t k_t k_t^\top\right) + \beta_t v_t k_t^\top$$

The state matrix can be viewed as an associative memory, a "fuzzy key-value store." Expanding the formula into its equivalent delta form makes the update semantics clearer:

$$H_t = H_{t-1} + \beta_t \left(v_t - H_{t-1} k_t\right) k_t^\top$$
Each term has a clear meaning:
- $H_{t-1} k_t$: retrieve from the state matrix using the current key, the "old value currently associated with $k_t$ in memory"
- $v_t$: the new value from the current token's input projection, the target we want to associate with $k_t$
- $v_t - H_{t-1} k_t$: the delta (correction), the error between the new target and the old memory
- $\beta_t \left(v_t - H_{t-1} k_t\right) k_t^\top$: write this correction back into the state matrix as an outer product
This is where the "delta rule" name comes from: instead of writing the new value $v_t$ directly, it writes only the correction $\beta_t (v_t - H_{t-1} k_t)$. Note that $v_t$ and $H_{t-1} k_t$ are both $d_v$-dimensional vectors but with entirely different meanings: the former comes from the current input projection, the latter is an old value retrieved from the compressed historical state.
Why does $H_{t-1} k_t$ retrieve the "old value"?
If you're familiar with standard Attention ($\text{softmax}(QK^\top)V$), you might wonder: don't you first compute QK similarity, then multiply by V to get the value? Why does multiplying $H_{t-1}$ by $k_t$ directly produce a value?
The key is how $H$ is constructed. Taking the simplest additive update: $H_t = H_{t-1} + v_t k_t^\top$. Unrolled, $H_t$ is the sum of outer products of all historical key-value pairs:

$$H_t = \sum_{i=1}^{t} v_i k_i^\top$$

When you query with the current $k_t$:

$$H_t k_t = \sum_{i=1}^{t} v_i \left(k_i^\top k_t\right)$$

The right-hand side is a weighted sum of all historical values, weighted by key-key dot-product similarity; essentially the same operation as standard Attention's $\text{softmax}(QK^\top)V$, except: (1) no softmax normalization; (2) all historical K/V are compressed into a fixed-size matrix $H$, so retrieval is a single matrix-vector multiply at cost $O(d_k d_v)$ instead of $O(t \cdot d)$.
Comparing the two update strategies:
| Strategy | Update rule | Effect |
|---|---|---|
| Additive (Mamba) | $H_t = H_{t-1} + v_t k_t^\top$ | Old and new values accumulate, information blurs |
| Delta rule | $H_t = H_{t-1} + \beta_t (v_t - H_{t-1} k_t) k_t^\top$ | Old value corrected toward new value, precise overwrite |
When $\beta_t = 1$ (and $k_t$ is unit-norm, which the L2 normalization in section 2.3 guarantees), the update yields $H_t k_t = v_t$, a perfect overwrite. This gives DeltaNet an advantage over Mamba in tasks requiring precise recall of recent key-value pairs (e.g., in-context recall).
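The contrast is easy to verify numerically. Below is a minimal NumPy sketch (toy dimensions, not the model's code) showing that the additive update blurs two values written under the same key, while the delta rule with $\beta_t = 1$ overwrites cleanly:

```python
# Minimal sketch (assumed toy shapes): additive accumulation vs. the
# delta rule when the same key is reused for a new value.
import numpy as np

d_k, d_v = 4, 4
rng = np.random.default_rng(0)
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)            # L2-normalized key, as in section 2.3
v_old = rng.standard_normal(d_v)  # value written earlier under key k
v_new = rng.standard_normal(d_v)  # new value to associate with the same key

# Additive update (Mamba-style): H += v k^T -- old and new values pile up.
H_add = np.outer(v_old, k)
H_add += np.outer(v_new, k)
print(np.allclose(H_add @ k, v_new))    # False: retrieves v_old + v_new

# Delta rule with beta = 1: write only the correction (v_new - H k).
H_delta = np.outer(v_old, k)
H_delta += np.outer(v_new - H_delta @ k, k)
print(np.allclose(H_delta @ k, v_new))  # True: perfect overwrite
```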
Gated DeltaNet (Yang, Kautz & Hatamizadeh, 2024) adds element-wise decay gating on top of the delta rule:

$$\tilde{H} = \exp(g_t) \odot H_{t-1}, \qquad H_t = \tilde{H} + \beta_t\left(v_t - \tilde{H} k_t\right) k_t^\top, \qquad o_t = H_t q_t$$

The difference from DeltaNet: before updating, the old state undergoes per-element exponential decay ($g_t < 0$ is input-dependent, giving each state element its own decay rate), then the delta rule update is applied, and finally a query vector $q_t$ reads the output. The complete five-step process:
- Decay: $\tilde{H} = \exp(g_t) \odot H_{t-1}$, apply per-element exponential decay to the old state
- Retrieve old value: $v_{\text{old}} = \tilde{H} k_t$, use $k_t$ to query the old value from the decayed state
- Compute delta: $\Delta v = v_t - v_{\text{old}}$, the error between the new target and the old memory
- Write correction: $H_t = \tilde{H} + \beta_t \Delta v\, k_t^\top$, write the correction back as an outer product
- Read output: $o_t = H_t q_t$, use $q_t$ to retrieve this layer's output from the updated state
Note that $q_t$ and $k_t$ both come from the same projection (attn_qkv) but serve different roles: $k_t$ is the write address (determines where in the state matrix a value is stored), $q_t$ is the read address (determines what information is retrieved as output). Separating the two allows writing and reading to focus on different features, the same Q/K division of labor as in standard Attention.
Decay gating lets the model actively forget information that is no longer needed ($\exp(g_t) < 1$ causes the old state to gradually decay), while the delta rule ensures precise overwrite for information that should be remembered. Flexible forgetting plus precise overwriting is the core advantage of GatedDeltaNet over pure Mamba.
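The five steps translate directly into code. Below is a minimal single-head, single-token sketch of the recurrence (illustrative shapes and names, not the model's actual implementation):

```python
# Minimal sketch of the five-step Gated DeltaNet recurrence for one head.
import numpy as np

def gated_delta_step(H, q, k, v, beta, gate):
    """One token update. H: [d_v, d_k] state; gate: per-element decay
    exponents (all < 0); beta: scalar write strength in (0, 1)."""
    H = np.exp(gate) * H                 # 1. decay the old state
    v_old = H @ k                        # 2. retrieve old value for key k
    delta = v - v_old                    # 3. error between target and memory
    H = H + beta * np.outer(delta, k)    # 4. outer-product write-back
    out = H @ q                          # 5. read the output with query q
    return H, out

d_k = d_v = 128
rng = np.random.default_rng(0)
H = np.zeros((d_v, d_k))
q = rng.standard_normal(d_k); q /= np.linalg.norm(q)
k = rng.standard_normal(d_k); k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
gate = -0.05 * np.ones((d_v, d_k))       # mild forgetting everywhere
H, out = gated_delta_step(H, q, k, v, beta=0.9, gate=gate)
```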
2.2 Complete Data Flow and Input Processing
Section 2.1 presented the complete Gated DeltaNet formula (five steps: decay → retrieve → delta → write → read), but there's a gap between formulas and the actual model: which GGUF tensor corresponds to which step? What transformations does data undergo from input to output? This section provides the overall flow diagram; subsequent sections (2.3-2.7) expand on each step in detail.
Forward Pass Flow
┌────────────────── GatedDeltaNet Layer Forward Pass ──────────────────┐

  ① Three parallel projections
  x ── attn_qkv  [2048→8192] ──→ qkv [8192]
  x ── attn_gate [2048→4096] ──→ z   [4096]  (output gate)
  x ── ssm_ba    [2048→64]   ──→ β+α [64]    (decay + write)

  ② Conv1d temporal fusion
  [conv_state; qkv] ── Conv1d(k=4) ──→ SiLU ──→ qkv'
         ↑ cached last 3 tokens

  ③ Split + preprocessing
  qkv' → Q [16×128] + K [16×128] + V [32×128]
  Q → L2Norm(Q) / √d_v             → q_t in formula
  K → L2Norm(K)                    → k_t in formula
  Q, K repeat 16→32 heads          → align with V's 32 heads
  β → sigmoid(β_raw)               → β_t ∈ (0,1) in formula
  gate → softplus(α + dt_bias) × (−exp(A))
                                   → g_t in formula, decay = exp(g_t)

  ④ Delta Rule state update
  H̃   = exp(g_t) ⊙ H_{t-1}         ← delta_state [128, 4096]
  H_t = H̃ + β_t · (v_t − H̃ k_t) k_tᵀ
  out_t = H_t · q_t

  ⑤ Output gating
  output = ssm_out( ssm_norm(out) ⊙ SiLU(z) )

└──────────────────────────────────────────────────────────────────────┘
Tensor-to-Formula Mapping
| GGUF Tensor | Shape | Quant | Formula / Role |
|---|---|---|---|
| attn_qkv | [2048, 8192] | Q4_K | Produces raw $q, k, v$ (split after Conv1d) |
| attn_gate | [2048, 4096] | Q4_K | Output gate $z$, controls SSM output flow in step ⑤ |
| ssm_conv1d | [4, 8192] | F32 | Causal conv kernel, fuses the last 3 tokens' context in step ② |
| ssm_ba | [2048, 64] | Q4_K | Produces $\beta_t$ (write strength) and $\alpha_t$ (raw decay rate) |
| ssm_dt | [32] | F32 | Time-step bias $dt\_bias$, added to $\alpha_t$ in step ③ |
| ssm_a | [32] | F32 | Decay coefficient $A$, used as $-\exp(A)$ so the decay exponent stays negative |
| ssm_norm | [128] | F32 | Output RMSNorm in step ⑤ (note: NOT the zero-centered variant) |
| ssm_out | [4096, 2048] | Q4_K | Projects the gated output back to hidden dim in step ⑤ |
Input Projections (Step ①)
Each layer performs three parallel projections on input $x \in \mathbb{R}^{2048}$:
| Projection | Dimensions | Quant | Output split |
|---|---|---|---|
| attn_qkv | 2048 → 8192 | Q4_K | Q [16×128] + K [16×128] + V [32×128] |
| attn_gate | 2048 → 4096 | Q4_K | Gate vector $z$ |
| ssm_ba | 2048 → 64 | Q4_K | Beta + Alpha combined (grouped by K-head) |
Note the asymmetric head structure: Q heads = K heads = 16, V heads = 32, head_dim = 128. Q and K share the same head count, but V has twice as many heads.
This creates a mismatch: the delta rule state update operates per-head, so Q/K/V head counts must align. The solution is to repeat Q and K from 16 heads to 32 heads; each Q/K head is duplicated twice to match V's 32 heads (see the sketch below). This mirrors the GQA approach from full attention: fewer K-heads are shared across multiple "working heads," saving Q/K projection parameters.
After expansion, the model has 32 independent delta rule states, each $128 \times 128$, for a total state size of $32 \times 128 \times 128 = 524{,}288$ elements (2 MB in f32). More V-heads mean larger state capacity, while Q/K serve only as "addresses" and need fewer parameters.
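A small sketch of the head expansion and the resulting state layout (shapes from this section; the [n_heads, head_dim] layout is an assumption for illustration):

```python
# Sketch of the Q/K head expansion described above (illustrative only).
import numpy as np

head_dim, n_qk_heads, n_v_heads = 128, 16, 32
q = np.zeros((n_qk_heads, head_dim))   # 16 Q heads from attn_qkv
k = np.zeros((n_qk_heads, head_dim))   # 16 K heads
v = np.zeros((n_v_heads, head_dim))    # 32 V heads

# Duplicate each Q/K head twice so every V head has its own delta-rule
# state -- the linear-attention analogue of GQA head sharing.
q = np.repeat(q, n_v_heads // n_qk_heads, axis=0)   # [32, 128]
k = np.repeat(k, n_v_heads // n_qk_heads, axis=0)   # [32, 128]
assert q.shape == k.shape == v.shape == (32, 128)

# 32 per-head states of 128x128 -> 32*128*128 = 524,288 elements (~2 MB f32)
state = np.zeros((n_v_heads, head_dim, head_dim))
```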
Why Conv1d? (Step ②)
The attn_qkv projection produces an 8192-dimensional QKV vector that only contains information from the current token $x_t$. But in the delta rule state update (step ④), the quality of $k_t$ and $v_t$ directly determines how good the key-value associations written to the state matrix are. If $q_t, k_t, v_t$ depend entirely on a single token, expressiveness is limited.
Conv1d addresses this: with a kernel=4 causal convolution, the current token's QKV incorporates information from the previous 3 tokens before entering the delta rule:

$$\text{qkv}'_t = \text{SiLU}\left(\sum_{i=0}^{3} w_i \odot \text{qkv}_{t-i}\right)$$
This provides two key benefits:
- Lossless short-range context: The last 3 tokens' information is directly injected into Q/K/V via convolution, bypassing the lossy compression of state matrix $H$. For dependencies within the 4-token window, the model has access to precise, original information
- Complementary to recurrence: Conv1d handles precise short-range dependencies (4-token window), the delta rule handles long-range dependencies (compressed via state matrix $H$), a clear division of labor
ssm_conv1d.weight [4, 8192] is a depthwise convolution kernel: each of the 8192 channels has its own independent 4-element kernel. It doesn't mix information across Q/K/V channels; it only fuses along the time dimension. At runtime, a conv_state [3, 8192] cache (see section 2.7) stores the previous 3 tokens' qkv values.
This project → conv1d → recurrence three-stage pipeline is inherited from Mamba, the same pattern introduced in the SSM article. After Conv1d + SiLU, the output is split into Q, K, V vectors for subsequent normalization and state update.
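The following sketch shows the depthwise causal Conv1d with its rolling cache, with the channel count shrunk from 8192 to 8 for readability (function names are illustrative):

```python
# Minimal sketch of the depthwise causal Conv1d with a rolling cache.
import numpy as np

n_channels, kernel = 8, 4
conv_w = np.random.default_rng(0).standard_normal((kernel, n_channels))
conv_state = np.zeros((kernel - 1, n_channels))   # last 3 tokens' qkv

def silu(x):
    return x / (1.0 + np.exp(-x))

def conv1d_step(qkv, conv_state, conv_w):
    """Fuse the current token with the 3 cached ones, per channel."""
    window = np.vstack([conv_state, qkv[None, :]])   # [4, C]
    out = silu((window * conv_w).sum(axis=0))        # depthwise conv + SiLU
    conv_state = window[1:]                          # slide the cache
    return out, conv_state

qkv_t = np.ones(n_channels)
qkv_fused, conv_state = conv1d_step(qkv_t, conv_state, conv_w)
```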
2.3 Q/K Normalization and Beta Activation
Formula mapping: preprocessing $q_t, k_t$ (normalization) and $\beta_t$ (activation), preparing variables before they enter the delta rule.
After Conv1d splits out Q, K, V, several additional processing steps are needed before entering the delta rule:
- L2 normalization: Q and K are each L2-normalized ($k_t = k / \|k\|_2$), projecting them onto the unit sphere to stabilize delta rule key-value associations
- Q scaling: $q_t = \text{L2Norm}(q) / \sqrt{d_v}$, preventing dot products from growing with dimension
- Beta sigmoid: $\beta_t = \sigma(\beta_{\text{raw}})$, constraining write strength to $(0, 1)$
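In code, the three preprocessing steps are a few lines (single-head sketch, names assumed):

```python
# Minimal sketch of the Q/K/beta preprocessing for one head.
import numpy as np

def preprocess(q_raw, k_raw, beta_raw, d_v=128):
    q = q_raw / np.linalg.norm(q_raw) / np.sqrt(d_v)  # L2 norm + scaling
    k = k_raw / np.linalg.norm(k_raw)                 # unit-sphere key
    beta = 1.0 / (1.0 + np.exp(-beta_raw))            # sigmoid -> (0, 1)
    return q, k, beta

rng = np.random.default_rng(0)
q, k, beta = preprocess(rng.standard_normal(128),
                        rng.standard_normal(128), beta_raw=0.5)
```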
2.4 Decay Gating
Formula mapping: step 1's $\tilde{H} = \exp(g_t) \odot H_{t-1}$, constructing the decay exponent $g_t$ that controls per-element forgetting of the old state.
The decay gate and write strength come from a single combined projection:
| Parameter | Shape | Meaning |
|---|---|---|
| ssm_ba | [2048, 64] | Combined beta+alpha projection (grouped by K-head) |
| ssm_dt | [32] | Time-step bias $dt\_bias$ (learnable) |
| ssm_a | [32] | State decay coefficient, stored as $A$ and used as $-\exp(A)$ |
The 64-dimensional output of ssm_ba is split by K-head groups: 16 K-heads × 4 dims = 64, where each group of 4 contains 2 beta values (for the 2 V-heads mapped to that K-head) and 2 alpha values. After splitting, both beta and alpha have 32 values (one per V-head).
The decay exponent is then $g_t = -\exp(A) \odot \text{softplus}(\alpha_t + dt\_bias)$. Since both factors are positive, $g_t < 0$ and $\exp(g_t) \in (0, 1)$, ensuring genuine exponential decay, similar to Mamba's $A$ parameter.
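A sketch of the gate computation using the tensor names from the table above (the exact softplus/exp composition follows this section's description and is an assumption about implementation details):

```python
# Sketch of the decay-gate computation (one value per V-head).
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def decay_factor(alpha_raw, dt_bias, A):
    """g_t < 0 always, so exp(g_t) lies in (0, 1)."""
    g = -np.exp(A) * softplus(alpha_raw + dt_bias)
    return np.exp(g)               # per-head decay factor for step 1

alpha_raw = np.zeros(32)           # from ssm_ba, one per V-head
dt_bias = np.zeros(32)             # ssm_dt
A = np.zeros(32)                   # ssm_a (stored coefficient)
print(decay_factor(alpha_raw, dt_bias, A))   # all strictly below 1
```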
2.5 Chunked Parallel Computation
Formula mapping: executing steps 1-5, the decay, state update $H_t$, and output $o_t = H_t q_t$.
Autoregressive mode (decode, 1 token at a time) directly applies the recurrence to update the state:

$$\tilde{H} = \exp(g_t) \odot H_{t-1}, \qquad H_t = \tilde{H} + \beta_t\left(v_t - \tilde{H} k_t\right) k_t^\top, \qquad o_t = H_t q_t$$

Chunked parallel mode (prefill) splits the sequence into chunks of 64 tokens: within each chunk, outputs are computed in parallel as decayed causal attention, and the recurrent state is materialized only at chunk boundaries.
State propagates across chunks, similar to Mamba-2's SSD algorithm.
2.6 Output Gating
Formula mapping: post-processing after $o_t = H_t q_t$; the raw delta rule output needs gating and projection to become the layer output.
The SSM output goes through RMSNorm, then element-wise multiplication with the SiLU-activated gate:

$$y_t = \text{RMSNorm}(o_t) \odot \text{SiLU}(z_t)$$

Finally, $y_t$ is projected back to the hidden dimension via ssm_out ($4096 \to 2048$).
2.7 Cache
Formula mapping: persistent storage of the state matrix $H$ and Conv1d history across tokens, the foundation that enables the recurrence to carry information across tokens.
Each layer maintains two fixed-size states:
| Cache type | Shape | Size (f32) | Purpose |
|---|---|---|---|
| delta_state | [32, 128, 128] | ~2 MB/seq | Recurrent state matrix $H$ |
| conv_state | [3, 8192] | ~96 KB/seq | Conv1d history (last 3 tokens) |
Unlike full attention's KV cache (which grows linearly with sequence length), delta_state is fixed-size, the core SSM advantage.
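A quick worked check of the two cache sizes:

```python
# Worked size check for the per-layer caches (f32, one sequence).
n_heads, head_dim = 32, 128
delta_state_bytes = n_heads * head_dim * head_dim * 4
print(delta_state_bytes / 2**20)    # 2.0 MB, independent of sequence length

conv_channels, conv_history = 8192, 3
conv_state_bytes = conv_history * conv_channels * 4
print(conv_state_bytes / 2**10)     # 96.0 KB
```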
3. Gated Full Attention Layers
The full attention layers (every 4th layer) use sigmoid-gated GQA with several key differences from standard Transformer attention.
3.1 Gated Q Projection
The Q projection outputs 2× the head_dim per head, half for the query and half for a gate:

$$[q_t;\, g_t] = W_q x_t \qquad (W_q: 2048 \to 16 \times 2 \times 256)$$

After attention computation, the output is multiplied element-wise by the sigmoid of the gate:

$$o_t = \text{Attn}(q_t, K, V) \odot \sigma(g_t)$$
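A single-head sketch of the split-and-gate pattern (the attention computation itself is elided and replaced by a stand-in vector):

```python
# Sketch: split the doubled Q projection into query and gate, then
# sigmoid-gate the attention result.
import numpy as np

head_dim = 256
rng = np.random.default_rng(0)
q_full = rng.standard_normal(2 * head_dim)    # one head's slice of attn_q
q, g = q_full[:head_dim], q_full[head_dim:]   # query | gate

attn_result = rng.standard_normal(head_dim)   # placeholder for Attn(q, K, V)
gated = attn_result * (1.0 / (1.0 + np.exp(-g)))   # elementwise sigmoid gate
```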
3.2 GQA: 16 Q Heads, 2 KV Heads
| Projection | Dimensions | Quantization | Head config |
|---|---|---|---|
| attn_q | 2048 → 8192 | Q4_K | 16 heads × 256 head_dim (×2 with gate) |
| attn_k | 2048 → 512 | Q4_K | 2 heads × 256 head_dim |
| attn_v | 2048 → 512 | Q6_K | 2 heads × 256 head_dim |
This is 8:1 GQA (Grouped-Query Attention): 16 Q heads share 2 KV heads.
3.3 Partial RoPE
Positional encoding applies to only the first 64 of the 256 head dimensions:

$$q^{[0:64]} \leftarrow \text{RoPE}\left(q^{[0:64]}\right), \qquad q^{[64:256]} \text{ unchanged (same for } k\text{)}$$

75% of head dimensions remain position-agnostic, focusing on semantic content. RoPE uses the NeoX layout.
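A sketch of partial RoPE in the NeoX layout (the base frequency here is the conventional 10000 for illustration; the model's actual rotary base is set in the GGUF metadata and may differ):

```python
# Sketch: rotate only the first 64 of 256 dims, NeoX pairing (i, i+32).
import numpy as np

def partial_rope_neox(x, pos, rot_dim=64, base=10000.0):
    """x: [head_dim]; dims >= rot_dim pass through unchanged."""
    half = rot_dim // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x = x.copy()
    x1, x2 = x[:half].copy(), x[half:rot_dim].copy()
    x[:half]        = x1 * cos - x2 * sin   # NeoX pairs (i, i + rot_dim/2)
    x[half:rot_dim] = x1 * sin + x2 * cos
    return x                                # x[64:] untouched

rotated = partial_rope_neox(np.ones(256), pos=7)
```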
3.4 QK Norm
Both Q and K pass through per-head RMSNorm (weight shape [256]) before RoPE:

$$q \leftarrow \text{RMSNorm}(q), \qquad k \leftarrow \text{RMSNorm}(k)$$
Essential for stabilizing attention scores at head_dim = 256.
4. MoE: 512 Expert Top-10 Routing
All 48 layers use the same MoE FFN structure.
4.1 Router
A full-precision (F32) linear projection maps hidden states to 512-dimensional expert-score space:

$$s = \text{softmax}(W_{\text{router}}\, x), \qquad \mathcal{E} = \text{TopK}(s, 10)$$

Selected routing weights are renormalized (norm_top_k_prob=true):

$$w_i = \frac{s_i}{\sum_{j \in \mathcal{E}} s_j}, \qquad i \in \mathcal{E}$$
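A minimal sketch of the routing math (random weights; `route` is a hypothetical helper, not llama.cpp's API):

```python
# Sketch of top-10 routing with renormalization over the selected experts.
import numpy as np

def route(x, w_router, top_k=10):
    logits = w_router @ x                     # [512] expert scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over 512 experts
    top = np.argsort(probs)[-top_k:]          # indices of the top-10 experts
    weights = probs[top] / probs[top].sum()   # renormalize to sum to 1
    return top, weights

rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
w_router = rng.standard_normal((512, 2048)).astype(np.float32)  # F32 router
experts, weights = route(x, w_router)
```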
4.2 Expert FFN
Each expert is a small SiLU-gated FFN:

$$\text{FFN}(x) = W_{\text{down}}\left(\text{SiLU}(W_{\text{gate}}\, x) \odot W_{\text{up}}\, x\right)$$

| Projection | Dimensions (per expert) | Quant | Per-layer size (×512) |
|---|---|---|---|
| ffn_gate_exps | 2048 → 512 | Q4_K | 288 MB |
| ffn_up_exps | 2048 → 512 | Q4_K | 288 MB |
| ffn_down_exps | 512 → 2048 | Q6_K | 420 MB |
| Total | | | 996 MB |
4.3 Shared Expert
A DeepSeek-style shared expert that all tokens pass through, with a scalar sigmoid gate (ffn_gate_inp_shexp):

$$y_{\text{shared}} = \sigma\!\left(w_{\text{shexp}}^\top x\right) \cdot \text{FFN}_{\text{shared}}(x)$$
Final MoE output = weighted sum of routed experts + shared expert output.
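A sketch of one routed expert combined with the shared expert (for brevity the same random weights are reused for both; in the real model every expert has its own):

```python
# Sketch of the expert FFN and shared-expert combination (dims from above).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 2048, 512

def silu(x):
    return x / (1.0 + np.exp(-x))

def expert_ffn(x, w_gate, w_up, w_down):
    return w_down @ (silu(w_gate @ x) * (w_up @ x))   # SiLU-gated FFN

x = rng.standard_normal(d_model)
w_g, w_u = rng.standard_normal((2, d_ff, d_model))
w_d = rng.standard_normal((d_model, d_ff))

routed = 0.3 * expert_ffn(x, w_g, w_u, w_d)   # one of 10 weighted experts
g_sh = rng.standard_normal(d_model)           # ffn_gate_inp_shexp [2048, 1]
shared = (1 / (1 + np.exp(-(g_sh @ x)))) * expert_ffn(x, w_g, w_u, w_d)
moe_out = routed + shared                     # routed sum + shared expert
```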
4.4 Sparsity Analysis
| Metric | Qwen3-CN | DeepSeek-V3 | Mixtral |
|---|---|---|---|
| Expert count | 512 | 256 | 8 |
| Top-K | 10 | 8 | 2 |
| Activation ratio | 1.95% | 3.1% | 25% |
| Total params | 79.7B | 671B | 47B |
| Active params | ~3B | ~37B | ~13B |
5. Layer Structure and Data Flow
5.1 Full Forward Pass
Each layer follows a Pre-Norm dual-residual structure:
Input x [2048]
  │
  ├─ RMSNorm (attn_norm)
  │    ├─ QKV Projection → Conv1d(k=4) + SiLU → Q, K, V
  │    ├─ Gate Projection → z
  │    ├─ Beta/Alpha → β, g
  │    ├─ State Update: H = exp(g) ⊙ H_prev + β·(v − H̃k)·kᵀ
  │    ├─ Output: out = H · q
  │    ├─ Gated: RMSNorm(out) ⊙ SiLU(z)
  │    └─ SSM Out Projection [4096 → 2048]
  │
  ├─ Residual: h₁ = x + attn_out
  │
  ├─ RMSNorm (post_attention_norm)
  │    ├─ Router [2048 → 512] → Softmax → TopK(10)
  │    ├─ 10× Expert FFN: SiLU(gate·x) ⊙ up·x → down
  │    ├─ Weighted Sum (normalized weights)
  │    ├─ Shared Expert FFN + Sigmoid Gate
  │    └─ moe_out = routed + shared
  │
  └─ Residual: h₂ = h₁ + moe_out → Output [2048]
5.2 Zero-Centered RMSNorm
All RMSNorm layers (except the internal ssm_norm) use the zero-centered variant:

$$y = \frac{x}{\text{RMS}(x)} \odot (1 + w)$$

Weights are stored as the offset $w$, with the +1 applied at compute time. When the stored $w = 0$, the learned scale is exactly 1 and the layer reduces to plain RMS normalization, aiding training stability.
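A sketch of the variant (the epsilon value is an assumption):

```python
# Sketch of zero-centered RMSNorm: stored weight is an offset from 1.
import numpy as np

def zero_centered_rmsnorm(x, w, eps=1e-6):
    rms = np.sqrt(np.mean(x * x) + eps)
    return (x / rms) * (1.0 + w)    # stored w = 0 -> pure normalization

x = np.random.default_rng(0).standard_normal(2048)
w = np.zeros(2048)                  # scale exactly 1 at initialization
y = zero_centered_rmsnorm(x, w)
```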
5.3 Embedding and Output
| Component | Dimensions | Quantization |
|---|---|---|
| token_embd | n_vocab × 2048 | Q4_K |
| output_norm | [2048] | F32 |
| output | 2048 → n_vocab | Q6_K |
6. Quantization and Weight Distribution
6.1 Q4_K_M Mixed Quantization
| Component | Quantization | Reason |
|---|---|---|
| Expert gate/up | Q4_K | Most numerous, aggressive compression |
| Expert down, attn_v | Q6_K | Directly affects residual precision |
| Router, Norms | F32 | Precision-critical |
| SSM params (A, dt, conv1d) | F32 | Decay rates need high precision |
6.2 Weight Distribution: 97% is MoE Expert
Per-layer breakdown:
| Component | Per-layer size | Share |
|---|---|---|
| MoE Experts (512) | ~996 MB | 97% |
| Dense (Attention/SSM + Shared Expert + Norms) | ~25 MB | 3% |
Full model:
| Component | Size | Share |
|---|---|---|
| 48 layers MoE experts | ~47.8 GB | 96% |
| 48 layers dense | ~1.2 GB | 2.4% |
| Embedding + Output | ~0.4 GB | 0.8% |
| Total | ~49.4 GB |
Dense weights are only ~1.2 GB and fit easily on any GPU. MoE experts dominate and are the primary target for GPU/CPU offloading strategies.
7. Comparison with Other Hybrid Models
| Feature | Jamba 52B | Zamba2 2.7B | Hymba 1.5B | Qwen3-CN 80B |
|---|---|---|---|---|
| SSM type | Mamba | Mamba2 | Mamba2 | GatedDeltaNet |
| Fusion | Interleaved 7:1 | Shared Attn | Parallel heads | Interleaved 3:1 |
| MoE | 16E, top-2 | None | None | 512E, top-10 |
| Total params | 52B | 2.7B | 1.5B | 79.7B |
| Active params | 12B | 2.7B | 1.5B | ~3B |
| Attn layer share | 12.5% | ~15% | 50% | 25% |
| State type | Vector | Vector | Vector+KV | Matrix |
Summary
Qwen3-Coder-Next 80B deeply fuses SSM, Attention, and MoE:
| Component | Technical choice | Role |
|---|---|---|
| 36 GatedDeltaNet layers | Matrix state + delta rule + exp decay gating | Linear-complexity long-sequence modeling |
| 12 Full Attention layers | Sigmoid gating + 8:1 GQA + Partial RoPE | Precise token retrieval and ICL |
| 48 MoE layers | 512 experts top-10 + shared expert | Parameter efficiency: 80B params, 3B compute |
This design reflects the current trend: combining complementary mechanisms rather than pushing a single one to its extreme. Linear attention provides efficiency, full attention provides precision, and MoE provides parameter-compute decoupling, each compensating for the others' weaknesses.
Further Reading
- Mixture of Experts: MoE routing, load balancing, DeepSeek innovations
- State Space Models and Mamba: SSM basics, selective mechanism, SSD duality
- Hybrid Architectures: Jamba, Zamba2, Hymba design comparison
- GQA and MQA: Grouped-Query Attention principles
- KV Cache: caching mechanisms and optimization