
Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge

Updated 2026-04-13

The previous three articles introduced MoE, SSM, and Hybrid architectures as separate concepts. This article uses a real production model — Qwen3-Coder-Next 80B — to show how all three work together in a single architecture.

Qwen3-Coder-Next is a 79.7B-parameter Hybrid MoE model that activates only ~3B parameters per token (3.8% utilization). It fuses GatedDeltaNet (a delta-rule-based linear attention), standard full attention, and a 512-expert MoE across 48 layers — one of the most complex hybrid architectures to date.


1. Architecture Overview

1.1 Key Parameters

Parameter             Value                                    Notes
──────────────────────────────────────────────────────────────────────────────────────────────────
Total layers          48                                       block_count=48
Layer types           36 recurrent + 12 full attention         3:1 alternating
Hidden dim            2048                                     Much smaller than typical LLMs (LLaMA-70B uses 8192)
Total params          79.7B
Active params/token   ~3B                                      Only 3.8% activated
MoE config            512 experts, top-10 + 1 shared expert
Expert FFN dim        512                                      Small but numerous
Context length        262,144                                  256K tokens

1.2 The 3:1 Hybrid Layer Pattern

The 48 layers follow a fixed alternating pattern — every 4 layers form a cycle with 3 GatedDeltaNet (linear attention) layers followed by 1 full attention layer:

Layer  0: GatedDeltaNet    ← recurrent
Layer  1: GatedDeltaNet
Layer  2: GatedDeltaNet
Layer  3: Full Attention    ← every 4th layer
Layer  4: GatedDeltaNet
...
Layer 47: Full Attention    ← last layer is also full attention

Compared to Jamba's 7:1 (Mamba:Attention) ratio, Qwen3-Coder-Next gives full attention a higher share (25% vs. 12.5%), likely optimized for coding tasks that require precise token retrieval.

Key design: All 48 layers share the same MoE FFN structure regardless of attention type. The layer type difference is only in the attention component.
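
A one-line way to express the layer pattern (illustrative Python, not taken from any implementation): a layer is full attention exactly when its index modulo 4 equals 3.

def layer_type(i: int) -> str:
    # 3:1 pattern: layers 3, 7, ..., 47 are full attention, the rest are GatedDeltaNet
    return "full_attention" if i % 4 == 3 else "gated_deltanet"

assert sum(layer_type(i) == "full_attention" for i in range(48)) == 12  # 12 attention layers
assert sum(layer_type(i) == "gated_deltanet" for i in range(48)) == 36  # 36 recurrent layers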

1.3 GGUF Tensor View: What Does a Layer Look Like?

In llama.cpp's GGUF format, each layer consists of a set of named tensors. Below are two representative blocks from the actual GGUF file — one GatedDeltaNet layer and one Full Attention layer — giving you a concrete view of each layer's "physical composition."

Block 0 (GatedDeltaNet layer):

Tensor                            Shape             Quant   Role
───────────────────────────────────────────────────────────────────
  # ── Attention/SSM ──
blk.0.attn_norm.weight           [2048]            F32     Pre-attention RMSNorm
blk.0.attn_qkv.weight           [2048, 8192]      Q4_K    Combined Q+K+V projection
blk.0.attn_gate.weight           [2048, 4096]      Q4_K    Output gate vector z
blk.0.ssm_ba.weight              [2048, 64]        Q4_K    Combined Ξ²+Ξ± projection
blk.0.ssm_conv1d.weight          [4, 8192]         F32     1D causal conv kernel
blk.0.ssm_dt                     [32]              F32     Time-step bias (dt_bias)
blk.0.ssm_a                      [32]              F32     State decay coeff (-exp(A))
blk.0.ssm_norm.weight            [128]             F32     SSM output RMSNorm
blk.0.ssm_out.weight             [4096, 2048]      Q4_K    Output projection

  # ── MoE (same structure in all layers) ──
blk.0.post_attention_norm.weight [2048]            F32     Pre-MoE RMSNorm
blk.0.ffn_gate_inp.weight        [2048, 512]       F32     Router (scores 512 experts)
blk.0.ffn_gate_exps.weight       [2048, 512, 512]  Q4_K    512 Expert gate projections
blk.0.ffn_up_exps.weight         [2048, 512, 512]  Q4_K    512 Expert up projections
blk.0.ffn_down_exps.weight       [512, 2048, 512]  Q6_K    512 Expert down projections
blk.0.ffn_gate_shexp.weight      [2048, 512]       Q4_K    Shared Expert gate
blk.0.ffn_up_shexp.weight        [2048, 512]       Q4_K    Shared Expert up
blk.0.ffn_down_shexp.weight      [512, 2048]       Q6_K    Shared Expert down
blk.0.ffn_gate_inp_shexp.weight  [2048, 1]         F16     Shared Expert sigmoid gate

Block 3 (Full Attention layer):

Tensor                            Shape             Quant   Role
───────────────────────────────────────────────────────────────────
  # ── Attention (entirely different from Block 0) ──
blk.3.attn_norm.weight           [2048]            F32     Pre-attention RMSNorm
blk.3.attn_q.weight              [2048, 8192]      Q4_K    Q projection (includes gate)
blk.3.attn_k.weight              [2048, 512]       Q4_K    K projection (2 heads × 256)
blk.3.attn_v.weight              [2048, 512]       Q6_K    V projection (2 heads × 256)
blk.3.attn_q_norm.weight         [256]             F32     Q RMSNorm
blk.3.attn_k_norm.weight         [256]             F32     K RMSNorm
blk.3.attn_output.weight         [4096, 2048]      Q4_K    Attention output projection

  # ── MoE (identical to Block 0, omitted) ──
blk.3.ffn_*                      ...                       (same as above)

Key differences between the two layer types:

Difference       GatedDeltaNet (Block 0)              Full Attention (Block 3)
──────────────────────────────────────────────────────────────────────────────────
QKV projection   Combined attn_qkv [2048→8192]        Separate attn_q/k/v
Gating           attn_gate (output gate z)            Sigmoid gate built into Q
Unique tensors   ssm_* (6 SSM-specific tensors)       attn_q/k_norm (QK normalization)
Output proj      ssm_out [4096→2048]                  attn_output [4096→2048]
F32 precision    ssm_a, ssm_dt, ssm_conv1d all F32    attn_v uses Q6_K

Note that the MoE tensors (ffn_* prefix) are identical across both layer types — this is the concrete manifestation of the design principle from section 1.2: "the layer type difference is only in the attention component."


2. GatedDeltaNet: Linear Attention Layers

The 36 recurrent layers use GatedDeltaNet rather than standard Mamba — a linear attention mechanism based on the delta rule, with a matrix state instead of a vector one.

2.1 From Mamba to Delta Rule

Recall Mamba's state update from the SSM article:

h_t = \bar{A}_t \cdot h_{t-1} + \bar{B}_t \cdot x_t

The state h_t is a vector in ℝ^N with additive updates.

DeltaNet (Yang et al., 2024) promotes the state to a matrix H_t ∈ ℝ^{d_k×d_v} with delta rule updates:

H_t = (I - \beta_t k_t k_t^\top) H_{t-1} + \beta_t v_t k_t^\top

The state matrix H can be viewed as an associative memory — a "fuzzy key-value store." Expanding the formula into its equivalent delta form makes the update semantics clearer:

H_t = H_{t-1} + \beta_t \cdot k_t \left(v_t - H_{t-1} k_t\right)^\top

Each term has a clear meaning:

  • H_{t-1} k_t ∈ ℝ^{d_v}: retrieve from the state matrix using the current key k_t — the "old value currently associated with k_t in memory"
  • v_t ∈ ℝ^{d_v}: the new value from the current token's input projection — the target we want to associate with k_t
  • v_t − H_{t-1} k_t: the delta (correction) — the error between the new target and the old memory
  • β_t · k_t (⋯)ᵀ: write this correction back into the state matrix as an outer product

This is where the "delta rule" name comes from — instead of writing the new value v_t directly, it writes only the correction v_t − H_{t-1} k_t. Note that v_t and H_{t-1} k_t are both ℝ^{d_v} vectors but with entirely different meanings: the former comes from the current input projection, the latter is an old value retrieved from the compressed historical state.

Why does H_{t-1} k_t retrieve the "old value"?

If you're familiar with standard Attention (softmax(QKᵀ)·V), you might wonder: don't you first compute QK similarity, then multiply by V to get the value? Why does multiplying H by k directly produce a value?

The key is how H is constructed. Take the simplest additive update: H_t = H_{t-1} + k_t v_tᵀ. Unrolled, H is the sum of outer products of all historical key-value pairs:

H_{t-1} = \sum_{i=1}^{t-1} k_i \, v_i^\top

When you query with the current k_t:

H_{t-1} k_t = \left(\sum_{i=1}^{t-1} k_i v_i^\top\right) k_t = \sum_{i=1}^{t-1} \underbrace{(k_i^\top k_t)}_{\text{key similarity}} \, v_i

The right-hand side is a weighted sum of all historical values, weighted by key-key dot-product similarity — essentially the same operation as standard Attention's ∑ softmax(qᵀk_i)·v_i, except: (1) there is no softmax normalization; (2) all historical K/V are compressed into a fixed-size matrix H, so retrieval is a single matrix-vector multiply at O(1) cost instead of O(n).
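
A small numpy sketch of this identity (illustrative only; the dimensions are made up): build H as a sum of outer products and check that querying it returns the key-similarity-weighted sum of stored values.

import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, n = 8, 4, 5
keys = [rng.standard_normal(d_k) for _ in range(n)]
vals = [rng.standard_normal(d_v) for _ in range(n)]

# Additive associative memory: H = sum_i k_i v_i^T  (shape [d_k, d_v])
H = sum(np.outer(k, v) for k, v in zip(keys, vals))

q = rng.standard_normal(d_k)                              # query key
retrieved = H.T @ q                                       # one matrix-vector multiply, O(1) in sequence length
expected = sum((k @ q) * v for k, v in zip(keys, vals))   # explicit similarity-weighted sum, O(n)
assert np.allclose(retrieved, expected)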

Comparing the two update strategies:

Strategy           Update rule                 Effect
────────────────────────────────────────────────────────────────────────────────────────────
Additive (Mamba)   H += v·kᵀ                   Old and new values accumulate, information blurs
Delta rule         H += β·k(v_new − Hk)ᵀ       Old value corrected toward the new value, precise overwrite

When β_t = 1, the update yields H_t k_t = v_t — a perfect overwrite. This gives DeltaNet an advantage over Mamba in tasks requiring precise recall of recent key-value pairs (e.g., in-context recall).
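
A companion sketch of the delta-rule write (again illustrative): with a unit-norm key and β = 1, the value associated with that key is overwritten exactly.

import numpy as np

rng = np.random.default_rng(1)
d_k, d_v = 8, 4
H = rng.standard_normal((d_k, d_v))       # arbitrary existing memory state
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)                    # unit-norm key (GatedDeltaNet L2-normalizes keys)
v_new = rng.standard_normal(d_v)

old = H.T @ k                             # value currently associated with k
H = H + np.outer(k, v_new - old)          # delta-rule write with beta = 1
assert np.allclose(H.T @ k, v_new)        # perfect overwrite: H_t k_t = v_t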

Gated DeltaNet (Yang, Kautz & Hatamizadeh, 2024) adds element-wise decay gating on top of the delta rule:

\tilde{H}_{t-1} = \exp(\alpha_t) \odot H_{t-1}, \quad H_t = \tilde{H}_{t-1} + \beta_t \cdot k_t \left(v_t - \tilde{H}_{t-1} k_t\right)^\top, \quad o_t = H_t \, q_t

The difference from DeltaNet: before updating, the old state undergoes per-element exponential decay (α_t is input-dependent, giving each state element its own decay rate), then the delta rule update is applied, and finally a query vector q_t reads out the output. The complete five-step process:

  1. Decay: H̃_{t-1} = exp(α_t) ⊙ H_{t-1} — apply per-element exponential decay to the old state
  2. Retrieve old value: H̃_{t-1} k_t — use k_t to query the old value from the decayed state
  3. Compute delta: v_t − H̃_{t-1} k_t — the error between the new target and the old memory
  4. Write correction: H_t = H̃_{t-1} + β_t · k_t (⋯)ᵀ — write the correction back as an outer product
  5. Read output: o_t = H_t q_t — use q_t to retrieve this layer's output from the updated state

Note that q_t and k_t both come from the same projection (attn_qkv) but serve different roles: k_t is the write address (it determines where in the state matrix a value is stored), while q_t is the read address (it determines what information is retrieved as output). Separating the two allows writing and reading to focus on different features — the same Q/K division of labor as in standard Attention.

Decay gating lets the model actively forget information that is no longer needed (exp(α_t) < 1 makes the old state gradually decay), while the delta rule ensures precise overwrites for information that should be remembered — flexible forgetting plus precise overwriting is the core advantage of GatedDeltaNet over pure Mamba.
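
The five steps map almost line-for-line onto code. Below is a minimal per-head numpy sketch of one decode step (illustrative; the real kernel is chunked and batched, and the decay may be applied per element rather than as a single scalar):

import numpy as np

def gated_delta_step(H, q, k, v, beta, decay):
    """One GatedDeltaNet recurrence step for a single head.

    H:     [d_k, d_v]  state matrix (associative memory)
    q, k:  [d_k]       read / write addresses (assumed L2-normalized)
    v:     [d_v]       new value for this token
    beta:  scalar      write strength in [0, 1]
    decay: scalar      exp(alpha_t) in (0, 1)
    """
    H = decay * H                         # 1. decay the old state
    old = H.T @ k                         # 2. retrieve the old value for key k
    delta = v - old                       # 3. correction toward the new value
    H = H + beta * np.outer(k, delta)     # 4. write the correction as an outer product
    out = H.T @ q                         # 5. read this layer's output with query q
    return H, out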

2.2 Complete Data Flow and Input Processing

Section 2.1 presented the complete Gated DeltaNet formula (five steps: decay → retrieve → delta → write → read), but there is a gap between the formulas and the actual model: which GGUF tensor corresponds to which step? What transformations does data undergo from input to output? This section provides the overall flow diagram; subsequent sections (2.3–2.7) expand on each step in detail.

Forward Pass Flow

────────────────── GatedDeltaNet Layer Forward Pass ──────────────────

① Three parallel projections
   x ─→ attn_qkv  [2048→8192] ─→ qkv [8192]
   x ─→ attn_gate [2048→4096] ─→ z [4096]      (output gate)
   x ─→ ssm_ba    [2048→64]   ─→ β+α [64]      (decay + write)

② Conv1d temporal fusion
   [conv_state; qkv] ─→ Conv1d(k=4) ─→ SiLU ─→ qkv'
    ↑ cached last 3 tokens

③ Split + preprocessing
   qkv' → Q [128×16] + K [128×16] + V [128×32]
   Q ← L2Norm(Q) / √d_v       → q_t in formula
   K ← L2Norm(K)              → k_t in formula
   Q, K repeat 16→32 heads    → align with V's 32 heads
   β ← sigmoid(β_raw)         → β_t ∈ [0,1] in formula
   gate ← softplus(α + dt_bias) × (-A)
                              → exp(α_t) in formula

④ Delta Rule state update
   H̃ = exp(gate) ⊙ H_{t-1}         ← delta_state [128,4096]
   H_t = H̃ + β·k_t(v_t − H̃k_t)ᵀ
   out_t = H_t · q_t

⑤ Output gating
   output = ssm_out( ssm_norm(out) ⊙ SiLU(z) )

──────────────────────────────────────────────────────────────────────

Tensor-to-Formula Mapping

GGUF Tensor   Shape          Quant   Formula / Role
──────────────────────────────────────────────────────────────────────────────
attn_qkv      [2048, 8192]   Q4_K    Produces raw q, k, v (split after Conv1d)
attn_gate     [2048, 4096]   Q4_K    Output gate z, controls SSM output flow in step ⑤
ssm_conv1d    [4, 8192]      F32     Causal conv kernel, fuses last 3 tokens' context in step ②
ssm_ba        [2048, 64]     Q4_K    Produces β_t (write strength) and α_t (decay rate)
ssm_dt        [32]           F32     Time-step bias dt_bias, added to α in step ③
ssm_a         [32]           F32     −exp(A_log), ensures exp(gate) < 1 for decay
ssm_norm      [128]          F32     Output RMSNorm in step ⑤ (note: NOT the zero-centered variant)
ssm_out       [4096, 2048]   Q4_K    Projects gated output back to hidden dim in step ⑤

Input Projections (Step ①)

Each layer performs three parallel projections on the input x:

Projection   Dimensions    Quant   Output split
─────────────────────────────────────────────────────────────────
attn_qkv     2048 → 8192   Q4_K    Q [128×16] + K [128×16] + V [128×32]
attn_gate    2048 → 4096   Q4_K    Gate vector z [128×32]
ssm_ba       2048 → 64     Q4_K    Beta + Alpha combined (grouped by K-head)

Note the asymmetric head structure: Q heads = K heads = 16, V heads = 32, head_dim = 128. Q and K share the same head count, but V has 2x more.

This creates a mismatch: the delta rule state update operates per head, so the Q/K/V head counts must align. The solution is to repeat Q and K from 16 heads to 32 heads — each Q/K head is replicated to serve two V heads. This mirrors the GQA approach from full attention: fewer K heads are shared across multiple "working heads," saving Q/K projection parameters.

After expansion, the model has 32 independent delta rule states, each H^(i) ∈ ℝ^{128×128}, for a total state size of 32 × 128 × 128 = 128 × 4096. More V heads mean larger state capacity, while Q/K serve only as "addresses" and need fewer parameters.
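
A sketch of the split-and-repeat step in numpy (the exact Q/K/V ordering inside the 8192-dim projection is an assumption here; only the head counts and dimensions come from the model config):

import numpy as np

HEAD_DIM, N_QK_HEADS, N_V_HEADS = 128, 16, 32   # 16*128 + 16*128 + 32*128 = 8192

def split_and_expand(qkv):
    # qkv: [8192] output of attn_qkv after Conv1d + SiLU (assumed Q|K|V layout)
    q = qkv[:2048].reshape(N_QK_HEADS, HEAD_DIM)
    k = qkv[2048:4096].reshape(N_QK_HEADS, HEAD_DIM)
    v = qkv[4096:].reshape(N_V_HEADS, HEAD_DIM)
    rep = N_V_HEADS // N_QK_HEADS                # = 2: each Q/K head serves two V heads
    q = np.repeat(q, rep, axis=0)                # [32, 128]
    k = np.repeat(k, rep, axis=0)                # [32, 128]
    return q, k, v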

Why Conv1d? (Step ②)

The attn_qkv projection produces an 8192-dimensional QKV vector that contains information only from the current token x_t. But in the delta rule state update (step ④), the quality of k_t and v_t directly determines how good the key-value associations written to the state matrix are. If k_t and v_t depend entirely on a single token, expressiveness is limited.

Conv1d addresses this: with a kernel=4 causal convolution, the current token's QKV incorporates information from the previous 3 tokens before entering the delta rule:

\text{qkv}' = \text{SiLU}\!\left(\text{Conv1d}_{\text{depthwise}}\!\left([\underbrace{\text{conv\_state}}_{\text{last 3 tokens}};\; \underbrace{\text{qkv}}_{\text{current token}}],\; \text{kernel}=4\right)\right)

This provides two key benefits:

  1. Lossless short-range context: the last 3 tokens' information is injected directly into Q/K/V via the convolution, bypassing the lossy compression of the state matrix H. For dependencies within the 4-token window, the model has access to precise, original information.
  2. Complementary to recurrence: Conv1d handles precise short-range dependencies (the 4-token window), while the delta rule handles long-range dependencies (compressed via the state matrix H) — a clear division of labor.

ssm_conv1d.weight [4, 8192] is a depthwise convolution kernel — each of the 8192 channels has its own independent 4-element kernel. It doesn't mix information across Q/K/V channels; it only fuses along the time dimension. At runtime, a conv_state [3, 8192] cache (see section 2.7) stores the previous 3 tokens' qkv values.

This project → conv1d → recurrence three-stage pipeline is inherited from Mamba — the same pattern introduced in the SSM article. After Conv1d + SiLU, the output is split into Q, K, V vectors for the subsequent normalization and state update.
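
A decode-time sketch of the depthwise causal convolution (illustrative; the kernel orientation and state layout are assumptions):

import numpy as np

def silu(u):
    return u / (1.0 + np.exp(-u))

def conv1d_step(qkv, conv_state, kernel):
    """One decode step of the depthwise causal conv.

    qkv:        [8192]     current token's raw QKV projection
    conv_state: [3, 8192]  cached QKV of the previous 3 tokens
    kernel:     [4, 8192]  one independent 4-tap filter per channel (ssm_conv1d.weight)
    """
    window = np.vstack([conv_state, qkv[None, :]])   # [4, 8192]: 3 past tokens + current
    mixed = (window * kernel).sum(axis=0)            # per-channel temporal mixing, no cross-channel mixing
    new_state = window[1:]                           # slide the 3-token cache forward
    return silu(mixed), new_state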

2.3 Q/K Normalization and Beta Activation

Formula mapping: preprocessing q_t, k_t (normalization) and β_t ∈ [0,1] (activation) — preparing the variables before they enter the delta rule.

After Conv1d splits out Q, K, V, several additional processing steps are needed before entering the delta rule (a small sketch follows the list):

  • L2 normalization: Q and K are each L2-normalized, projecting them onto the unit sphere to stabilize the delta rule's key-value associations
  • Q scaling: Q ← Q / √d_v, preventing dot products from growing with dimension
  • Beta sigmoid: β_t ← σ(β_t), constraining the write strength to [0, 1]
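
In code, these three steps are just a few lines (illustrative numpy):

import numpy as np

def preprocess(q, k, beta_raw, d_v=128, eps=1e-6):
    q = q / (np.linalg.norm(q) + eps) / np.sqrt(d_v)   # L2-normalize Q, then scale by 1/sqrt(d_v)
    k = k / (np.linalg.norm(k) + eps)                   # L2-normalize K onto the unit sphere
    beta = 1.0 / (1.0 + np.exp(-beta_raw))              # sigmoid: write strength in [0, 1]
    return q, k, beta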

2.4 Decay Gating

Formula mapping: step 1's H̃_{t-1} = exp(α_t) ⊙ H_{t-1} — constructing the decay coefficient exp(α_t) that controls per-element forgetting of the old state.

The decay gate α_t and the write strength β_t come from a single combined projection:

Parameter   Shape       Meaning
────────────────────────────────────────────────────────────────────
ssm_ba      2048 → 64   Combined beta+alpha projection (grouped by K-head)
ssm_dt      [32]        Time-step bias (learnable)
ssm_a       [32]        State decay coefficient, stored as −exp(A_log)

The 64-dimensional output of ssm_ba is split by K-head groups: 16 K-heads × 4 dims = 64, where each group of 4 contains 2 beta values (for the 2 V-heads mapped to that K-head) and 2 alpha values. After splitting, both beta and alpha have 32 values (one per V-head).

\text{gate}_t = \text{softplus}(\alpha_t + \text{dt\_bias}) \times (-A)

This ensures exp(gate_t) < 1 for exponential decay, similar to Mamba's Δ parameter.
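
A sketch of the gate computation (illustrative; ssm_a is the raw stored tensor, i.e. already negative):

import numpy as np

def softplus(u):
    return np.log1p(np.exp(u))

def decay_factor(alpha, dt_bias, ssm_a):
    # alpha, dt_bias, ssm_a: [32] (one per V-head)
    # ssm_a stores -exp(A_log) < 0, so the gate is negative and exp(gate) lies in (0, 1)
    gate = softplus(alpha + dt_bias) * ssm_a
    return np.exp(gate)                      # per-head decay factor applied to the old state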

2.5 Chunked Parallel Computation

Formula mapping: executing steps 1–5 — the decay, the state update H_t = H̃ + β_t k_t(v_t − H̃k_t)ᵀ, and the output o_t = H_t q_t.

Autoregressive mode (decode, 1 token at a time) directly updates the state:

\tilde{H} = \exp(\text{gate}_t) \odot H_{t-1}, \quad H_t = \tilde{H} + \beta_t \cdot k_t (v_t - \tilde{H} k_t)^\top, \quad \text{out}_t = H_t \cdot q_t

Chunked parallel mode (prefill) splits the sequence into chunks of 64 tokens with decayed causal attention within each chunk:

M_{ij} = \exp\!\left(\sum_{l=j+1}^{i} \text{gate}_l\right), \quad \text{intra\_chunk} = (M \odot Q K^\top) \cdot V

State propagates across chunks, similar to Mamba-2's SSD algorithm.
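
A sketch of the intra-chunk part of this computation, treating the decay gate as a per-token scalar for one head (illustrative; the real algorithm also propagates and corrects the state across chunks):

import numpy as np

def intra_chunk(Q, K, V, gates):
    # Q, K: [T, d_k]; V: [T, d_v]; gates: [T] per-token log-decay (negative values), T <= 64
    cum = np.cumsum(gates)
    # M[i, j] = exp(sum_{l=j+1..i} gate_l) for j <= i, 0 above the diagonal (causal)
    M = np.tril(np.exp(cum[:, None] - cum[None, :]))
    return (M * (Q @ K.T)) @ V               # decayed causal attention within the chunk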

2.6 Output Gating

Formula mapping: post-processing after o_t = H_t q_t — the raw delta rule output needs gating and projection to become the layer output.

SSM output goes through RMSNorm then element-wise multiplication with the SiLU-activated gate:

\text{output} = \text{RMSNorm}(\text{attn\_out}) \odot \text{SiLU}(z)

Finally, the result is projected back to the hidden dimension via ssm_out (4096 → 2048).

2.7 Cache

Formula mapping: persistent storage of the state matrix H_t and the Conv1d history across tokens — the foundation that lets the recurrence carry information across tokens.

Each layer maintains two fixed-size states:

Cache type    Shape                Size (f32)   Purpose
──────────────────────────────────────────────────────────────────────────
delta_state   [128, 4096, batch]   ~2 MB/seq    Recurrent state matrix H_t
conv_state    [3, 8192, batch]     ~96 KB/seq   Conv1d history (last 3 tokens)

Unlike full attention's KV cache (which grows linearly with sequence length), delta_state is fixed size — the core SSM advantage.
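
The per-sequence sizes in the table follow directly from the shapes (f32 = 4 bytes):

delta_state_bytes = 128 * 4096 * 4   # = 2,097,152 bytes  ~ 2 MB per sequence, per layer
conv_state_bytes = 3 * 8192 * 4      # =    98,304 bytes  ~ 96 KB per sequence, per layer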


3. Gated Full Attention Layers

The full attention layers (every 4th layer) use sigmoid-gated GQA with several key differences from standard Transformer attention.

3.1 Gated Q Projection

The Q projection outputs 2× the head_dim per head — half for the query, half for a gate:

\text{attn\_q}(x) \in \mathbb{R}^{8192} \to \underbrace{Q \in \mathbb{R}^{256 \times 16}}_{\text{query}} + \underbrace{G \in \mathbb{R}^{256 \times 16}}_{\text{gate}}

After attention computation, the output is multiplied by sigmoid of the gate:

\text{attn\_out} = \text{Attention}(Q, K, V) \odot \sigma(G)
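
A sketch of the split and gating (illustrative; whether the query and gate halves are interleaved per head or stacked is an assumption):

import numpy as np

N_HEADS, HEAD_DIM = 16, 256

def split_query_and_gate(q_proj):
    # q_proj: [8192] = 16 heads x 512; assume each head's 512 dims = [256 query | 256 gate]
    per_head = q_proj.reshape(N_HEADS, 2 * HEAD_DIM)
    return per_head[:, :HEAD_DIM], per_head[:, HEAD_DIM:]

def gate_attention_output(attn_out, gate):
    # attn_out, gate: [16, 256]; the sigmoid of the gate modulates each head's output
    return attn_out * (1.0 / (1.0 + np.exp(-gate)))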

3.2 GQA: 16 Q Heads, 2 KV Heads

Projection   Dimensions   Quantization   Head config
───────────────────────────────────────────────────────────────
attn_k       2048 → 512   Q4_K           2 heads × 256 head_dim
attn_v       2048 → 512   Q6_K           2 heads × 256 head_dim

This is 8:1 GQA (Grouped-Query Attention) — 16 Q heads share 2 KV groups.

3.3 Partial RoPE

Positional encoding applies to only the first 64 of 256 head dimensions:

\text{partial\_rotary\_factor} = \frac{64}{256} = 0.25

75% of the head dimensions remain position-agnostic, focusing on semantic content. RoPE uses the NeoX layout with θ = 5×10^6.
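
A sketch of partial RoPE in the NeoX (non-interleaved) layout for one head (illustrative; the frequency indexing follows the usual RoPE convention and is not taken from the model code):

import numpy as np

def partial_rope(x, pos, rot_dims=64, theta=5e6):
    # x: [256] one head; only the first rot_dims dimensions are rotated
    half = rot_dims // 2
    freqs = theta ** (-np.arange(half) / half)        # standard RoPE frequency schedule
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half], x[half:rot_dims]               # NeoX layout: pair dim i with dim i + half
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[rot_dims:]])    # remaining 192 dims stay position-agnostic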

3.4 QK Norm

Both Q and K pass through RMSNorm before RoPE:

Qβ€²=RoPE(RMSNorm(Q)),Kβ€²=RoPE(RMSNorm(K))Q' = \text{RoPE}(\text{RMSNorm}(Q)), \quad K' = \text{RoPE}(\text{RMSNorm}(K))

This is essential for stabilizing attention scores at head_dim = 256.


4. MoE: 512 Expert Top-10 Routing

All 48 layers use the same MoE FFN structure.

4.1 Router

A full-precision (F32) linear projection maps hidden states to 512-dimensional expert space:

\text{logits} = W_{\text{router}} \cdot x, \quad \text{weights} = \text{softmax}(\text{logits}), \quad \text{selected} = \text{TopK}(\text{weights}, k=10)

Selected routing weights are renormalized (norm_top_k_prob=true):

w_i' = \frac{w_i}{\sum_{j \in \text{TopK}} w_j}
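
A sketch of the routing step (illustrative numpy; W_router here stands for the ffn_gate_inp weight):

import numpy as np

def route(x, W_router, k=10):
    # x: [2048], W_router: [512, 2048] -> logits over 512 experts
    logits = W_router @ x
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                        # softmax over all 512 experts
    top = np.argsort(weights)[-k:]                  # indices of the top-10 experts
    top_w = weights[top] / weights[top].sum()       # renormalize (norm_top_k_prob=true)
    return top, top_w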

4.2 Expert FFN

Each expert is a small SiLU-gated FFN:

\text{expert}_i(x) = W_{\text{down}}^{(i)} \left[ \text{SiLU}\!\left(W_{\text{gate}}^{(i)} x\right) \odot W_{\text{up}}^{(i)} x \right]

Projection      Dimensions    Quant   Per-layer size (×512)
──────────────────────────────────────────────────────────────
ffn_gate_exps   2048 → 512    Q4_K    288 MB
ffn_up_exps     2048 → 512    Q4_K    288 MB
ffn_down_exps   512 → 2048    Q6_K    420 MB
Total                                 996 MB

4.3 Shared Expert

A DeepSeek-style shared expert that all tokens pass through, with sigmoid gating:

\text{shared\_out} = \sigma(W_{\text{gate}} \cdot x) \cdot \text{FFN}_{\text{shared}}(x)

Final MoE output = weighted sum of routed experts + shared expert output.
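
Putting the router, the routed experts, and the shared expert together (illustrative sketch; it reuses the route() helper from the routing sketch in section 4.1, and all weight names are placeholders):

import numpy as np

def silu(u):
    return u / (1.0 + np.exp(-u))

def expert_ffn(x, W_gate, W_up, W_down):
    # SiLU-gated FFN: down( silu(gate(x)) * up(x) )
    return W_down @ (silu(W_gate @ x) * (W_up @ x))

def moe_forward(x, expert_weights, W_router, shared_weights, w_shared_gate, k=10):
    top, top_w = route(x, W_router, k)                     # see the routing sketch in 4.1
    routed = sum(w * expert_ffn(x, *expert_weights[i]) for w, i in zip(top_w, top))
    g = 1.0 / (1.0 + np.exp(-(w_shared_gate @ x)))         # sigmoid gate on the shared expert
    return routed + g * expert_ffn(x, *shared_weights)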

4.4 Sparsity Analysis

Metric             Qwen3-CN   DeepSeek-V3   Mixtral
─────────────────────────────────────────────────────
Expert count       512        256           8
Top-K              10         8             2
Activation ratio   1.95%      3.1%          25%
Total params       79.7B      671B          47B
Active params      ~3B        ~37B          ~13B

5. Layer Structure and Data Flow

5.1 Full Forward Pass

Each layer follows a Pre-Norm dual-residual structure:

h_1 = x + \text{Attention}(\text{RMSNorm}_{\text{attn}}(x))
h_2 = h_1 + \text{MoE}(\text{RMSNorm}_{\text{post}}(h_1))

Input x [2048]
  │
  ├─ RMSNorm (attn_norm)
  │    ├─ QKV Projection → Conv1d(k=4) + SiLU → Q, K, V
  │    ├─ Gate Projection → z
  │    ├─ Beta/Alpha → β, gate
  │    ├─ State Update: H̃ = exp(gate) ⊙ H_prev;  H = H̃ + β·k(v − H̃k)ᵀ
  │    ├─ Output: out = H · q
  │    ├─ Gated: RMSNorm(out) ⊙ SiLU(z)
  │    └─ SSM Out Projection [4096 → 2048]
  │
  ├─ Residual: h₁ = x + attn_out
  │
  ├─ RMSNorm (post_attention_norm)
  │    ├─ Router [2048 → 512] → Softmax → TopK(10)
  │    ├─ 10× Expert FFN: SiLU(gate·x) ⊙ up·x → down
  │    ├─ Weighted Sum (normalized weights)
  │    ├─ Shared Expert FFN + Sigmoid Gate
  │    └─ moe_out = routed + shared
  │
  └─ Residual: h₂ = h₁ + moe_out → Output [2048]

5.2 Zero-Centered RMSNorm

All RMSNorm layers (except internal ssm_norm) use the zero-centered variant:

\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot (1 + w)

The weights are stored with the +1 offset already applied, so w in the formula is the original zero-centered weight. When the original weight is 0, the scale is exactly 1 — a property that aids training stability.
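
A sketch of the zero-centered variant, with w taken to be the original zero-centered weight (illustrative):

import numpy as np

def zero_centered_rmsnorm(x, w, eps=1e-6):
    # w is the zero-centered weight; the effective scale is (1 + w)
    rms = np.sqrt(np.mean(x * x) + eps)
    return x / rms * (1.0 + w)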

5.3 Embedding and Output

Component     Dimensions       Quantization
─────────────────────────────────────────────
token_embd    151936 → 2048    Q4_K
output_norm   [2048]           F32
output        2048 → 151936    Q6_K

6. Quantization and Weight Distribution

6.1 Q4_K_M Mixed Quantization

Component                    Quantization   Reason
────────────────────────────────────────────────────────────────────────────
Expert gate/up               Q4_K           Most numerous, aggressive compression
Expert down, attn_v          Q6_K           Directly affects residual precision
Router, Norms                F32            Precision-critical
SSM params (A, dt, conv1d)   F32            Decay rates need high precision

6.2 Weight Distribution: 97% Is MoE Experts

Per-layer breakdown:

Component                                       Per-layer size   Share
────────────────────────────────────────────────────────────────────────
MoE Experts (512)                               ~996 MB          97%
Dense (Attention/SSM + Shared Expert + Norms)   ~25 MB           3%

Full model:

Component               Size       Share
───────────────────────────────────────────
48 layers MoE experts   ~47.8 GB   96%
48 layers dense         ~1.2 GB    2.4%
Embedding + Output      ~0.4 GB    0.8%
Total                   ~49.4 GB

Dense weights total only ~1.2 GB — they easily fit on any GPU. The MoE experts dominate and are the primary target for GPU/CPU offloading strategies.


7. Comparison with Other Hybrid Models

Feature            Jamba 52B         Zamba2 2.7B   Hymba 1.5B       Qwen3-CN 80B
───────────────────────────────────────────────────────────────────────────────────
SSM type           Mamba             Mamba2        Mamba2           GatedDeltaNet
Fusion             Interleaved 7:1   Shared Attn   Parallel heads   Interleaved 3:1
MoE                16E, top-2        None          None             512E, top-10
Total params       52B               2.7B          1.5B             79.7B
Active params      12B               2.7B          1.5B             ~3B
Attn layer share   12.5%             ~15%          50%              25%
State type         Vector            Vector        Vector+KV        Matrix

Summary

Qwen3-Coder-Next 80B deeply fuses SSM, Attention, and MoE:

Component                  Technical choice                                Role
─────────────────────────────────────────────────────────────────────────────────────────────────────────
36 GatedDeltaNet layers    Matrix state + delta rule + exp decay gating    Linear-complexity long-sequence modeling
12 Full Attention layers   Sigmoid gating + 8:1 GQA + Partial RoPE         Precise token retrieval and ICL
48 MoE layers              512 experts top-10 + shared expert              Parameter efficiency: 80B params, 3B compute

This design reflects the current trend: combining complementary mechanisms rather than pushing a single one to its extreme. Linear attention provides efficiency, full attention provides precision, and MoE provides parameter-compute decoupling — each compensating for the others' weaknesses.

Further Reading