
Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge

Updated 2026-04-13

The previous three articles introduced MoE, SSM, and Hybrid architectures as separate concepts. This article uses a real production model — Qwen3-Coder-Next 80B — to show how all three work together in a single architecture.

Qwen3-Coder-Next is a 79.7B-parameter Hybrid MoE model that activates only ~3B parameters per token (3.8% utilization). It fuses GatedDeltaNet (a delta-rule-based linear attention), standard full attention, and a 512-expert MoE across 48 layers — one of the most complex hybrid architectures to date.


1. Architecture Overview

1.1 Key Parameters

Parameter             Value                                    Notes
──────────────────────────────────────────────────────────────────────────────────────────────────
Total layers          48                                       block_count=48
Layer types           36 recurrent + 12 full attention         3:1 alternating
Hidden dim            2048                                     Much smaller than typical LLMs (LLaMA-70B uses 8192)
Total params          79.7B
Active params/token   ~3B                                      Only 3.8% activated
MoE config            512 experts, top-10 + 1 shared expert
Expert FFN dim        512                                      Small but numerous
Context length        262,144                                  256K tokens

1.2 The 3:1 Hybrid Layer Pattern

The 48 layers follow a fixed alternating pattern — every 4 layers form a cycle with 3 GatedDeltaNet (linear attention) layers followed by 1 full attention layer:

Layer  0: GatedDeltaNet    ← recurrent
Layer  1: GatedDeltaNet
Layer  2: GatedDeltaNet
Layer  3: Full Attention    ← every 4th layer
Layer  4: GatedDeltaNet
...
Layer 47: Full Attention    ← last layer is also full attention

Compared to Jamba's 7:1 (Mamba:Attention) ratio, Qwen3-Coder-Next gives full attention a higher share (25% vs. 12.5%), likely optimized for coding tasks that require precise token retrieval.

Key design: All 48 layers share the same MoE FFN structure regardless of attention type. The layer type difference is only in the attention component.
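
A one-line way to express the layer pattern (illustrative Python, not taken from any implementation): a layer is full attention exactly when its index modulo 4 equals 3.

def layer_type(i: int) -> str:
    # 3:1 pattern: layers 3, 7, ..., 47 are full attention, the rest are GatedDeltaNet
    return "full_attention" if i % 4 == 3 else "gated_deltanet"

assert sum(layer_type(i) == "full_attention" for i in range(48)) == 12  # 12 attention layers
assert sum(layer_type(i) == "gated_deltanet" for i in range(48)) == 36  # 36 recurrent layers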

1.3 GGUF Tensor View: What Does a Layer Look Like?

In llama.cpp's GGUF format, each layer consists of a set of named tensors. Below are two representative blocks from the actual GGUF file — one GatedDeltaNet layer and one Full Attention layer — giving you a concrete view of each layer's "physical composition."

Block 0 (GatedDeltaNet layer):

Tensor                            Shape             Quant   Role
───────────────────────────────────────────────────────────────────
  # ── Attention/SSM ──
blk.0.attn_norm.weight           [2048]            F32     Pre-attention RMSNorm
blk.0.attn_qkv.weight           [2048, 8192]      Q4_K    Combined Q+K+V projection
blk.0.attn_gate.weight           [2048, 4096]      Q4_K    Output gate vector z
blk.0.ssm_ba.weight              [2048, 64]        Q4_K    Combined Ξ²+Ξ± projection
blk.0.ssm_conv1d.weight          [4, 8192]         F32     1D causal conv kernel
blk.0.ssm_dt                     [32]              F32     Time-step bias (dt_bias)
blk.0.ssm_a                      [32]              F32     State decay coeff (-exp(A))
blk.0.ssm_norm.weight            [128]             F32     SSM output RMSNorm
blk.0.ssm_out.weight             [4096, 2048]      Q4_K    Output projection

  # ── MoE (same structure in all layers) ──
blk.0.post_attention_norm.weight [2048]            F32     Pre-MoE RMSNorm
blk.0.ffn_gate_inp.weight        [2048, 512]       F32     Router (scores 512 experts)
blk.0.ffn_gate_exps.weight       [2048, 512, 512]  Q4_K    512 Expert gate projections
blk.0.ffn_up_exps.weight         [2048, 512, 512]  Q4_K    512 Expert up projections
blk.0.ffn_down_exps.weight       [512, 2048, 512]  Q6_K    512 Expert down projections
blk.0.ffn_gate_shexp.weight      [2048, 512]       Q4_K    Shared Expert gate
blk.0.ffn_up_shexp.weight        [2048, 512]       Q4_K    Shared Expert up
blk.0.ffn_down_shexp.weight      [512, 2048]       Q6_K    Shared Expert down
blk.0.ffn_gate_inp_shexp.weight  [2048, 1]         F16     Shared Expert sigmoid gate

Block 3 (Full Attention layer):

Tensor                            Shape             Quant   Role
───────────────────────────────────────────────────────────────────
  # ── Attention (entirely different from Block 0) ──
blk.3.attn_norm.weight           [2048]            F32     Pre-attention RMSNorm
blk.3.attn_q.weight              [2048, 8192]      Q4_K    Q projection (includes gate)
blk.3.attn_k.weight              [2048, 512]       Q4_K    K projection (2 heads × 256)
blk.3.attn_v.weight              [2048, 512]       Q6_K    V projection (2 heads × 256)
blk.3.attn_q_norm.weight         [256]             F32     Q RMSNorm
blk.3.attn_k_norm.weight         [256]             F32     K RMSNorm
blk.3.attn_output.weight         [4096, 2048]      Q4_K    Attention output projection

  # ── MoE (identical to Block 0, omitted) ──
blk.3.ffn_*                      ...                       (same as above)

Key differences between the two layer types:

Difference       GatedDeltaNet (Block 0)              Full Attention (Block 3)
──────────────────────────────────────────────────────────────────────────────────
QKV projection   Combined attn_qkv [2048→8192]        Separate attn_q/k/v
Gating           attn_gate (output gate z)            Sigmoid gate built into Q
Unique tensors   ssm_* (6 SSM-specific tensors)       attn_q/k_norm (QK normalization)
Output proj      ssm_out [4096→2048]                  attn_output [4096→2048]
F32 precision    ssm_a, ssm_dt, ssm_conv1d all F32    attn_v uses Q6_K

Note that the MoE tensors (ffn_* prefix) are identical across both layer types — this is the concrete manifestation of the design principle from section 1.2: "the layer type difference is only in the attention component."


2. GatedDeltaNet: Linear Attention Layers

The 36 recurrent layers use GatedDeltaNet rather than standard Mamba — a linear attention mechanism based on the delta rule, with a matrix state instead of a vector one.

2.1 From Mamba to Delta Rule

Recall Mamba's state update from the SSM article:

h_t = \bar{A}_t \cdot h_{t-1} + \bar{B}_t \cdot x_t

The state h_t is a vector in ℝ^N with additive updates.

DeltaNet (Yang et al., 2024) promotes the state to a matrix H_t ∈ ℝ^{d_k×d_v} with delta rule updates:

H_t = (I - \beta_t k_t k_t^\top) H_{t-1} + \beta_t v_t k_t^\top

The state matrix H can be viewed as an associative memory — a "fuzzy key-value store." Expanding the formula into its equivalent delta form makes the update semantics clearer:

H_t = H_{t-1} + \beta_t \cdot k_t \left(v_t - H_{t-1} k_t\right)^\top

Each term has a clear meaning:

  • H_{t-1} k_t ∈ ℝ^{d_v}: retrieve from the state matrix using the current key k_t — the "old value currently associated with k_t in memory"
  • v_t ∈ ℝ^{d_v}: the new value from the current token's input projection — the target we want to associate with k_t
  • v_t − H_{t-1} k_t: the delta (correction) — the error between the new target and the old memory
  • β_t · k_t (⋯)ᵀ: write this correction back into the state matrix as an outer product

This is where the "delta rule" name comes from — instead of writing the new value v_t directly, it writes only the correction v_t − H_{t-1} k_t. Note that v_t and H_{t-1} k_t are both ℝ^{d_v} vectors but with entirely different meanings: the former comes from the current input projection, the latter is an old value retrieved from the compressed historical state.

Why does H_{t-1} k_t retrieve the "old value"?

If you're familiar with standard Attention (softmax(QKᵀ)·V), you might wonder: don't you first compute QK similarity, then multiply by V to get the value? Why does multiplying H by k directly produce a value?

The key is how H is constructed. Take the simplest additive update: H_t = H_{t-1} + k_t v_tᵀ. Unrolled, H is the sum of outer products of all historical key-value pairs:

H_{t-1} = \sum_{i=1}^{t-1} k_i \, v_i^\top

When you query with the current k_t:

H_{t-1} k_t = \left(\sum_{i=1}^{t-1} k_i v_i^\top\right) k_t = \sum_{i=1}^{t-1} \underbrace{(k_i^\top k_t)}_{\text{key similarity}} \, v_i

The right-hand side is a weighted sum of all historical values, weighted by key-key dot-product similarity — essentially the same operation as standard Attention's ∑ softmax(qᵀk_i)·v_i, except: (1) there is no softmax normalization; (2) all historical K/V are compressed into a fixed-size matrix H, so retrieval is a single matrix-vector multiply at O(1) cost instead of O(n).
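
A small numpy sketch of this identity (illustrative only; the dimensions are made up): build H as a sum of outer products and check that querying it returns the key-similarity-weighted sum of stored values.

import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, n = 8, 4, 5
keys = [rng.standard_normal(d_k) for _ in range(n)]
vals = [rng.standard_normal(d_v) for _ in range(n)]

# Additive associative memory: H = sum_i k_i v_i^T  (shape [d_k, d_v])
H = sum(np.outer(k, v) for k, v in zip(keys, vals))

q = rng.standard_normal(d_k)                              # query key
retrieved = H.T @ q                                       # one matrix-vector multiply, O(1) in sequence length
expected = sum((k @ q) * v for k, v in zip(keys, vals))   # explicit similarity-weighted sum, O(n)
assert np.allclose(retrieved, expected)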

Comparing the two update strategies:

Strategy           Update rule                 Effect
────────────────────────────────────────────────────────────────────────────────────────────
Additive (Mamba)   H += v·kᵀ                   Old and new values accumulate, information blurs
Delta rule         H += β·k(v_new − Hk)ᵀ       Old value corrected toward the new value, precise overwrite

When β_t = 1, the update yields H_t k_t = v_t — a perfect overwrite. This gives DeltaNet an advantage over Mamba in tasks requiring precise recall of recent key-value pairs (e.g., in-context recall).
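
A companion sketch of the delta-rule write (again illustrative): with a unit-norm key and β = 1, the value associated with that key is overwritten exactly.

import numpy as np

rng = np.random.default_rng(1)
d_k, d_v = 8, 4
H = rng.standard_normal((d_k, d_v))       # arbitrary existing memory state
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)                    # unit-norm key (GatedDeltaNet L2-normalizes keys)
v_new = rng.standard_normal(d_v)

old = H.T @ k                             # value currently associated with k
H = H + np.outer(k, v_new - old)          # delta-rule write with beta = 1
assert np.allclose(H.T @ k, v_new)        # perfect overwrite: H_t k_t = v_t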

Gated DeltaNet (Yang, Kautz & Hatamizadeh, 2024) adds element-wise decay gating on top of the delta rule:

\tilde{H}_{t-1} = \exp(\alpha_t) \odot H_{t-1}, \quad H_t = \tilde{H}_{t-1} + \beta_t \cdot k_t \left(v_t - \tilde{H}_{t-1} k_t\right)^\top, \quad o_t = H_t \, q_t

The difference from DeltaNet: before updating, the old state undergoes per-element exponential decay (α_t is input-dependent, giving each state element its own decay rate), then the delta rule update is applied, and finally a query vector q_t reads out the output. The complete five-step process:

  1. Decay: H̃_{t-1} = exp(α_t) ⊙ H_{t-1} — apply per-element exponential decay to the old state
  2. Retrieve old value: H̃_{t-1} k_t — use k_t to query the old value from the decayed state
  3. Compute delta: v_t − H̃_{t-1} k_t — the error between the new target and the old memory
  4. Write correction: H_t = H̃_{t-1} + β_t · k_t (⋯)ᵀ — write the correction back as an outer product
  5. Read output: o_t = H_t q_t — use q_t to retrieve this layer's output from the updated state

Note that q_t and k_t both come from the same projection (attn_qkv) but serve different roles: k_t is the write address (it determines where in the state matrix a value is stored), while q_t is the read address (it determines what information is retrieved as output). Separating the two allows writing and reading to focus on different features — the same Q/K division of labor as in standard Attention.

Decay gating lets the model actively forget information that is no longer needed (exp(α_t) < 1 makes the old state gradually decay), while the delta rule ensures precise overwrites for information that should be remembered — flexible forgetting plus precise overwriting is the core advantage of GatedDeltaNet over pure Mamba.
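
The five steps map almost line-for-line onto code. Below is a minimal per-head numpy sketch of one decode step (illustrative; the real kernel is chunked and batched, and the decay may be applied per element rather than as a single scalar):

import numpy as np

def gated_delta_step(H, q, k, v, beta, decay):
    """One GatedDeltaNet recurrence step for a single head.

    H:     [d_k, d_v]  state matrix (associative memory)
    q, k:  [d_k]       read / write addresses (assumed L2-normalized)
    v:     [d_v]       new value for this token
    beta:  scalar      write strength in [0, 1]
    decay: scalar      exp(alpha_t) in (0, 1)
    """
    H = decay * H                         # 1. decay the old state
    old = H.T @ k                         # 2. retrieve the old value for key k
    delta = v - old                       # 3. correction toward the new value
    H = H + beta * np.outer(k, delta)     # 4. write the correction as an outer product
    out = H.T @ q                         # 5. read this layer's output with query q
    return H, out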

2.2 Complete Data Flow and Input Processing

Section 2.1 presented the complete Gated DeltaNet formula (five steps: decay → retrieve → delta → write → read), but there is a gap between the formulas and the actual model: which GGUF tensor corresponds to which step? What transformations does data undergo from input to output? This section provides the overall flow diagram; subsequent sections (2.3–2.7) expand on each step in detail.

Forward Pass Flow

────────────────── GatedDeltaNet Layer Forward Pass ──────────────────

① Three parallel projections
   x ─→ attn_qkv  [2048→8192] ─→ qkv [8192]
   x ─→ attn_gate [2048→4096] ─→ z [4096]      (output gate)
   x ─→ ssm_ba    [2048→64]   ─→ β+α [64]      (decay + write)

② Conv1d temporal fusion
   [conv_state; qkv] ─→ Conv1d(k=4) ─→ SiLU ─→ qkv'
    ↑ cached last 3 tokens

③ Split + preprocessing
   qkv' → Q [128×16] + K [128×16] + V [128×32]
   Q ← L2Norm(Q) / √d_v       → q_t in formula
   K ← L2Norm(K)              → k_t in formula
   Q, K repeat 16→32 heads    → align with V's 32 heads
   β ← sigmoid(β_raw)         → β_t ∈ [0,1] in formula
   gate ← softplus(α + dt_bias) × (-A)
                              → exp(α_t) in formula

④ Delta Rule state update
   H̃ = exp(gate) ⊙ H_{t-1}         ← delta_state [128,4096]
   H_t = H̃ + β·k_t(v_t − H̃k_t)ᵀ
   out_t = H_t · q_t

⑤ Output gating
   output = ssm_out( ssm_norm(out) ⊙ SiLU(z) )

──────────────────────────────────────────────────────────────────────

Tensor-to-Formula Mapping

GGUF Tensor   Shape          Quant   Formula / Role
──────────────────────────────────────────────────────────────────────────────
attn_qkv      [2048, 8192]   Q4_K    Produces raw q, k, v (split after Conv1d)
attn_gate     [2048, 4096]   Q4_K    Output gate z, controls SSM output flow in step ⑤
ssm_conv1d    [4, 8192]      F32     Causal conv kernel, fuses last 3 tokens' context in step ②
ssm_ba        [2048, 64]     Q4_K    Produces β_t (write strength) and α_t (decay rate)
ssm_dt        [32]           F32     Time-step bias dt_bias, added to α in step ③
ssm_a         [32]           F32     −exp(A_log), ensures exp(gate) < 1 for decay
ssm_norm      [128]          F32     Output RMSNorm in step ⑤ (note: NOT the zero-centered variant)
ssm_out       [4096, 2048]   Q4_K    Projects gated output back to hidden dim in step ⑤

Input Projections (Step ①)

Each layer performs three parallel projections on the input x:

Projection   Dimensions    Quant   Output split
─────────────────────────────────────────────────────────────────
attn_qkv     2048 → 8192   Q4_K    Q [128×16] + K [128×16] + V [128×32]
attn_gate    2048 → 4096   Q4_K    Gate vector z [128×32]
ssm_ba       2048 → 64     Q4_K    Beta + Alpha combined (grouped by K-head)

Note the asymmetric head structure: Q heads = K heads = 16, V heads = 32, head_dim = 128. Q and K share the same head count, but V has 2x more.

This creates a mismatch: the delta rule state update operates per head, so the Q/K/V head counts must align. The solution is to repeat Q and K from 16 heads to 32 heads — each Q/K head is replicated to serve two V heads. This mirrors the GQA approach from full attention: fewer K heads are shared across multiple "working heads," saving Q/K projection parameters.

After expansion, the model has 32 independent delta rule states, each H^(i) ∈ ℝ^{128×128}, for a total state size of 32 × 128 × 128 = 128 × 4096. More V heads mean larger state capacity, while Q/K serve only as "addresses" and need fewer parameters.
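
A sketch of the split-and-repeat step in numpy (the exact Q/K/V ordering inside the 8192-dim projection is an assumption here; only the head counts and dimensions come from the model config):

import numpy as np

HEAD_DIM, N_QK_HEADS, N_V_HEADS = 128, 16, 32   # 16*128 + 16*128 + 32*128 = 8192

def split_and_expand(qkv):
    # qkv: [8192] output of attn_qkv after Conv1d + SiLU (assumed Q|K|V layout)
    q = qkv[:2048].reshape(N_QK_HEADS, HEAD_DIM)
    k = qkv[2048:4096].reshape(N_QK_HEADS, HEAD_DIM)
    v = qkv[4096:].reshape(N_V_HEADS, HEAD_DIM)
    rep = N_V_HEADS // N_QK_HEADS                # = 2: each Q/K head serves two V heads
    q = np.repeat(q, rep, axis=0)                # [32, 128]
    k = np.repeat(k, rep, axis=0)                # [32, 128]
    return q, k, v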

Why Conv1d? (Step ②)

The attn_qkv projection produces an 8192-dimensional QKV vector that contains information only from the current token x_t. But in the delta rule state update (step ④), the quality of k_t and v_t directly determines how good the key-value associations written to the state matrix are. If k_t and v_t depend entirely on a single token, expressiveness is limited.

Conv1d addresses this: with a kernel=4 causal convolution, the current token's QKV incorporates information from the previous 3 tokens before entering the delta rule:

\text{qkv}' = \text{SiLU}\!\left(\text{Conv1d}_{\text{depthwise}}\!\left([\underbrace{\text{conv\_state}}_{\text{last 3 tokens}};\; \underbrace{\text{qkv}}_{\text{current token}}],\; \text{kernel}=4\right)\right)

This provides two key benefits:

  1. Lossless short-range context: the last 3 tokens' information is injected directly into Q/K/V via the convolution, bypassing the lossy compression of the state matrix H. For dependencies within the 4-token window, the model has access to precise, original information.
  2. Complementary to recurrence: Conv1d handles precise short-range dependencies (the 4-token window), while the delta rule handles long-range dependencies (compressed via the state matrix H) — a clear division of labor.

ssm_conv1d.weight [4, 8192] is a depthwise convolution kernel — each of the 8192 channels has its own independent 4-element kernel. It doesn't mix information across Q/K/V channels; it only fuses along the time dimension. At runtime, a conv_state [3, 8192] cache (see section 2.7) stores the previous 3 tokens' qkv values.

This project → conv1d → recurrence three-stage pipeline is inherited from Mamba — the same pattern introduced in the SSM article. After Conv1d + SiLU, the output is split into Q, K, V vectors for the subsequent normalization and state update.
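
A decode-time sketch of the depthwise causal convolution (illustrative; the kernel orientation and state layout are assumptions):

import numpy as np

def silu(u):
    return u / (1.0 + np.exp(-u))

def conv1d_step(qkv, conv_state, kernel):
    """One decode step of the depthwise causal conv.

    qkv:        [8192]     current token's raw QKV projection
    conv_state: [3, 8192]  cached QKV of the previous 3 tokens
    kernel:     [4, 8192]  one independent 4-tap filter per channel (ssm_conv1d.weight)
    """
    window = np.vstack([conv_state, qkv[None, :]])   # [4, 8192]: 3 past tokens + current
    mixed = (window * kernel).sum(axis=0)            # per-channel temporal mixing, no cross-channel mixing
    new_state = window[1:]                           # slide the 3-token cache forward
    return silu(mixed), new_state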

2.3 Q/K Normalization and Beta Activation

Formula mapping: preprocessing q_t, k_t (normalization) and β_t ∈ [0,1] (activation) — preparing the variables before they enter the delta rule.

After Conv1d splits out Q, K, V, several additional processing steps are needed before entering the delta rule (a small sketch follows the list):

  • L2 normalization: Q and K are each L2-normalized, projecting them onto the unit sphere to stabilize the delta rule's key-value associations
  • Q scaling: Q ← Q / √d_v, preventing dot products from growing with dimension
  • Beta sigmoid: β_t ← σ(β_t), constraining the write strength to [0, 1]
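
In code, these three steps are just a few lines (illustrative numpy):

import numpy as np

def preprocess(q, k, beta_raw, d_v=128, eps=1e-6):
    q = q / (np.linalg.norm(q) + eps) / np.sqrt(d_v)   # L2-normalize Q, then scale by 1/sqrt(d_v)
    k = k / (np.linalg.norm(k) + eps)                   # L2-normalize K onto the unit sphere
    beta = 1.0 / (1.0 + np.exp(-beta_raw))              # sigmoid: write strength in [0, 1]
    return q, k, beta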

2.4 Decay Gating

Formula mapping: step 1's H̃_{t-1} = exp(α_t) ⊙ H_{t-1} — constructing the decay coefficient exp(α_t) that controls per-element forgetting of the old state.

The decay gate α_t and the write strength β_t come from a single combined projection:

Parameter   Shape       Meaning
────────────────────────────────────────────────────────────────────
ssm_ba      2048 → 64   Combined beta+alpha projection (grouped by K-head)
ssm_dt      [32]        Time-step bias (learnable)
ssm_a       [32]        State decay coefficient, stored as −exp(A_log)

The 64-dimensional output of ssm_ba is split by K-head groups: 16 K-heads × 4 dims = 64, where each group of 4 contains 2 beta values (for the 2 V-heads mapped to that K-head) and 2 alpha values. After splitting, both beta and alpha have 32 values (one per V-head).

\text{gate}_t = \text{softplus}(\alpha_t + \text{dt\_bias}) \times (-A)

This ensures exp(gate_t) < 1 for exponential decay, similar to Mamba's Δ parameter.
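
A sketch of the gate computation (illustrative; ssm_a is the raw stored tensor, i.e. already negative):

import numpy as np

def softplus(u):
    return np.log1p(np.exp(u))

def decay_factor(alpha, dt_bias, ssm_a):
    # alpha, dt_bias, ssm_a: [32] (one per V-head)
    # ssm_a stores -exp(A_log) < 0, so the gate is negative and exp(gate) lies in (0, 1)
    gate = softplus(alpha + dt_bias) * ssm_a
    return np.exp(gate)                      # per-head decay factor applied to the old state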

2.5 Chunked Parallel Computation

Formula mapping: executing steps 1–5 — the decay, the state update H_t = H̃ + β_t k_t(v_t − H̃k_t)ᵀ, and the output o_t = H_t q_t.

Autoregressive mode (decode, 1 token at a time) directly updates the state:

\tilde{H} = \exp(\text{gate}_t) \odot H_{t-1}, \quad H_t = \tilde{H} + \beta_t \cdot k_t (v_t - \tilde{H} k_t)^\top, \quad \text{out}_t = H_t \cdot q_t

Chunked parallel mode (prefill) splits the sequence into chunks of 64 tokens with decayed causal attention within each chunk:

M_{ij} = \exp\!\left(\sum_{l=j+1}^{i} \text{gate}_l\right), \quad \text{intra\_chunk} = (M \odot Q K^\top) \cdot V

State propagates across chunks, similar to Mamba-2's SSD algorithm.
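
A sketch of the intra-chunk part of this computation, treating the decay gate as a per-token scalar for one head (illustrative; the real algorithm also propagates and corrects the state across chunks):

import numpy as np

def intra_chunk(Q, K, V, gates):
    # Q, K: [T, d_k]; V: [T, d_v]; gates: [T] per-token log-decay (negative values), T <= 64
    cum = np.cumsum(gates)
    # M[i, j] = exp(sum_{l=j+1..i} gate_l) for j <= i, 0 above the diagonal (causal)
    M = np.tril(np.exp(cum[:, None] - cum[None, :]))
    return (M * (Q @ K.T)) @ V               # decayed causal attention within the chunk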

2.6 Output Gating

Formula mapping: post-processing after o_t = H_t q_t — the raw delta rule output needs gating and projection to become the layer output.

SSM output goes through RMSNorm then element-wise multiplication with the SiLU-activated gate:

\text{output} = \text{RMSNorm}(\text{attn\_out}) \odot \text{SiLU}(z)

Finally, the result is projected back to the hidden dimension via ssm_out (4096 → 2048).

2.7 Cache

Formula mapping: persistent storage of the state matrix H_t and the Conv1d history across tokens — the foundation that lets the recurrence carry information across tokens.

Each layer maintains two fixed-size states:

Cache type    Shape                Size (f32)   Purpose
──────────────────────────────────────────────────────────────────────────
delta_state   [128, 4096, batch]   ~2 MB/seq    Recurrent state matrix H_t
conv_state    [3, 8192, batch]     ~96 KB/seq   Conv1d history (last 3 tokens)

Unlike full attention's KV cache (which grows linearly with sequence length), delta_state is fixed size — the core SSM advantage.
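
The per-sequence sizes in the table follow directly from the shapes (f32 = 4 bytes):

delta_state_bytes = 128 * 4096 * 4   # = 2,097,152 bytes  ~ 2 MB per sequence, per layer
conv_state_bytes = 3 * 8192 * 4      # =    98,304 bytes  ~ 96 KB per sequence, per layer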


3. Gated Full Attention Layers

The full attention layers (every 4th layer) use sigmoid-gated GQA with several key differences from standard Transformer attention.

3.1 Gated Q Projection

The Q projection outputs 2× the head_dim per head — half for the query, half for a gate:

\text{attn\_q}(x) \in \mathbb{R}^{8192} \to \underbrace{Q \in \mathbb{R}^{256 \times 16}}_{\text{query}} + \underbrace{G \in \mathbb{R}^{256 \times 16}}_{\text{gate}}

After attention computation, the output is multiplied by sigmoid of the gate:

\text{attn\_out} = \text{Attention}(Q, K, V) \odot \sigma(G)
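
A sketch of the split and gating (illustrative; whether the query and gate halves are interleaved per head or stacked is an assumption):

import numpy as np

N_HEADS, HEAD_DIM = 16, 256

def split_query_and_gate(q_proj):
    # q_proj: [8192] = 16 heads x 512; assume each head's 512 dims = [256 query | 256 gate]
    per_head = q_proj.reshape(N_HEADS, 2 * HEAD_DIM)
    return per_head[:, :HEAD_DIM], per_head[:, HEAD_DIM:]

def gate_attention_output(attn_out, gate):
    # attn_out, gate: [16, 256]; the sigmoid of the gate modulates each head's output
    return attn_out * (1.0 / (1.0 + np.exp(-gate)))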

3.2 GQA: 16 Q Heads, 2 KV Heads

Projection   Dimensions   Quantization   Head config
───────────────────────────────────────────────────────────────
attn_k       2048 → 512   Q4_K           2 heads × 256 head_dim
attn_v       2048 → 512   Q6_K           2 heads × 256 head_dim

This is 8:1 GQA (Grouped-Query Attention) — 16 Q heads share 2 KV groups.

3.3 Partial RoPE

Positional encoding applies to only the first 64 of 256 head dimensions:

\text{partial\_rotary\_factor} = \frac{64}{256} = 0.25

75% of the head dimensions remain position-agnostic, focusing on semantic content. RoPE uses the NeoX layout with θ = 5×10^6.
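
A sketch of partial RoPE in the NeoX (non-interleaved) layout for one head (illustrative; the frequency indexing follows the usual RoPE convention and is not taken from the model code):

import numpy as np

def partial_rope(x, pos, rot_dims=64, theta=5e6):
    # x: [256] one head; only the first rot_dims dimensions are rotated
    half = rot_dims // 2
    freqs = theta ** (-np.arange(half) / half)        # standard RoPE frequency schedule
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half], x[half:rot_dims]               # NeoX layout: pair dim i with dim i + half
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[rot_dims:]])    # remaining 192 dims stay position-agnostic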

3.4 QK Norm

Both Q and K pass through RMSNorm before RoPE:

Qβ€²=RoPE(RMSNorm(Q)),Kβ€²=RoPE(RMSNorm(K))Q' = \text{RoPE}(\text{RMSNorm}(Q)), \quad K' = \text{RoPE}(\text{RMSNorm}(K))

This is essential for stabilizing attention scores at head_dim = 256.


4. MoE: 512 Expert Top-10 Routing

All 48 layers use the same MoE FFN structure.

4.1 Router

A full-precision (F32) linear projection maps hidden states to 512-dimensional expert space:

\text{logits} = W_{\text{router}} \cdot x, \quad \text{weights} = \text{softmax}(\text{logits}), \quad \text{selected} = \text{TopK}(\text{weights}, k=10)

Selected routing weights are renormalized (norm_top_k_prob=true):

w_i' = \frac{w_i}{\sum_{j \in \text{TopK}} w_j}
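
A sketch of the routing step (illustrative numpy; W_router here stands for the ffn_gate_inp weight):

import numpy as np

def route(x, W_router, k=10):
    # x: [2048], W_router: [512, 2048] -> logits over 512 experts
    logits = W_router @ x
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                        # softmax over all 512 experts
    top = np.argsort(weights)[-k:]                  # indices of the top-10 experts
    top_w = weights[top] / weights[top].sum()       # renormalize (norm_top_k_prob=true)
    return top, top_w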

4.2 Expert FFN

Each expert is a small SiLU-gated FFN:

\text{expert}_i(x) = W_{\text{down}}^{(i)} \left[ \text{SiLU}\!\left(W_{\text{gate}}^{(i)} x\right) \odot W_{\text{up}}^{(i)} x \right]

Projection      Dimensions    Quant   Per-layer size (×512)
──────────────────────────────────────────────────────────────
ffn_gate_exps   2048 → 512    Q4_K    288 MB
ffn_up_exps     2048 → 512    Q4_K    288 MB
ffn_down_exps   512 → 2048    Q6_K    420 MB
Total                                 996 MB

4.3 Shared Expert

A DeepSeek-style shared expert that all tokens pass through, with sigmoid gating:

\text{shared\_out} = \sigma(W_{\text{gate}} \cdot x) \cdot \text{FFN}_{\text{shared}}(x)

Final MoE output = weighted sum of routed experts + shared expert output.
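
Putting the router, the routed experts, and the shared expert together (illustrative sketch; it reuses the route() helper from the routing sketch in section 4.1, and all weight names are placeholders):

import numpy as np

def silu(u):
    return u / (1.0 + np.exp(-u))

def expert_ffn(x, W_gate, W_up, W_down):
    # SiLU-gated FFN: down( silu(gate(x)) * up(x) )
    return W_down @ (silu(W_gate @ x) * (W_up @ x))

def moe_forward(x, expert_weights, W_router, shared_weights, w_shared_gate, k=10):
    top, top_w = route(x, W_router, k)                     # see the routing sketch in 4.1
    routed = sum(w * expert_ffn(x, *expert_weights[i]) for w, i in zip(top_w, top))
    g = 1.0 / (1.0 + np.exp(-(w_shared_gate @ x)))         # sigmoid gate on the shared expert
    return routed + g * expert_ffn(x, *shared_weights)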

4.4 Sparsity Analysis

Metric             Qwen3-CN   DeepSeek-V3   Mixtral
─────────────────────────────────────────────────────
Expert count       512        256           8
Top-K              10         8             2
Activation ratio   1.95%      3.1%          25%
Total params       79.7B      671B          47B
Active params      ~3B        ~37B          ~13B

5. Layer Structure and Data Flow

5.1 Full Forward Pass

Each layer follows a Pre-Norm dual-residual structure:

h_1 = x + \text{Attention}(\text{RMSNorm}_{\text{attn}}(x))
h_2 = h_1 + \text{MoE}(\text{RMSNorm}_{\text{post}}(h_1))

Input x [2048]
  │
  ├─ RMSNorm (attn_norm)
  │    ├─ QKV Projection → Conv1d(k=4) + SiLU → Q, K, V
  │    ├─ Gate Projection → z
  │    ├─ Beta/Alpha → β, gate
  │    ├─ State Update: H̃ = exp(gate) ⊙ H_prev;  H = H̃ + β·k(v − H̃k)ᵀ
  │    ├─ Output: out = H · q
  │    ├─ Gated: RMSNorm(out) ⊙ SiLU(z)
  │    └─ SSM Out Projection [4096 → 2048]
  │
  ├─ Residual: h₁ = x + attn_out
  │
  ├─ RMSNorm (post_attention_norm)
  │    ├─ Router [2048 → 512] → Softmax → TopK(10)
  │    ├─ 10× Expert FFN: SiLU(gate·x) ⊙ up·x → down
  │    ├─ Weighted Sum (normalized weights)
  │    ├─ Shared Expert FFN + Sigmoid Gate
  │    └─ moe_out = routed + shared
  │
  └─ Residual: h₂ = h₁ + moe_out → Output [2048]

5.2 Zero-Centered RMSNorm

All RMSNorm layers (except internal ssm_norm) use the zero-centered variant:

\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot (1 + w)

The weights are stored with the +1 offset already applied, so w in the formula is the original zero-centered weight. When the original weight is 0, the scale is exactly 1 — a property that aids training stability.
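
A sketch of the zero-centered variant, with w taken to be the original zero-centered weight (illustrative):

import numpy as np

def zero_centered_rmsnorm(x, w, eps=1e-6):
    # w is the zero-centered weight; the effective scale is (1 + w)
    rms = np.sqrt(np.mean(x * x) + eps)
    return x / rms * (1.0 + w)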

5.3 Embedding and Output

Component     Dimensions       Quantization
─────────────────────────────────────────────
token_embd    151936 → 2048    Q4_K
output_norm   [2048]           F32
output        2048 → 151936    Q6_K

6. Quantization and Weight Distribution

6.1 Q4_K_M Mixed Quantization

Component                    Quantization   Reason
────────────────────────────────────────────────────────────────────────────
Expert gate/up               Q4_K           Most numerous, aggressive compression
Expert down, attn_v          Q6_K           Directly affects residual precision
Router, Norms                F32            Precision-critical
SSM params (A, dt, conv1d)   F32            Decay rates need high precision

6.2 Weight Distribution: 97% Is MoE Experts

Per-layer breakdown:

Component                                       Per-layer size   Share
────────────────────────────────────────────────────────────────────────
MoE Experts (512)                               ~996 MB          97%
Dense (Attention/SSM + Shared Expert + Norms)   ~25 MB           3%

Full model:

Component               Size       Share
───────────────────────────────────────────
48 layers MoE experts   ~47.8 GB   96%
48 layers dense         ~1.2 GB    2.4%
Embedding + Output      ~0.4 GB    0.8%
Total                   ~49.4 GB

Dense weights total only ~1.2 GB — they easily fit on any GPU. The MoE experts dominate and are the primary target for GPU/CPU offloading strategies.


7. Comparison with Other Hybrid Models

Feature            Jamba 52B         Zamba2 2.7B   Hymba 1.5B       Qwen3-CN 80B
───────────────────────────────────────────────────────────────────────────────────
SSM type           Mamba             Mamba2        Mamba2           GatedDeltaNet
Fusion             Interleaved 7:1   Shared Attn   Parallel heads   Interleaved 3:1
MoE                16E, top-2        None          None             512E, top-10
Total params       52B               2.7B          1.5B             79.7B
Active params      12B               2.7B          1.5B             ~3B
Attn layer share   12.5%             ~15%          50%              25%
State type         Vector            Vector        Vector+KV        Matrix

Summary

Qwen3-Coder-Next 80B deeply fuses SSM, Attention, and MoE:

Component                  Technical choice                                Role
─────────────────────────────────────────────────────────────────────────────────────────────────────────
36 GatedDeltaNet layers    Matrix state + delta rule + exp decay gating    Linear-complexity long-sequence modeling
12 Full Attention layers   Sigmoid gating + 8:1 GQA + Partial RoPE         Precise token retrieval and ICL
48 MoE layers              512 experts top-10 + shared expert              Parameter efficiency: 80B params, 3B compute

This design reflects the current trend: combining complementary mechanisms rather than pushing a single one to its extreme. Linear attention provides efficiency, full attention provides precision, and MoE provides parameter-compute decoupling — each compensating for the others' weaknesses.

Further Reading