
Mixture of Experts: Sparsely Activated Large Model Architecture


Updated 2026-04-06

The core contradiction of dense models: parameter count and computation are linearly coupled. A stronger model requires more parameters, but more parameters mean the computation per token also grows proportionally.

Mixture of Experts (MoE) breaks this coupling: large total parameters, but only a small fraction is activated per token — decoupling parameter count from computation.

Key numbers: Mixtral 8x7B has 47B total parameters, but each token activates only ~13B of them, giving capability approaching a dense 47B model at roughly the inference compute of a 13B model.
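These headline numbers can be sanity-checked with a rough back-of-the-envelope count. The sketch below assumes Mixtral's published dimensions (32 layers, d_model 4096, expert FFN hidden size 14336, a SwiGLU FFN with three weight matrices, grouped-query attention with 8 KV heads of dimension 128, 32k vocabulary) and ignores norms and router weights, which are negligible:

```python
# Rough parameter count for Mixtral 8x7B (dimensions assumed from the published
# config; norms and router weights omitted as negligible).
d_model, d_ff, n_layers, n_experts, top_k, vocab = 4096, 14336, 32, 8, 2, 32000
kv_dim = 1024  # 8 KV heads x 128 head_dim (grouped-query attention)

expert_params = 3 * d_model * d_ff                          # gate, up, down projections
attn_params = 2 * d_model * d_model + 2 * d_model * kv_dim  # q, o + smaller k, v
embed_params = 2 * vocab * d_model                          # input embedding + LM head

total = n_layers * (n_experts * expert_params + attn_params) + embed_params
active = n_layers * (top_k * expert_params + attn_params) + embed_params

print(f"total  ≈ {total / 1e9:.1f}B")   # ≈ 46.7B
print(f"active ≈ {active / 1e9:.1f}B")  # ≈ 12.9B
```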

Sparse MoE Fundamentals

In a standard Transformer, every token passes through the same FFN layer. MoE replaces this FFN with N parallel experts (each expert is a small FFN), plus a router that decides which experts each token is sent to.

Figure: Dense FFN vs MoE FFN. Dense Transformer: Attention → FFN (d_model → 4·d_model → d_model) → Output; total params ≈ active params (e.g. 7B total → 7B active). MoE Transformer: Attention → Router → experts E0–E7 → Output; total params >> active params (e.g. Mixtral: 47B total → ~13B active, top-2 of 8).

Each token is sent to only the Top-K experts (typically K=1 or K=2), with outputs combined via weighted sum:

y = \sum_{i \in \text{TopK}} g_i \cdot E_i(x), \quad g = \text{softmax}(W_g \cdot x)

The MoE forward pass begins with router scoring: the token enters the router, a small linear layer that outputs a softmax probability score for each expert.

Figure: Router scoring. Token x → g = softmax(W_g · x) → per-expert scores, e.g. E0 0.05, E1 0.12, E2 0.35, E3 0.08, E4 0.22, E5 0.03, E6 0.11, E7 0.04.
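A minimal PyTorch sketch of this forward pass, assuming token-choice top-K routing exactly as in the formula above; the MoELayer class and its arguments are illustrative, not taken from any particular implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal token-choice MoE FFN: a router picks the top-k experts per token."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # W_g
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model) -- batch and sequence dims already flattened
        g = F.softmax(self.router(x), dim=-1)             # g = softmax(W_g · x)
        topk_g, topk_idx = g.topk(self.top_k, dim=-1)     # each token's top-k experts
        y = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e             # tokens whose slot-th choice is expert e
                if mask.any():
                    y[mask] += topk_g[mask, slot, None] * expert(x[mask])
        return y
```

Real implementations gather the tokens assigned to each expert into one batched matmul instead of looping, but the result is the same weighted sum over the selected experts.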

Router Mechanism

The router design directly determines MoE effectiveness. There are two main approaches:

Token-Choice
Figure: Token-choice routing: each token picks its top-K experts. The tokens "I", "love", "large", "models" each route to experts among E0–E3; simple from each token's perspective, but expert load may be imbalanced. Adopters: Mixtral (top-2), Switch Transformer (top-1), DeepSeek-V3 (top-8).

Token-choice routing (the mainstream approach): each token picks its top-K experts. Simple and efficient, but load may be uneven — some “popular” experts get selected by many tokens while others sit idle.

Expert-choice routing: reversed — each expert picks its top-K tokens. Load is naturally balanced, but some tokens may not be selected by any expert and get dropped.

In practice, token-choice is more common: Mixtral (top-2), Switch Transformer (top-1), DeepSeek-V3 (top-8).
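The contrast between the two schemes comes down to which axis the top-k is taken over; a toy sketch (the scores and sizes are made up):

```python
import torch

n_tokens, n_experts, k = 6, 4, 2
scores = torch.randn(n_tokens, n_experts).softmax(dim=-1)  # router probabilities

# Token-choice: each token (row) picks its top-k experts.
# Load can be uneven -- a popular expert may show up in many rows.
token_choice = scores.topk(k, dim=-1).indices   # (n_tokens, k): experts per token

# Expert-choice: each expert (column) picks its top-k tokens.
# Load is balanced by construction, but a token may appear in no column (dropped).
expert_choice = scores.topk(k, dim=0).indices   # (k, n_experts): tokens per expert

print("token-choice :", token_choice.tolist())
print("expert-choice:", expert_choice.t().tolist())
```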

Load Balancing

Without constraints, the router tends to concentrate most tokens on a few experts — the rich-get-richer phenomenon (also called expert collapse). This is clearly undesirable.

The solution is to add an auxiliary loss that encourages even expert load:

L_{aux} = \alpha \cdot N \cdot \sum_i f_i \cdot P_i

Here f_i is the fraction of tokens actually received by expert i, and P_i is the average probability the router assigns to expert i. A larger \alpha produces stronger balancing, but may hurt model quality (forced uniformity is not always natural).

The chart below shows the effect of \alpha on the expert load distribution:

Figure: Expert load distribution at α = 0 (no aux loss): a few experts handle most tokens (expert collapse), e.g. E0 55.2%, E1 24.8%, E2 11.1%, E3 5.0%, E4 2.2%, E5 1.0%, E6 0.5%, E7 0.2%, versus the ideal uniform share of 12.5% per expert.
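A sketch of this auxiliary loss in the top-1 (Switch Transformer style) setting; the function name and the default α value are illustrative:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, expert_idx: torch.Tensor,
                        n_experts: int, alpha: float = 0.01) -> torch.Tensor:
    """L_aux = alpha * N * sum_i f_i * P_i (top-1 assignment assumed).

    router_logits: (n_tokens, n_experts) raw router scores
    expert_idx:    (n_tokens,) expert each token was actually dispatched to
    """
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens actually received by expert i
    f = torch.bincount(expert_idx, minlength=n_experts).float() / expert_idx.numel()
    # P_i: average probability the router assigns to expert i
    P = probs.mean(dim=0)
    return alpha * n_experts * torch.sum(f * P)
```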

Another technique is expert capacity: limiting each expert to process at most C tokens per batch, with overflow tokens taking the residual path (skipped).
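A sketch of how capacity-based dropping can be applied, assuming top-1 assignment and a capacity factor of 1.25 (a common but not universal choice):

```python
import torch

def apply_capacity(expert_idx: torch.Tensor, n_experts: int,
                   capacity_factor: float = 1.25) -> torch.Tensor:
    """Mark overflow tokens as dropped (-1) once an expert exceeds its capacity C.

    C = capacity_factor * n_tokens / n_experts. Dropped tokens keep only the
    residual connection, i.e. the MoE layer contributes nothing for them.
    """
    n_tokens = expert_idx.numel()
    capacity = int(capacity_factor * n_tokens / n_experts)
    kept = expert_idx.clone()
    for e in range(n_experts):
        positions = (expert_idx == e).nonzero(as_tuple=True)[0]
        kept[positions[capacity:]] = -1  # everything beyond the first C tokens overflows
    return kept
```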

DeepSeek’s Innovations

DeepSeek introduced two key improvements to MoE:

Figure: DeepSeek MoE: shared expert + fine-grained routed experts (DeepSeek-V3: 1 shared expert + 256 routed experts, top-8). Token x flows through the shared expert, which every token passes through as a fallback for general knowledge, and through the router (top-8) over routed experts R0 … ×256, which provide fine-grained specialization via top-K selection; the two contributions are combined into the output.

Shared Expert

Some experts are traversed by all tokens (bypassing routing). These shared experts serve as a “safety net” — ensuring that foundational capabilities (general knowledge, grammar, common sense) are not lost due to routing fragmentation.

Fine-Grained Expert

Using more but smaller experts enables finer-grained specialization. DeepSeek-V2 uses 160 routed experts (compared to Mixtral’s 8), where each expert is smaller but more specialized.

Specific configurations:

  • DeepSeek-V2: 160 routed + 2 shared, top-6
  • DeepSeek-V3: 256 routed + 1 shared, top-8
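Putting the two ideas together, here is a sketch of a DeepSeek-style layer combining always-on shared experts with many small routed experts. It reuses the hypothetical MoELayer from the earlier sketch, and the default sizes are illustrative rather than DeepSeek's actual configuration:

```python
import torch
import torch.nn as nn

class DeepSeekStyleMoE(nn.Module):
    """Shared expert(s) applied to every token, plus many small routed experts."""

    def __init__(self, d_model=1024, d_ff=256, n_routed=64, n_shared=1, top_k=8):
        super().__init__()
        # Shared experts: bypass the router entirely -- a general-knowledge safety net.
        self.shared = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_shared)
        )
        # Routed experts: many small experts, top-k selected per token
        # (MoELayer is the sketch from the Sparse MoE Fundamentals section above).
        self.routed = MoELayer(d_model, d_ff, n_experts=n_routed, top_k=top_k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shared_out = sum(expert(x) for expert in self.shared)  # every token takes this path
        return shared_out + self.routed(x)                     # plus the top-k routed contribution
```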

Expert Parallelism

256 experts cannot fit on a single GPU — Expert Parallelism (EP) is needed: different experts are distributed across different GPUs.

Figure: Expert parallelism: experts distributed across GPUs. GPU 0 hosts Experts 0–1, GPU 1 hosts Experts 2–3, GPU 2 hosts Experts 4–5, GPU 3 hosts Experts 6–7, each with a token buffer. All-to-all communication: token dispatch → expert compute → result combine; tokens on each GPU are sent to the GPU hosting their target expert, and results are sent back after computation.

This introduces All-to-All communication: tokens on each GPU need to be sent to the GPU hosting the corresponding expert (dispatch), and results are sent back after expert computation (combine).
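A conceptual sketch of dispatch/combine built on torch.distributed.all_to_all_single, assuming top-1 routing, a single EP group spanning all ranks, and an already-initialized process group. Real systems batch these exchanges, make them autograd-aware, and overlap them with computation; the function below only illustrates the data movement:

```python
import torch
import torch.distributed as dist

def moe_dispatch_combine(x: torch.Tensor, expert_idx: torch.Tensor,
                         local_experts: torch.nn.ModuleList, experts_per_rank: int):
    """x: (n_tokens, d_model) tokens on this rank; expert_idx: (n_tokens,) global expert ids."""
    world = dist.get_world_size()
    dst_rank = expert_idx // experts_per_rank          # rank hosting each token's expert

    # --- dispatch: sort tokens by destination rank, exchange counts, then exchange tokens
    order = torch.argsort(dst_rank)
    x_sorted, idx_sorted = x[order], expert_idx[order]
    send_counts = torch.bincount(dst_rank, minlength=world)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)   # how many tokens each rank sends us

    recv_x = x.new_empty((int(recv_counts.sum()), x.shape[1]))
    recv_idx = idx_sorted.new_empty(int(recv_counts.sum()))
    for out_buf, in_buf in ((recv_x, x_sorted), (recv_idx, idx_sorted)):
        dist.all_to_all_single(out_buf, in_buf,
                               output_split_sizes=recv_counts.tolist(),
                               input_split_sizes=send_counts.tolist())

    # --- local expert compute on whatever tokens landed on this rank
    out_x = torch.empty_like(recv_x)
    first_local = dist.get_rank() * experts_per_rank
    for local_e, expert in enumerate(local_experts):
        mask = recv_idx == first_local + local_e
        if mask.any():
            out_x[mask] = expert(recv_x[mask])

    # --- combine: send results back along the reverse route, then undo the sort
    back = x.new_empty(x.shape)
    dist.all_to_all_single(back, out_x,
                           output_split_sizes=send_counts.tolist(),
                           input_split_sizes=recv_counts.tolist())
    y = torch.empty_like(x)
    y[order] = back
    return y
```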

Expert Parallelism is typically combined with Tensor Parallelism (TP) and Pipeline Parallelism (PP):

  • TP: Splits a single expert’s matrices across multiple GPUs
  • EP: Places different experts on different GPUs
  • PP: Places different layers on different GPUs

More experts and more GPUs mean heavier all-to-all communication — this is the primary deployment challenge for MoE.

Model Comparison Summary

Comparison of mainstream MoE model configurations:

| Model | Total Params | Active Params | Experts | Top-K | Shared | Year |
|---|---|---|---|---|---|---|
| Switch Transformer | 1.6T | ~26B | 2048 | 1 | – | 2021 |
| Mixtral 8x7B | 47B | ~13B | 8 | 2 | – | 2024 |
| Mixtral 8x22B | 141B | ~39B | 8 | 2 | – | 2024 |
| DeepSeek-V2 | 236B | 21B | 160 | 6 | 2 | 2024 |
| DeepSeek-V3 | 671B | 37B | 256 | 8 | 1 | 2024 |
| Qwen2.5-MoE | 57B | 14B | 64 | 8 | 8 | 2025 |