Transformer Core Mechanisms
A deep dive into every core mechanism of the Transformer, from the overall architecture and attention to inference optimization and hybrid designs
1. Transformer Architecture Overview
   Intermediate · #transformer #architecture
2. QKV Data Structures and Intuition
   Intermediate · #transformer #attention #qkv
3. Attention Computation in Detail
   Intermediate · #transformer #attention #softmax
4. Multi-Head Attention
   Intermediate · #transformer #attention #multi-head
5. MQA and GQA
   Advanced · #transformer #attention #mqa #gqa #kv-cache
6. Attention Variants: From Sliding Window to MLA
   Advanced · #transformer #attention #mla #sliding-window #cross-attention
7. KV Cache Fundamentals
   Advanced · #inference #kv-cache #memory #optimization
8. Prefill vs Decode Phases
   Intermediate · #inference #prefill #decode #performance
9. Flash Attention Tiling Principles
   Advanced · #attention #hardware-optimization #flash-attention #memory
10. Positional Encoding: Giving Transformers a Sense of Order
    Intermediate · #transformer #attention #positional-encoding
11. Sampling & Decoding: From Probabilities to Text
    Intermediate · #inference #sampling #decoding #perplexity
12. Speculative Decoding: Accelerating LLM Inference via Guessing
    Advanced · #inference #optimization #speculative-decoding
13. Mixture of Experts: Sparsely Activated Large Model Architecture
    Advanced · #transformer #moe #routing #deepseek #mixtral
14. State Space Models and Mamba
    Advanced · #ssm #mamba #state-space-model #selective-scan #sequence-modeling
15. Hybrid Architectures: Fusing Mamba with Attention
    Advanced · #hybrid #mamba #jamba #zamba #hymba #architecture
16. Qwen3-Coder-Next Architecture: When SSM, Attention, and MoE Converge
    Advanced · #hybrid #moe #ssm #deltanet #qwen #architecture