
Transformer Architecture Overview


Updated 2026-04-08

Introduction

The Transformer is a neural network architecture proposed by Vaswani et al. in 2017 in the paper “Attention Is All You Need”. It is entirely based on the Attention mechanism, abandoning the previously dominant recurrent (RNN) and convolutional (CNN) structures, and has become the foundation of modern large language models (LLMs).

Well-known models such as GPT, LLaMA, and BERT are all built on the Transformer. Understanding the overall architecture of the Transformer is a prerequisite for diving deeper into advanced topics like Attention, KV Cache, and Flash Attention.

Intuitive Understanding: From RNN to Transformer

The Bottleneck of RNNs

RNNs (including LSTMs and GRUs) process sequences in temporal order: $t_1 \to t_2 \to \cdots \to t_n$. Each timestep must wait for the previous step to complete, resulting in two core problems:

  1. No parallelism: For a sequence of length $n$, computation requires $O(n)$ sequential steps, unable to fully utilize GPU parallelism.
  2. Long-range dependency decay: Information must be passed step by step, and is prone to loss or decay after multiple steps.

The Attention Solution

The Transformer replaces the recurrent structure with Self-Attention: each position can directly attend to any other position in the sequence, completing global information aggregation in one step.

  • Parallelism: Attention for all positions can be computed simultaneously
  • Long-range: The path length between any two positions is $O(1)$

This design enables the Transformer to far surpass RNNs in long sequence modeling and training efficiency.

Architecture Overview

The diagram below shows the internal structure of a Pre-LayerNorm variant of a Transformer Block. Modern LLMs (GPT-2, LLaMA, etc.) almost all adopt this variant, rather than the Post-LayerNorm from the original paper.

Transformer Block ×N: Input Embeddings (B, S, H) → Positional Encoding → LayerNorm → Multi-Head Self-Attention → ⊕ Residual Add → LayerNorm → Feed-Forward (MLP) (B, S, 4H → B, S, H) → ⊕ Residual Add → Output / Next Block, with residual connections around both sublayers.
Pre-LayerNorm Transformer Block Architecture (Common Variant in Modern LLMs)

A complete Transformer model consists of $N$ such Blocks stacked together. The input first goes through Embedding and Positional Encoding, then passes through each Block layer by layer, and finally outputs for downstream tasks (such as next-token prediction in language modeling).

The diagram below shows the complete architecture of a Decoder-only LLM like GPT: the full pipeline from input text to next-token prediction, using the GPT-2 Small configuration as an example.

Input Text → Tokenizer (B, S) → Token Embedding [V×H] (B, S, 768) → + Positional Encoding (learned absolute) → ×12 Transformer Blocks → Final LayerNorm → LM Head (Linear) (B, S, 50257) → Softmax → Next Token. Block internals: LN → Self-Attention → ⊕ Residual → LN → FFN (MLP, GELU) → ⊕ Residual.

Example configuration (GPT-2 Small): 12 layers, hidden dim 768, vocab size 50,257, context length 1,024, ~117M total parameters.
Complete Decoder-only LLM Architecture — Full data flow from input text to next token prediction

Does one Transformer Block count as “one layer”? It depends on context. When the industry refers to “layers” (e.g., LLaMA-7B has 32 layers), it means the number of Blocks — each Block internally contains two sub-layers (Self-Attention + FFN), but the entire Block counts as one layer. So “32 layers” = 32 Blocks = 64 sub-layers. The n_layers / num_hidden_layers parameter in papers and code refers to the number of Blocks.

Component Details

Input Embedding + Positional Encoding

Token Embedding maps discrete token IDs to continuous vectors:

$$\mathbf{x}_i = \text{Embed}(\text{token}_i) \in \mathbb{R}^{H}$$

where $H$ is the hidden size. For the entire sequence, the input tensor shape is:

Input: (B = batch, S = seq_len, H = hidden)

Positional Encoding injects positional information for each position. The original Transformer uses fixed sinusoidal/cosine encoding:

$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
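To make the formula concrete, here is a minimal NumPy sketch of the sinusoidal encoding; the function name and shapes are illustrative, not from any particular library:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sin/cos positional encoding from the original paper.

    Returns an array of shape (seq_len, d_model): even dimensions use sin,
    odd dimensions use cos, each pair at a different frequency.
    """
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # 2i values: (1, d_model/2)
    angle_rates = 1.0 / (10000 ** (dims / d_model))   # 1 / 10000^(2i/d_model)
    angles = positions * angle_rates                  # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)   # PE(pos, 2i+1)
    return pe

# Example: GPT-2-sized hidden dimension, short sequence
pe = sinusoidal_positional_encoding(seq_len=8, d_model=768)
print(pe.shape)  # (8, 768)
```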

Modern models have developed various improved approaches:

| Approach | Representative Models | Characteristics |
| --- | --- | --- |
| Sinusoidal/Cosine (fixed) | Original Transformer | Not learnable, supports extrapolation |
| Learnable Absolute Position Encoding | GPT-2, BERT | Directly learn a vector for each position |
| RoPE (Rotary Position Embedding) | LLaMA, Qwen | Encodes relative positions via rotation in Attention computation |
Visual comparison of the three approaches:

  • Sinusoidal (fixed): different frequency waves per dimension; fixed and non-learnable; supports extrapolation to longer sequences.
  • Learned absolute position: learns a vector for each position during training; simple, but extrapolates poorly.
  • RoPE (rotary): encodes relative position through a rotation applied during the Attention computation (adjacent tokens are rotated by θ), balancing absolute position and relative distance; see the sketch below.
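The following is a simplified NumPy sketch of the RoPE idea: it rotates consecutive dimension pairs of a query or key vector by a position-dependent angle. The function name, the base of 10000, and the shapes follow the common formulation but are illustrative only:

```python
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by a position-dependent angle.

    x: (seq_len, head_dim) query or key vectors for one head (head_dim even).
    positions: (seq_len,) integer positions.
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))  # per-pair rotation frequencies
    angles = positions[:, None] * freqs[None, :]        # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)

    x1, x2 = x[:, 0::2], x[:, 1::2]                     # split into dimension pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                  # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Rotated q/k dot products then depend only on relative position
q = apply_rope(np.random.randn(5, 64), np.arange(5))
k = apply_rope(np.random.randn(5, 64), np.arange(5))
print(q.shape, k.shape)  # (5, 64) (5, 64)
```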

LayerNorm

Layer Normalization normalizes the hidden vector for each token:

$$\text{LayerNorm}(\mathbf{x}) = \frac{\mathbf{x} - \mu}{\sigma + \epsilon} \cdot \gamma + \beta$$

where $\mu, \sigma$ are the mean and standard deviation computed along the hidden dimension, and $\gamma, \beta$ are learnable scale and shift parameters.

Pre-LN vs Post-LN:

| Variant | Formula | Characteristics |
| --- | --- | --- |
| Post-LN (original paper) | $\text{LN}(\mathbf{x} + \text{Attn}(\mathbf{x}))$ | LayerNorm is applied after the residual addition |
| Pre-LN (modern variant) | $\mathbf{x} + \text{Attn}(\text{LN}(\mathbf{x}))$ | LayerNorm is applied before the sublayer |
  • The original Transformer (2017) uses Post-LN: $\text{LN}(\mathbf{x} + \text{SubLayer}(\mathbf{x}))$
  • Starting with GPT-2, most models switched to Pre-LN: $\mathbf{x} + \text{SubLayer}(\text{LN}(\mathbf{x}))$, because it offers more stable training
  • LLaMA further replaced LayerNorm with RMSNorm, removing mean centering and bias terms for more efficient computation (see the sketch below)
In Post-LN (original), gradients must pass through each LayerNorm on the residual path; in Pre-LN (modern), the residual shortcut forms a "gradient highway" that bypasses the normalization.
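As a rough comparison, here is a minimal NumPy sketch of LayerNorm versus RMSNorm as described above: both normalize each token's hidden vector along the hidden dimension, but RMSNorm drops the mean centering and bias. Function names are illustrative:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Zero-mean, unit-std normalization per token, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm (LLaMA style): no mean centering, no bias, only rescale by the RMS."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma

H = 768
x = np.random.randn(1, 4, H)                           # (B, S, H)
print(layer_norm(x, np.ones(H), np.zeros(H)).shape)    # (1, 4, 768)
print(rms_norm(x, np.ones(H)).shape)                   # (1, 4, 768)
```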

Self-Attention (Overview)

Self-Attention is the core of the Transformer. Each position interacts with all positions in the sequence through three projections: Query, Key, and Value:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

where $d_k$ is the dimension of each attention head. Multi-Head Attention splits the hidden dimension into multiple heads for parallel computation, then concatenates the outputs:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

The details of Self-Attention (QKV data structures, computation flow, Multi-Head mechanism) will be covered in subsequent articles.
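As a preview of those articles, here is a minimal NumPy sketch of the attention formula above (single head, no mask; names and shapes are illustrative, not a production implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: (..., seq_len, d_k); leading dims can be (batch, heads).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (..., S, S)
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V                               # (..., S, d_k)

# Single-head example: S=4 tokens, d_k=64
S, d_k = 4, 64
Q, K, V = (np.random.randn(S, d_k) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 64)
```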

Feed-Forward Network (MLP)

The FFN in each Transformer Block is a position-wise two-layer fully connected network:

$$\text{FFN}(\mathbf{x}) = \text{GELU}(\mathbf{x} W_1 + b_1) W_2 + b_2$$
  • $W_1 \in \mathbb{R}^{H \times 4H}$: expands the dimension from $H$ to $4H$ (intermediate dimension)
  • $W_2 \in \mathbb{R}^{4H \times H}$: compresses the dimension from $4H$ back to $H$
  • Activation function: The original paper uses ReLU, GPT-2 uses GELU, LLaMA uses SwiGLU

The Evolution of Activation Functions: ReLU → GELU → SwiGLU

The choice of activation function directly impacts FFN expressiveness. LLMs have gone through three generations:

ReLU, GELU, and Swish (SiLU) differ mainly in how smoothly they handle the region around zero.

Swish (SiLU): $f(x) = x \cdot \sigma(x)$
  • Pros: smooth; self-gated, meaning the input itself controls how much passes through; the basis for SwiGLU
  • Cons: negative values in the negative region may introduce noise; on its own, not as good as SwiGLU
  • Used by: the activation inside SwiGLU → LLaMA, Mistral, Gemma

FFN Architecture Comparison

Standard FFN (2 weight matrices): x → W₁ (H→4H) → GELU → W₂ (4H→H) → out. SwiGLU FFN (3 weight matrices): x → W_gate (H→d) → Swish and x → W_up (H→d), multiplied elementwise, then → W_down (d→H) → out.

Why does SwiGLU win? The key is the gating mechanism (Gated Linear Unit). A standard FFN uses 2 weight matrices: $\mathbf{x} W_1$ goes through an activation and is then multiplied by $W_2$. SwiGLU uses 3 weight matrices, with the additional $W_{\text{gate}}$ forming a data-dependent gate:

$$\text{SwiGLU}(\mathbf{x}) = \underbrace{\text{Swish}(\mathbf{x} W_{\text{gate}})}_{\text{gate: controls "how much" passes}} \odot \underbrace{(\mathbf{x} W_{\text{up}})}_{\text{value: provides "what" passes}}$$

This gating allows the network to selectively suppress or amplify information across different dimensions, while ReLU/GELU only apply independent nonlinear transformations per dimension. Shazeer (2020, “GLU Variants Improve Transformer”) demonstrated that SwiGLU significantly outperforms GELU and ReLU in perplexity at the same parameter count.

The parameter trade-off: SwiGLU adds an extra $W_{\text{gate}}$ matrix (3 matrices vs 2), so to maintain the same total parameter count, the FFN intermediate dimension is reduced from $4H$ to approximately $\frac{8}{3}H$. This is why LLaMA-7B's intermediate dimension is 11008 ($\approx \frac{8}{3} \times 4096$) rather than the GPT-style $4 \times 4096 = 16384$.
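A minimal NumPy sketch contrasting the two FFN variants and their parameter counts; the weight names (W_gate / W_up / W_down) follow the convention used above, and the exact LLaMA implementation differs in details:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def swish(x):
    return x / (1 + np.exp(-x))          # x * sigmoid(x), also called SiLU

def ffn_gelu(x, W1, b1, W2, b2):
    """Standard FFN: 2 matrices, H -> 4H -> H."""
    return gelu(x @ W1 + b1) @ W2 + b2

def ffn_swiglu(x, W_gate, W_up, W_down):
    """SwiGLU FFN: 3 matrices, gate * value, H -> d -> H (no biases, LLaMA style)."""
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

# Parameter budgets are roughly equal at LLaMA-7B scale
H = 4096
d_std = 4 * H            # 16384, GPT-style intermediate dim
d_glu = 11008            # ~(8/3)H, LLaMA-7B intermediate dim
params_std = 2 * H * d_std   # W1 + W2
params_glu = 3 * H * d_glu   # W_gate + W_up + W_down
print(params_std, params_glu)   # 134217728 vs 135266304
```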

Tensor shape transformation process:

FFN Input: (B = batch, S = seq_len, H = hidden) → through $W_1$ → Hidden Layer: (B, S, 4H = intermediate) → through $W_2$ → FFN Output: (B, S, H)

Input (B, S, H) → Linear₁ (B, S, 4H) → GELU (B, S, 4H) → Linear₂ (B, S, H): a "diamond" structure that expands, then compresses.

Residual Connection

Each sublayer (Self-Attention and FFN) is wrapped with a residual connection:

$$\text{output} = \mathbf{x} + \text{SubLayer}(\mathbf{x})$$

The role of residual connections:

  1. Mitigate gradient vanishing: Gradients can be directly backpropagated through “skip connections”, avoiding gradient decay in deep networks
  2. Preserve information flow: Even if the sublayer output is zero, the original information is not lost
  3. Enable deep stacking: Without residual connections, training 32 or even 126 layers would be nearly impossible

Note: Residual connections require that the input and output dimensions of the sublayer are the same, which is why the FFN output dimension must return to HH.
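A small sketch of the Pre-LN residual wrapper, illustrating how the skip connection preserves information even when the sublayer contributes nothing; this assumes the Pre-LN form used by modern LLMs, and the names are illustrative:

```python
import numpy as np

def pre_ln_residual(x, sublayer, gamma, eps=1e-5):
    """Pre-LN residual wrapper: output = x + SubLayer(LN(x)).

    `sublayer` is any shape-preserving function (B, S, H) -> (B, S, H),
    e.g. self-attention or the FFN. The skip connection keeps the original
    signal even if the sublayer output is zero.
    """
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    normed = (x - mu) / (sigma + eps) * gamma
    return x + sublayer(normed)          # shapes must match for the addition

H = 768
x = np.random.randn(1, 16, H)
out = pre_ln_residual(x, sublayer=lambda h: np.zeros_like(h), gamma=np.ones(H))
print(np.allclose(out, x))   # True: zero sublayer output, original information preserved
```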

Tensor Shape Tracking

Taking GPT-2 Small ($H = 768$, sequence length $S = 1024$, batch size $B = 1$) as an example, track the complete data flow:

| Stage | Tensor Shape | Description |
| --- | --- | --- |
| Token IDs | (1, 1024) | Integer sequence |
| Token Embedding | (1, 1024, 768) | Vectors from lookup table |
| + Positional Encoding | (1, 1024, 768) | Add positional vectors |
| LayerNorm | (1, 1024, 768) | Shape unchanged |
| Q, K, V Projections | each (1, 1024, 768) | Linear transformation |
| Split into 12 heads | each (1, 12, 1024, 64) | 768 / 12 = 64 per head |
| Attention Output | (1, 1024, 768) | Concatenate all heads |
| Residual Add | (1, 1024, 768) | Add input |
| LayerNorm | (1, 1024, 768) | Shape unchanged |
| FFN Hidden Layer | (1, 1024, 3072) | 768 × 4 = 3072 |
| FFN Output | (1, 1024, 768) | Compress back to original dimension |
| Residual Add | (1, 1024, 768) | Add input |

This process repeats 12 times (12 layers), and the final output tensor of shape (1, 1024, 768) is sent to the LM Head for next-token prediction.
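The shapes in the table can be reproduced with a short NumPy sketch at GPT-2 Small dimensions; the weights here are random placeholders, not the real model:

```python
import numpy as np

# GPT-2 Small configuration (as in the table above)
B, S, H, n_heads = 1, 1024, 768, 12
head_dim = H // n_heads                              # 64

token_ids = np.random.randint(0, 50257, size=(B, S))         # (1, 1024)
embedding_table = np.random.randn(50257, H) * 0.02
x = embedding_table[token_ids]                                # (1, 1024, 768)

# QKV projections and head split
W_qkv = np.random.randn(H, 3 * H) * 0.02
q, k, v = np.split(x @ W_qkv, 3, axis=-1)                     # each (1, 1024, 768)
q = q.reshape(B, S, n_heads, head_dim).transpose(0, 2, 1, 3)  # (1, 12, 1024, 64)

# FFN expansion and compression (ReLU used as a stand-in activation)
W1 = np.random.randn(H, 4 * H) * 0.02                         # 768 -> 3072
W2 = np.random.randn(4 * H, H) * 0.02                         # 3072 -> 768
h = np.maximum(x @ W1, 0) @ W2                                # (1, 1024, 768)

print(token_ids.shape, x.shape, q.shape, h.shape)
# (1, 1024) (1, 1024, 768) (1, 12, 1024, 64) (1, 1024, 768)
```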

Encoder-Decoder vs Decoder-only

Original Transformer: Encoder-Decoder

The original Transformer was designed for machine translation, adopting an Encoder-Decoder structure:

  • Encoder (6 layers): Bidirectional Self-Attention on the source sequence (each position can attend to all positions)
  • Decoder (6 layers):
    • Masked Self-Attention: Can only attend to the current position and previous positions (causal mask), preventing information leakage
    • Cross-Attention: Attends to Encoder output to obtain source sequence information

Decoder-only: The Mainstream for Modern LLMs

The GPT series pioneered the Decoder-only architecture, removing the Encoder and Cross-Attention:

  • Only retains Masked Self-Attention (causal attention)
  • Each token can only see itself and previous tokens
  • Unifies “understanding” and “generation”: uses the same architecture for all tasks

| Structure | Representative Models | Attention Type | Typical Applications |
| --- | --- | --- | --- |
| Encoder-Decoder | Original Transformer, T5, BART | Bidirectional + Causal + Cross | Translation, Summarization |
| Encoder-only | BERT, RoBERTa | Bidirectional | Classification, NLU |
| Decoder-only | GPT series, LLaMA, Qwen | Causal (Masked) | Text Generation, General LLMs |
The three attention patterns compared:

  • Bidirectional (Encoder): all positions are visible to each other; used by BERT and other Encoder models.
  • Causal (Decoder-only): each token can only see itself and previous positions; used by GPT, LLaMA, etc.
  • Cross (Encoder-Decoder): the Decoder can see all Encoder positions; used by the original Transformer and T5.

Currently, almost all mainstream LLMs (GPT-4, Claude, LLaMA, Qwen, Gemini) adopt the Decoder-only architecture.
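A minimal sketch of the causal mask used by Decoder-only models: future positions are set to negative infinity before the softmax, so each token attends only to itself and earlier tokens (illustrative NumPy, not any specific framework's API):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_scores(Q, K):
    """Apply the causal mask to raw attention scores before the softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (S, S)
    mask = causal_mask(scores.shape[0])
    return np.where(mask, scores, -np.inf)     # future positions get -inf

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```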

Comparison of Typical Model Hyperparameters

| Parameter | Original Transformer | GPT-2 Small | GPT-2 XL | LLaMA-7B | LLaMA-3.1-8B |
| --- | --- | --- | --- | --- | --- |
| Hidden Dimension $H$ | 512 | 768 | 1600 | 4096 | 4096 |
| Number of Layers $L$ | 6 (Enc+Dec) | 12 | 48 | 32 | 32 |
| Attention Heads $h$ | 8 | 12 | 25 | 32 | 32 |
| Head Dimension | 64 | 64 | 64 | 128 | 128 |
| FFN Hidden Dimension | 2048 | 3072 | 6400 | 11008 | 14336 |
| Vocabulary Size | 37000 | 50257 | 50257 | 32000 | 128256 |
| Context Length | N/A | 1024 | 1024 | 2048 | 131072 |
| Total Parameters | 65M | 117M | 1.5B | 6.7B | 8B |
| LayerNorm | Post-LN | Pre-LN | Pre-LN | Pre-RMSNorm | Pre-RMSNorm |
| Activation Function | ReLU | GELU | GELU | SwiGLU | SwiGLU |
| Positional Encoding | Sinusoidal/Cosine | Learnable Absolute | Learnable Absolute | RoPE | RoPE |

Note: Head Dimension = $H / h$. The LLaMA series uses GQA (Grouped-Query Attention), where LLaMA-3.1-8B has 8 KV heads (rather than 32); see the sketch below.
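A tiny sketch of the GQA head layout for LLaMA-3.1-8B as listed above: 32 query heads share 8 KV heads, so each KV head is broadcast to a group of 4 query heads (illustrative only):

```python
import numpy as np

# LLaMA-3.1-8B attention head layout (values from the table above)
H, n_q_heads, n_kv_heads = 4096, 32, 8
head_dim = H // n_q_heads                 # 128
group_size = n_q_heads // n_kv_heads      # 4 query heads share each KV head

S = 16
q = np.random.randn(n_q_heads, S, head_dim)    # (32, 16, 128)
k = np.random.randn(n_kv_heads, S, head_dim)   # (8, 16, 128)

# Broadcast each KV head to its group of query heads before the usual attention
k_expanded = np.repeat(k, group_size, axis=0)  # (32, 16, 128)
print(k_expanded.shape)
```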

If you want to dive deeper into the Transformer architecture, here are our curated resources:

Classic Papers

  • Vaswani et al. “Attention Is All You Need” — The foundational paper for the Transformer architecture, proposing self-attention and multi-head attention mechanisms, the source of all subsequent work.
  • Lilian Weng “The Transformer Family Version 2.0” — A systematic overview of various Transformer variants and improvements, covering long context, efficient attention, adaptive modeling, and other directions, with rich references.

Video Courses

  • 3Blue1Brown — Attention in Transformers (Neural Networks series) — Known for beautiful mathematical animations, provides an intuitive explanation of the geometric meaning of the attention mechanism. Chapter 6 focuses on attention, with excellent visual effects.
  • Andrej Karpathy “Neural Networks: Zero to Hero” — Build GPT from scratch, the “Let’s build GPT” episode (about 2 hours) implements the Transformer from the ground up, the best resource for understanding implementation details. Comes with GitHub code and Jupyter Notebook.

Blogs and Tutorials (Illustrated)

  • Jay Alammar “The Illustrated Transformer” — An exemplary work of illustrated content. Through dozens of carefully drawn diagrams, it step-by-step deconstructs the Transformer architecture, including encoder-decoder stacks, Q/K/V computation in self-attention, multi-head attention, positional encoding, etc. Widely recognized as the best illustrated introduction.
  • Jay Alammar “Visualizing Neural Machine Translation” — Prerequisite reading for Illustrated Transformer, illustrating how seq2seq + attention mechanism works, helping understand the origin of attention.
  • Harvard NLP “The Annotated Transformer” — Line-by-line code annotation implementation of the original paper (PyTorch), matching formulas from the paper with code one-to-one, suitable for readers who want to understand the meaning of every line of code.

Interactive Experiments

  • Transformer Explainer (Georgia Tech / Polo Club) — Run GPT-2 model in browser, observe the attention computation process in real-time. Published at IEEE VIS 2024, with excellent interactive experience. (poloclub.github.io/transformer-explainer/)
  • Brendan Bycroft “LLM Visualization” — 3D interactive visualization of GPT inference process, layer-by-layer observation of data flow and matrix operations, with stunning visual effects. (bbycroft.net/llm)
  • Financial Times “Generative AI Explained” — Visual narrative created by FT data journalism team, showcasing how LLMs work with beautiful interactive animations. (ig.ft.com/generative-ai/)

Summary

This article outlines the overall architecture of the Transformer:

  1. Core idea: Replace recurrent structure with Self-Attention, enabling parallel computation and global information aggregation
  2. Block composition: Each Transformer Block contains LayerNorm → Self-Attention → Residual → LayerNorm → FFN → Residual
  3. Modern evolution: From Post-LN to Pre-LN, from LayerNorm to RMSNorm, from ReLU to SwiGLU, from fixed positional encoding to RoPE
  4. Architecture choice: Modern LLMs almost all adopt the Decoder-only architecture

In subsequent articles, we will dive into the details of each component:

  • QKV Data Structures and Intuition: Understanding what Query, Key, and Value really are
  • Attention Computation Details: The complete flow of the $QK^T$ matrix multiplication
  • Multi-Head Attention: The principles and implementation of parallel multi-head computation
  • KV Cache: Key technology for inference acceleration