
Transformer Architecture Overview


Updated 2026-04-08

Introduction

The Transformer is a neural network architecture proposed by Vaswani et al. in 2017 in the paper “Attention Is All You Need”. It is entirely based on the Attention mechanism, abandoning the previously dominant recurrent (RNN) and convolutional (CNN) structures, and has become the foundation of modern large language models (LLMs).

Well-known models such as GPT, LLaMA, and BERT are all built on the Transformer. Understanding the overall architecture of the Transformer is a prerequisite for diving deeper into advanced topics like Attention, KV Cache, and Flash Attention.

Intuitive Understanding: From RNN to Transformer

The Bottleneck of RNNs

RNNs (including LSTMs and GRUs) process sequences in temporal order: $t_1 \to t_2 \to \cdots \to t_n$. Each timestep must wait for the previous step to complete, resulting in two core problems:

  1. No parallelism: For a sequence of length $n$, computation requires $O(n)$ sequential steps, unable to fully utilize GPU parallelism.
  2. Long-range dependency decay: Information must be passed step by step, and is prone to loss or decay after multiple steps.

The Attention Solution

The Transformer replaces the recurrent structure with Self-Attention: each position can directly attend to any other position in the sequence, completing global information aggregation in one step.

  • Parallelism: Attention for all positions can be computed simultaneously
  • Long-range: The path length between any two positions is $O(1)$

This design enables the Transformer to far surpass RNNs in long sequence modeling and training efficiency.

Architecture Overview

The diagram below shows the internal structure of a Pre-LayerNorm variant of a Transformer Block. Modern LLMs (GPT-2, LLaMA, etc.) almost all adopt this variant, rather than the Post-LayerNorm from the original paper.

Transformer Block ×N: Input Embeddings (B, S, H) → Positional Encoding → LayerNorm → Multi-Head Self-Attention → ⊕ Residual Add → LayerNorm → Feed-Forward (MLP) (B, S, 4H → B, S, H) → ⊕ Residual Add → Output / Next Block, with residual connections around both sublayers.
Pre-LayerNorm Transformer Block Architecture (Common Variant in Modern LLMs)

A complete Transformer model consists of $N$ such Blocks stacked together. The input first goes through Embedding and Positional Encoding, then passes through each Block layer by layer, and finally outputs for downstream tasks (such as next-token prediction in language modeling).

The diagram below shows the complete architecture of a Decoder-only LLM like GPT: the full pipeline from input text to next-token prediction, using the GPT-2 Small configuration as an example.

Input Text → Tokenizer (B, S) → Token Embedding [V×H] (B, S, 768) → + Positional Encoding (learned absolute) → ×12 Transformer Blocks → Final LayerNorm → LM Head (Linear) (B, S, 50257) → Softmax → Next Token. Block internals: LN → Self-Attention → ⊕ Residual → LN → FFN (MLP, GELU) → ⊕ Residual.

Example configuration (GPT-2 Small): 12 layers, hidden dim 768, vocab size 50,257, context length 1,024, ~117M total parameters.
Complete Decoder-only LLM Architecture — Full data flow from input text to next token prediction

Does one Transformer Block count as “one layer”? It depends on context. When the industry refers to “layers” (e.g., LLaMA-7B has 32 layers), it means the number of Blocks — each Block internally contains two sub-layers (Self-Attention + FFN), but the entire Block counts as one layer. So “32 layers” = 32 Blocks = 64 sub-layers. The n_layers / num_hidden_layers parameter in papers and code refers to the number of Blocks.

Component Details

Input Embedding + Positional Encoding

Token Embedding maps discrete token IDs to continuous vectors:

$$\mathbf{x}_i = \text{Embed}(\text{token}_i) \in \mathbb{R}^{H}$$

where $H$ is the hidden size. For the entire sequence, the input tensor shape is:

Input: (B = batch, S = seq_len, H = hidden)

Positional Encoding injects positional information for each position. The original Transformer uses fixed sinusoidal/cosine encoding:

$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
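To make the formula concrete, here is a minimal NumPy sketch of the sinusoidal encoding; the function name and shapes are illustrative, not from any particular library:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sin/cos positional encoding from the original paper.

    Returns an array of shape (seq_len, d_model): even dimensions use sin,
    odd dimensions use cos, each pair at a different frequency.
    """
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # 2i values: (1, d_model/2)
    angle_rates = 1.0 / (10000 ** (dims / d_model))   # 1 / 10000^(2i/d_model)
    angles = positions * angle_rates                  # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)   # PE(pos, 2i+1)
    return pe

# Example: GPT-2-sized hidden dimension, short sequence
pe = sinusoidal_positional_encoding(seq_len=8, d_model=768)
print(pe.shape)  # (8, 768)
```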

Modern models have developed various improved approaches:

| Approach | Representative Models | Characteristics |
| --- | --- | --- |
| Sinusoidal/Cosine (fixed) | Original Transformer | Not learnable, supports extrapolation |
| Learnable Absolute Position Encoding | GPT-2, BERT | Directly learn a vector for each position |
| RoPE (Rotary Position Embedding) | LLaMA, Qwen | Encodes relative positions via rotation in Attention computation |
Visual comparison of the three approaches:

  • Sinusoidal (fixed): different frequency waves per dimension; fixed and non-learnable; supports extrapolation to longer sequences.
  • Learned absolute position: learns a vector for each position during training; simple, but extrapolates poorly.
  • RoPE (rotary): encodes relative position through a rotation applied during the Attention computation (adjacent tokens are rotated by θ), balancing absolute position and relative distance; see the sketch below.
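The following is a simplified NumPy sketch of the RoPE idea: it rotates consecutive dimension pairs of a query or key vector by a position-dependent angle. The function name, the base of 10000, and the shapes follow the common formulation but are illustrative only:

```python
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by a position-dependent angle.

    x: (seq_len, head_dim) query or key vectors for one head (head_dim even).
    positions: (seq_len,) integer positions.
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))  # per-pair rotation frequencies
    angles = positions[:, None] * freqs[None, :]        # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)

    x1, x2 = x[:, 0::2], x[:, 1::2]                     # split into dimension pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                  # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Rotated q/k dot products then depend only on relative position
q = apply_rope(np.random.randn(5, 64), np.arange(5))
k = apply_rope(np.random.randn(5, 64), np.arange(5))
print(q.shape, k.shape)  # (5, 64) (5, 64)
```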

LayerNorm

Layer Normalization normalizes the hidden vector for each token:

$$\text{LayerNorm}(\mathbf{x}) = \frac{\mathbf{x} - \mu}{\sigma + \epsilon} \cdot \gamma + \beta$$

where $\mu, \sigma$ are the mean and standard deviation computed along the hidden dimension, and $\gamma, \beta$ are learnable scale and shift parameters.

Pre-LN vs Post-LN:

| Variant | Formula | Characteristics |
| --- | --- | --- |
| Post-LN (original paper) | $\text{LN}(\mathbf{x} + \text{Attn}(\mathbf{x}))$ | LayerNorm is applied after the residual addition |
| Pre-LN (modern variant) | $\mathbf{x} + \text{Attn}(\text{LN}(\mathbf{x}))$ | LayerNorm is applied before the sublayer |
  • The original Transformer (2017) uses Post-LN: $\text{LN}(\mathbf{x} + \text{SubLayer}(\mathbf{x}))$
  • Starting with GPT-2, most models switched to Pre-LN: $\mathbf{x} + \text{SubLayer}(\text{LN}(\mathbf{x}))$, because it offers more stable training
  • LLaMA further replaced LayerNorm with RMSNorm, removing mean centering and bias terms for more efficient computation (see the sketch below)
In Post-LN (original), gradients must pass through each LayerNorm on the residual path; in Pre-LN (modern), the residual shortcut forms a "gradient highway" that bypasses the normalization.
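As a rough comparison, here is a minimal NumPy sketch of LayerNorm versus RMSNorm as described above: both normalize each token's hidden vector along the hidden dimension, but RMSNorm drops the mean centering and bias. Function names are illustrative:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Zero-mean, unit-std normalization per token, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm (LLaMA style): no mean centering, no bias, only rescale by the RMS."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma

H = 768
x = np.random.randn(1, 4, H)                           # (B, S, H)
print(layer_norm(x, np.ones(H), np.zeros(H)).shape)    # (1, 4, 768)
print(rms_norm(x, np.ones(H)).shape)                   # (1, 4, 768)
```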

Self-Attention (Overview)

Self-Attention is the core of the Transformer. Each position interacts with all positions in the sequence through three projections: Query, Key, and Value:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

where $d_k$ is the dimension of each attention head. Multi-Head Attention splits the hidden dimension into multiple heads for parallel computation, then concatenates the outputs:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

The details of Self-Attention (QKV data structures, computation flow, Multi-Head mechanism) will be covered in subsequent articles.
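As a preview of those articles, here is a minimal NumPy sketch of the attention formula above (single head, no mask; names and shapes are illustrative, not a production implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: (..., seq_len, d_k); leading dims can be (batch, heads).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (..., S, S)
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V                               # (..., S, d_k)

# Single-head example: S=4 tokens, d_k=64
S, d_k = 4, 64
Q, K, V = (np.random.randn(S, d_k) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 64)
```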

Feed-Forward Network (MLP)

The FFN in each Transformer Block is a position-wise two-layer fully connected network:

$$\text{FFN}(\mathbf{x}) = \text{GELU}(\mathbf{x} W_1 + b_1) W_2 + b_2$$
  • $W_1 \in \mathbb{R}^{H \times 4H}$: expands the dimension from $H$ to $4H$ (intermediate dimension)
  • $W_2 \in \mathbb{R}^{4H \times H}$: compresses the dimension from $4H$ back to $H$
  • Activation function: The original paper uses ReLU, GPT-2 uses GELU, LLaMA uses SwiGLU

The Evolution of Activation Functions: ReLU → GELU → SwiGLU

The choice of activation function directly impacts FFN expressiveness. LLMs have gone through three generations:

ReLU, GELU, and Swish (SiLU) differ mainly in how smoothly they handle the region around zero.

Swish (SiLU): $f(x) = x \cdot \sigma(x)$
  • Pros: smooth; self-gated, meaning the input itself controls how much passes through; the basis for SwiGLU
  • Cons: negative values in the negative region may introduce noise; on its own, not as good as SwiGLU
  • Used by: the activation inside SwiGLU → LLaMA, Mistral, Gemma

FFN Architecture Comparison

Standard FFN (2 weight matrices): x → W₁ (H→4H) → GELU → W₂ (4H→H) → out. SwiGLU FFN (3 weight matrices): x → W_gate (H→d) → Swish and x → W_up (H→d), multiplied elementwise, then → W_down (d→H) → out.

Why does SwiGLU win? The key is the gating mechanism (Gated Linear Unit). A standard FFN uses 2 weight matrices: $\mathbf{x} W_1$ goes through an activation and is then multiplied by $W_2$. SwiGLU uses 3 weight matrices, with the additional $W_{\text{gate}}$ forming a data-dependent gate:

$$\text{SwiGLU}(\mathbf{x}) = \underbrace{\text{Swish}(\mathbf{x} W_{\text{gate}})}_{\text{gate: controls "how much" passes}} \odot \underbrace{(\mathbf{x} W_{\text{up}})}_{\text{value: provides "what" passes}}$$

This gating allows the network to selectively suppress or amplify information across different dimensions, while ReLU/GELU only apply independent nonlinear transformations per dimension. Shazeer (2020, “GLU Variants Improve Transformer”) demonstrated that SwiGLU significantly outperforms GELU and ReLU in perplexity at the same parameter count.

The parameter trade-off: SwiGLU adds an extra $W_{\text{gate}}$ matrix (3 matrices vs 2), so to maintain the same total parameter count, the FFN intermediate dimension is reduced from $4H$ to approximately $\frac{8}{3}H$. This is why LLaMA-7B's intermediate dimension is 11008 ($\approx \frac{8}{3} \times 4096$) rather than the GPT-style $4 \times 4096 = 16384$.
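A minimal NumPy sketch contrasting the two FFN variants and their parameter counts; the weight names (W_gate / W_up / W_down) follow the convention used above, and the exact LLaMA implementation differs in details:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def swish(x):
    return x / (1 + np.exp(-x))          # x * sigmoid(x), also called SiLU

def ffn_gelu(x, W1, b1, W2, b2):
    """Standard FFN: 2 matrices, H -> 4H -> H."""
    return gelu(x @ W1 + b1) @ W2 + b2

def ffn_swiglu(x, W_gate, W_up, W_down):
    """SwiGLU FFN: 3 matrices, gate * value, H -> d -> H (no biases, LLaMA style)."""
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

# Parameter budgets are roughly equal at LLaMA-7B scale
H = 4096
d_std = 4 * H            # 16384, GPT-style intermediate dim
d_glu = 11008            # ~(8/3)H, LLaMA-7B intermediate dim
params_std = 2 * H * d_std   # W1 + W2
params_glu = 3 * H * d_glu   # W_gate + W_up + W_down
print(params_std, params_glu)   # 134217728 vs 135266304
```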

Tensor shape transformation process:

FFN Input: (B = batch, S = seq_len, H = hidden) → through $W_1$ → Hidden Layer: (B, S, 4H = intermediate) → through $W_2$ → FFN Output: (B, S, H)

Input (B, S, H) → Linear₁ (B, S, 4H) → GELU (B, S, 4H) → Linear₂ (B, S, H): a "diamond" structure that expands, then compresses.

Residual Connection

Each sublayer (Self-Attention and FFN) is wrapped with a residual connection:

$$\text{output} = \mathbf{x} + \text{SubLayer}(\mathbf{x})$$

The role of residual connections:

  1. Mitigate gradient vanishing: Gradients can be directly backpropagated through “skip connections”, avoiding gradient decay in deep networks
  2. Preserve information flow: Even if the sublayer output is zero, the original information is not lost
  3. Enable deep stacking: Without residual connections, training 32 or even 126 layers would be nearly impossible

Note: Residual connections require that the input and output dimensions of the sublayer are the same, which is why the FFN output dimension must return to HH.
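A small sketch of the Pre-LN residual wrapper, illustrating how the skip connection preserves information even when the sublayer contributes nothing; this assumes the Pre-LN form used by modern LLMs, and the names are illustrative:

```python
import numpy as np

def pre_ln_residual(x, sublayer, gamma, eps=1e-5):
    """Pre-LN residual wrapper: output = x + SubLayer(LN(x)).

    `sublayer` is any shape-preserving function (B, S, H) -> (B, S, H),
    e.g. self-attention or the FFN. The skip connection keeps the original
    signal even if the sublayer output is zero.
    """
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    normed = (x - mu) / (sigma + eps) * gamma
    return x + sublayer(normed)          # shapes must match for the addition

H = 768
x = np.random.randn(1, 16, H)
out = pre_ln_residual(x, sublayer=lambda h: np.zeros_like(h), gamma=np.ones(H))
print(np.allclose(out, x))   # True: zero sublayer output, original information preserved
```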

Tensor Shape Tracking

Taking GPT-2 Small ($H = 768$, sequence length $S = 1024$, batch size $B = 1$) as an example, track the complete data flow:

| Stage | Tensor Shape | Description |
| --- | --- | --- |
| Token IDs | (1, 1024) | Integer sequence |
| Token Embedding | (1, 1024, 768) | Vectors from lookup table |
| + Positional Encoding | (1, 1024, 768) | Add positional vectors |
| LayerNorm | (1, 1024, 768) | Shape unchanged |
| Q, K, V Projections | each (1, 1024, 768) | Linear transformation |
| Split into 12 heads | each (1, 12, 1024, 64) | 768 / 12 = 64 per head |
| Attention Output | (1, 1024, 768) | Concatenate all heads |
| Residual Add | (1, 1024, 768) | Add input |
| LayerNorm | (1, 1024, 768) | Shape unchanged |
| FFN Hidden Layer | (1, 1024, 3072) | 768 × 4 = 3072 |
| FFN Output | (1, 1024, 768) | Compress back to original dimension |
| Residual Add | (1, 1024, 768) | Add input |

This process repeats 12 times (12 layers), and the final output tensor of shape (1, 1024, 768) is sent to the LM Head for next-token prediction.
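The shapes in the table can be reproduced with a short NumPy sketch at GPT-2 Small dimensions; the weights here are random placeholders, not the real model:

```python
import numpy as np

# GPT-2 Small configuration (as in the table above)
B, S, H, n_heads = 1, 1024, 768, 12
head_dim = H // n_heads                              # 64

token_ids = np.random.randint(0, 50257, size=(B, S))         # (1, 1024)
embedding_table = np.random.randn(50257, H) * 0.02
x = embedding_table[token_ids]                                # (1, 1024, 768)

# QKV projections and head split
W_qkv = np.random.randn(H, 3 * H) * 0.02
q, k, v = np.split(x @ W_qkv, 3, axis=-1)                     # each (1, 1024, 768)
q = q.reshape(B, S, n_heads, head_dim).transpose(0, 2, 1, 3)  # (1, 12, 1024, 64)

# FFN expansion and compression (ReLU used as a stand-in activation)
W1 = np.random.randn(H, 4 * H) * 0.02                         # 768 -> 3072
W2 = np.random.randn(4 * H, H) * 0.02                         # 3072 -> 768
h = np.maximum(x @ W1, 0) @ W2                                # (1, 1024, 768)

print(token_ids.shape, x.shape, q.shape, h.shape)
# (1, 1024) (1, 1024, 768) (1, 12, 1024, 64) (1, 1024, 768)
```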

Encoder-Decoder vs Decoder-only

Original Transformer: Encoder-Decoder

The original Transformer was designed for machine translation, adopting an Encoder-Decoder structure:

  • Encoder (6 layers): Bidirectional Self-Attention on the source sequence (each position can attend to all positions)
  • Decoder (6 layers):
    • Masked Self-Attention: Can only attend to the current position and previous positions (causal mask), preventing information leakage
    • Cross-Attention: Attends to Encoder output to obtain source sequence information

Decoder-only: The Mainstream for Modern LLMs

The GPT series pioneered the Decoder-only architecture, removing the Encoder and Cross-Attention:

  • Only retains Masked Self-Attention (causal attention)
  • Each token can only see itself and previous tokens
  • Unifies “understanding” and “generation”: uses the same architecture for all tasks

| Structure | Representative Models | Attention Type | Typical Applications |
| --- | --- | --- | --- |
| Encoder-Decoder | Original Transformer, T5, BART | Bidirectional + Causal + Cross | Translation, Summarization |
| Encoder-only | BERT, RoBERTa | Bidirectional | Classification, NLU |
| Decoder-only | GPT series, LLaMA, Qwen | Causal (Masked) | Text Generation, General LLMs |
The three attention patterns compared:

  • Bidirectional (Encoder): all positions are visible to each other; used by BERT and other Encoder models.
  • Causal (Decoder-only): each token can only see itself and previous positions; used by GPT, LLaMA, etc.
  • Cross (Encoder-Decoder): the Decoder can see all Encoder positions; used by the original Transformer and T5.

Currently, almost all mainstream LLMs (GPT-4, Claude, LLaMA, Qwen, Gemini) adopt the Decoder-only architecture.
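A minimal sketch of the causal mask used by Decoder-only models: future positions are set to negative infinity before the softmax, so each token attends only to itself and earlier tokens (illustrative NumPy, not any specific framework's API):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_scores(Q, K):
    """Apply the causal mask to raw attention scores before the softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (S, S)
    mask = causal_mask(scores.shape[0])
    return np.where(mask, scores, -np.inf)     # future positions get -inf

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```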

Comparison of Typical Model Hyperparameters

| Parameter | Original Transformer | GPT-2 Small | GPT-2 XL | LLaMA-7B | LLaMA-3.1-8B |
| --- | --- | --- | --- | --- | --- |
| Hidden Dimension $H$ | 512 | 768 | 1600 | 4096 | 4096 |
| Number of Layers $L$ | 6 (Enc+Dec) | 12 | 48 | 32 | 32 |
| Attention Heads $h$ | 8 | 12 | 25 | 32 | 32 |
| Head Dimension | 64 | 64 | 64 | 128 | 128 |
| FFN Hidden Dimension | 2048 | 3072 | 6400 | 11008 | 14336 |
| Vocabulary Size | 37000 | 50257 | 50257 | 32000 | 128256 |
| Context Length | N/A | 1024 | 1024 | 2048 | 131072 |
| Total Parameters | 65M | 117M | 1.5B | 6.7B | 8B |
| LayerNorm | Post-LN | Pre-LN | Pre-LN | Pre-RMSNorm | Pre-RMSNorm |
| Activation Function | ReLU | GELU | GELU | SwiGLU | SwiGLU |
| Positional Encoding | Sinusoidal/Cosine | Learnable Absolute | Learnable Absolute | RoPE | RoPE |

Note: Head Dimension = $H / h$. The LLaMA series uses GQA (Grouped-Query Attention), where LLaMA-3.1-8B has 8 KV heads (rather than 32); see the sketch below.
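A tiny sketch of the GQA head layout for LLaMA-3.1-8B as listed above: 32 query heads share 8 KV heads, so each KV head is broadcast to a group of 4 query heads (illustrative only):

```python
import numpy as np

# LLaMA-3.1-8B attention head layout (values from the table above)
H, n_q_heads, n_kv_heads = 4096, 32, 8
head_dim = H // n_q_heads                 # 128
group_size = n_q_heads // n_kv_heads      # 4 query heads share each KV head

S = 16
q = np.random.randn(n_q_heads, S, head_dim)    # (32, 16, 128)
k = np.random.randn(n_kv_heads, S, head_dim)   # (8, 16, 128)

# Broadcast each KV head to its group of query heads before the usual attention
k_expanded = np.repeat(k, group_size, axis=0)  # (32, 16, 128)
print(k_expanded.shape)
```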

If you want to dive deeper into the Transformer architecture, here are our curated resources:

Classic Papers

  • Vaswani et al. “Attention Is All You Need” — The foundational paper for the Transformer architecture, proposing self-attention and multi-head attention mechanisms, the source of all subsequent work.
  • Lilian Weng “The Transformer Family Version 2.0” — A systematic overview of various Transformer variants and improvements, covering long context, efficient attention, adaptive modeling, and other directions, with rich references.

Video Courses

  • 3Blue1Brown — Attention in Transformers (Neural Networks series) — Known for beautiful mathematical animations, provides an intuitive explanation of the geometric meaning of the attention mechanism. Chapter 6 focuses on attention, with excellent visual effects.
  • Andrej Karpathy “Neural Networks: Zero to Hero” — Build GPT from scratch, the “Let’s build GPT” episode (about 2 hours) implements the Transformer from the ground up, the best resource for understanding implementation details. Comes with GitHub code and Jupyter Notebook.

Blogs and Tutorials (Illustrated)

  • Jay Alammar “The Illustrated Transformer” — An exemplary work of illustrated content. Through dozens of carefully drawn diagrams, it step-by-step deconstructs the Transformer architecture, including encoder-decoder stacks, Q/K/V computation in self-attention, multi-head attention, positional encoding, etc. Widely recognized as the best illustrated introduction.
  • Jay Alammar “Visualizing Neural Machine Translation” — Prerequisite reading for Illustrated Transformer, illustrating how seq2seq + attention mechanism works, helping understand the origin of attention.
  • Harvard NLP “The Annotated Transformer” — Line-by-line code annotation implementation of the original paper (PyTorch), matching formulas from the paper with code one-to-one, suitable for readers who want to understand the meaning of every line of code.

Interactive Experiments

  • Transformer Explainer (Georgia Tech / Polo Club) — Run GPT-2 model in browser, observe the attention computation process in real-time. Published at IEEE VIS 2024, with excellent interactive experience. (poloclub.github.io/transformer-explainer/)
  • Brendan Bycroft “LLM Visualization” — 3D interactive visualization of GPT inference process, layer-by-layer observation of data flow and matrix operations, with stunning visual effects. (bbycroft.net/llm)
  • Financial Times “Generative AI Explained” — Visual narrative created by FT data journalism team, showcasing how LLMs work with beautiful interactive animations. (ig.ft.com/generative-ai/)

Summary

This article outlines the overall architecture of the Transformer:

  1. Core idea: Replace recurrent structure with Self-Attention, enabling parallel computation and global information aggregation
  2. Block composition: Each Transformer Block contains LayerNorm → Self-Attention → Residual → LayerNorm → FFN → Residual
  3. Modern evolution: From Post-LN to Pre-LN, from LayerNorm to RMSNorm, from ReLU to SwiGLU, from fixed positional encoding to RoPE
  4. Architecture choice: Modern LLMs almost all adopt the Decoder-only architecture

In subsequent articles, we will dive into the details of each component:

  • QKV Data Structures and Intuition: Understanding what Query, Key, and Value really are
  • Attention Computation Details: The complete flow of the $QK^T$ matrix multiplication
  • Multi-Head Attention: The principles and implementation of parallel multi-head computation
  • KV Cache: Key technology for inference acceleration