Positional Encoding — Giving Transformers a Sense of Order
Updated 2026-04-06
The Self-Attention mechanism in Transformers has a commonly overlooked yet critically important property: Permutation Invariance. Shuffling the order of input tokens merely reorders the Attention output in the same way — in other words, vanilla Attention has no awareness of token positions. This means a Transformer without positional encoding cannot distinguish “dog bites man” from “man bites dog.”
Positional Encoding is the solution to this problem. This article starts with Sinusoidal encoding, moves through Learned Embedding and relative position encoding, and ultimately provides a deep dive into RoPE (Rotary Position Embedding) — the dominant scheme used by today’s mainstream LLMs.
Why Positional Encoding Is Needed
Permutation Invariance of Attention
The Self-Attention computation is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q = XW_Q$, $K = XW_K$, $V = XW_V$. If we apply a permutation to the input sequence (i.e., shuffle the token order), with permutation matrix $P$:

$$\mathrm{Attention}(PQ, PK, PV) = \mathrm{softmax}\!\left(\frac{(PQ)(PK)^\top}{\sqrt{d_k}}\right)PV = P \cdot \mathrm{Attention}(Q, K, V)$$
The output simply follows the permutation — the Attention scores between every pair of tokens remain completely unchanged.
Without positional encoding, attention scores depend only on token content
This means that without positional encoding, “The cat sat here” and “sat here The cat” are equivalent to the model. For language understanding, this is clearly unacceptable — we must explicitly inject position information.
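A quick way to see this is to run attention on a shuffled input and compare. The sketch below (PyTorch, using Q = K = V = X with no projection weights, purely for illustration) checks that the output of the shuffled sequence is just the shuffled output of the original sequence:

```python
import torch

torch.manual_seed(0)

def attention(q, k, v):
    # Plain scaled dot-product attention, with no positional information.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

seq_len, d = 5, 8
x = torch.randn(seq_len, d)        # token representations (content only)
perm = torch.randperm(seq_len)     # shuffle the token order

out = attention(x, x, x)
out_shuffled = attention(x[perm], x[perm], x[perm])

# The output simply follows the permutation: out_shuffled == out[perm]
print(torch.allclose(out_shuffled, out[perm], atol=1e-6))  # True
```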
Absolute Positional Encoding
Sinusoidal Encoding (Vaswani et al. 2017)
The original Transformer uses fixed sine/cosine functions to generate positional encodings:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
Each dimension corresponds to a wave of different frequency. Low dimensions have high frequency (change rapidly), while high dimensions have low frequency (change slowly). Each position thus receives a unique “frequency fingerprint”:
Low dimensions (left) vary at high frequency, high dimensions (right) at low frequency — each position has a unique "frequency fingerprint"
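As a reference, here is a minimal sketch of the sinusoidal table (assuming an even `d_model`; the function name is illustrative):

```python
import torch

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sine/cosine positional encodings, shape (seq_len, d_model)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    inv_freq = 1.0 / (10000 ** (i / d_model))                       # 10000^(-2i/d)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)
    pe[:, 1::2] = torch.cos(pos * inv_freq)
    return pe

# Added to the token embeddings before the first layer:
# x = token_embeddings + sinusoidal_pe(seq_len, d_model)
```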
Pros: No trainable parameters; can theoretically extrapolate to lengths unseen during training.
Cons: Extrapolation performance is limited in practice; has been gradually superseded by other approaches.
Learned Embedding
Another simple approach is to directly train a position embedding table $P \in \mathbb{R}^{L_{\max} \times d}$, letting the model learn its own representation for each position.
- Pros: Simple to implement; typically outperforms Sinusoidal
- Cons: Sequence length is capped at the $L_{\max}$ used during training; cannot handle longer inputs
BERT and GPT-2 both adopted this approach.
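A minimal sketch of this approach (the class name and shapes are illustrative, not taken from any particular codebase):

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Trainable position table of shape (max_len, d_model)."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token embeddings; requires seq_len <= max_len
        positions = torch.arange(x.shape[1], device=x.device)
        return x + self.pos_emb(positions)
```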
Relative Positional Encoding
Shaw et al. 2018 — The Shift from Absolute to Relative
Core observation: in natural language, the relative distance between tokens often matters more than absolute position. The syntactic relationship between “the” and “cat” is the same whether the phrase appears at the beginning or the end of a sentence.
Shaw et al.’s approach adds learnable relative position biases to the Attention scores:

$$e_{ij} = \frac{q_i \left(k_j + a^{K}_{j-i}\right)^\top}{\sqrt{d_k}}$$

where $a^{K}_{j-i}$ depends only on the value of $j - i$ (the relative distance), clipped to the range $[-K, K]$ (so only $2K{+}1$ bias vectors are learned).
Pros: Naturally supports variable-length sequences; small parameter count (only bias vectors needed).
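Below is a rough sketch of this key-side formulation for a single head (names like `rel_k` and `max_rel` are illustrative); the bias table is indexed by the clipped distance j − i:

```python
import torch

def relative_attention_scores(q, k, rel_k, max_rel: int):
    """
    Attention scores with Shaw-style relative position vectors added to the keys.
    q, k:   (seq_len, d_k)
    rel_k:  (2 * max_rel + 1, d_k) learnable table; index 0 corresponds to distance -max_rel
    """
    seq_len, d_k = q.shape
    pos = torch.arange(seq_len)
    # relative distance j - i for every (i, j) pair, clipped to [-max_rel, max_rel]
    dist = (pos[None, :] - pos[:, None]).clamp(-max_rel, max_rel) + max_rel
    a_k = rel_k[dist]                                     # (seq_len, seq_len, d_k)
    scores = q @ k.T + torch.einsum("id,ijd->ij", q, a_k)
    return scores / d_k ** 0.5
```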
ALiBi (Press et al. 2022) — Extreme Simplification
ALiBi takes a more radical approach: no position embedding at all — instead, it directly subtracts a linear distance penalty from the Attention scores:

$$a_{ij} = \frac{q_i k_j^\top}{\sqrt{d_k}} - m \cdot (i - j), \qquad j \le i$$

where $m$ is a fixed slope that differs per attention head (set in geometric progression, e.g., $\tfrac{1}{2}, \tfrac{1}{4}, \ldots, \tfrac{1}{256}$ for 8 heads).
- Pros: Zero trainable parameters, excellent extrapolation ability, extremely simple implementation
- Cons: Only has distance decay as a pattern, limited expressiveness
BLOOM and MPT adopted ALiBi.
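A minimal sketch of the bias computation (the helper name is illustrative; the slopes follow the geometric progression above):

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear distance penalties, shape (num_heads, seq_len, seq_len)."""
    # Geometric slopes: 2^(-8/n), 2^(-16/n), ...  (1/2, 1/4, ..., 1/256 for 8 heads)
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).clamp(min=0)    # causal: only penalize j <= i
    return -slopes[:, None, None] * dist                 # add this to the attention scores

# scores = q @ k.transpose(-2, -1) / d_k**0.5 + alibi_bias(num_heads, seq_len)
```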
RoPE — Rotary Position Embedding
RoPE (Su et al. 2021) is the most widely used positional encoding scheme today, adopted by LLaMA, Qwen, GPT-NeoX, and many other models.
Core Intuition
RoPE’s core idea is remarkably elegant: treat each pair of adjacent dimensions as a vector in a 2D plane, then rotate this vector based on the token’s position.
- Position $m$ corresponds to rotation angle $m\theta_i$
- Different dimension pairs use different base angles $\theta_i$ (similar to Sinusoidal’s multi-frequency concept)
- After rotation, the dot product of Q and K for two tokens depends only on the relative distance
View Q vector dimension pair (d₂ᵢ, d₂ᵢ₊₁) as vector on 2D plane
Dimension-Pair Frequency Decomposition
Each dimension pair $i$ has base angle $\theta_i = 10000^{-2i/d}$. This formula means:
- Low dimensions (small ) = high-frequency rotation: A small change in position causes a large change in angle → captures local relationships
- High dimensions (large ) = low-frequency rotation: Position must change significantly for the angle to change noticeably → captures long-range relationships
This shares the same multi-frequency philosophy as Sinusoidal encoding, but implemented through rotation.
In the heatmap, the left side (low dimensions) shows rapid color changes — high-frequency signals; the right side (high dimensions) shows slow color changes — low-frequency signals. This multi-frequency encoding lets the model perceive both short-range and long-range positional relationships simultaneously.
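A two-line check of these frequencies (for an illustrative head dimension of 128):

```python
import torch

d = 128
theta = 10000.0 ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # base angles theta_i
print(theta[0].item(), theta[-1].item())  # ~1.0 (fast rotation) vs ~1e-4 (slow rotation)
```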
Mathematical Derivation
For each dimension pair $(x_{2i}, x_{2i+1})$, define the rotation matrix:

$$R(m\theta_i) = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}$$

where $\theta_i = 10000^{-2i/d}$. Apply rotation to Q and K respectively:

$$\tilde{q}_m = R(m\theta_i)\, q_m, \qquad \tilde{k}_n = R(n\theta_i)\, k_n$$

Key property — the dot product after rotation depends only on relative position:

$$\tilde{q}_m^\top \tilde{k}_n = q_m^\top R(m\theta_i)^\top R(n\theta_i)\, k_n = q_m^\top R\big((n - m)\theta_i\big)\, k_n$$

This follows from the rotation matrix property $R(\alpha)^\top R(\beta) = R(\beta - \alpha)$.
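The sketch below applies this rotation to a (seq_len, d) tensor for a single head, pairing dimensions as (0, 1), (2, 3), ... (an interleaved pairing; some implementations instead pair the first and second halves of the vector):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq_len, d), d even."""
    seq_len, d = x.shape
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)      # theta_i
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq   # m * theta_i
    cos, sin = angles.cos(), angles.sin()                                     # (seq_len, d/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]                                           # dimension pairs
    out = torch.empty_like(x)
    # 2D rotation of each pair: (x1, x2) -> (x1 cos - x2 sin, x1 sin + x2 cos)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Q and K are rotated before the dot product; V is left untouched:
# scores = rope(q) @ rope(k).T / (q.shape[-1] ** 0.5)
```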
Complex Number Perspective
RoPE has an equivalent complex number representation that is more intuitive and efficient: treat each dimension pair as a complex number $z_i = x_{2i} + \mathrm{i}\, x_{2i+1}$.

Rotation is simply complex multiplication:

$$\tilde{z}_i(m) = z_i \cdot e^{\mathrm{i}\, m\theta_i}$$

Key derivation — the dot product after rotation depends only on relative position:

$$\langle \tilde{q}_m, \tilde{k}_n \rangle = \mathrm{Re}\!\left[\big(q\, e^{\mathrm{i} m\theta_i}\big)\,\overline{\big(k\, e^{\mathrm{i} n\theta_i}\big)}\right] = \mathrm{Re}\!\left[q\,\overline{k}\; e^{\mathrm{i}(m - n)\theta_i}\right]$$
At the implementation level, this means no matrix multiplication is needed — just element-wise cos/sin operations — which is very efficient.
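A sketch of the same rotation via complex multiplication (matching the interleaved pairing used above; `torch.polar` builds the unit complex numbers e^{i·m·θᵢ}):

```python
import torch

def rope_complex(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Same rotation as rope() above, expressed as complex multiplication."""
    seq_len, d = x.shape
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq
    rot = torch.polar(torch.ones_like(angles), angles)          # e^{i * m * theta_i}
    z = torch.view_as_complex(x.float().reshape(seq_len, d // 2, 2))
    return torch.view_as_real(z * rot).reshape(seq_len, d)
```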
Length Extrapolation
RoPE’s positional angles during training are bounded by the training length. When inference sequences exceed that length, the low-frequency components see angle values beyond the range they encountered during training, causing abnormal attention scores and degraded model performance.
The mainstream solutions are:
- NTK-aware scaling: Modify the frequency base from $10000$ to $10000 \cdot \alpha^{d/(d-2)}$ (where $\alpha$ is the scaling factor), “compressing” the out-of-range low-frequency angles back into the training range
- YaRN (Yet another RoPE extensioN): A hybrid strategy — applies different scaling factors to different frequency components. High frequencies are kept unchanged (local relationships don’t need extrapolation), while low frequencies are scaled (long-range relationships need adaptation for longer contexts)
In effect, these scaling methods map out-of-range angles back into the interval seen during training.
These methods can extend RoPE’s effective length to several times, even tens of times, the training length, enabling models like LLaMA to handle long documents of 100K+ tokens.
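A minimal sketch of the NTK-aware base adjustment, assuming the commonly used form base · α^(d/(d−2)) (the scaling factor α and the helper name are illustrative):

```python
import torch

def ntk_scaled_inv_freq(d: int, alpha: float, base: float = 10000.0) -> torch.Tensor:
    """Frequencies theta_i computed from the NTK-scaled base."""
    scaled_base = base * alpha ** (d / (d - 2))
    return scaled_base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)

# alpha is roughly target_length / training_length; the scaled frequencies replace
# the original theta_i inside rope() above.
```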
Comparison Summary
| Method | Type | Trainable Params | Extrapolation | Compute Cost | Representative Models |
|---|---|---|---|---|---|
| Sinusoidal | Absolute | None | Limited | Very Low | Transformer (original) |
| Learned | Absolute | $L_{\max} \times d$ | None | Very Low | BERT, GPT-2 |
| Shaw (Relative) | Relative | 2K+1 biases | Medium | Medium | Transformer-XL |
| ALiBi | Relative | None | Strong | Very Low | BLOOM, MPT |
| RoPE | Relative | None | Medium→Strong | Low | LLaMA, GPT-NeoX, Qwen |
Selection advice:
- Training a new model from scratch → RoPE (current mainstream choice)
- Need extreme simplicity + strong extrapolation → ALiBi
- Legacy model compatibility → Follow the original scheme (Learned / Sinusoidal)