
Positional Encoding — Giving Transformers a Sense of Order

Updated 2026-04-06

The Self-Attention mechanism in Transformers has a commonly overlooked yet critically important property: permutation equivariance. Shuffling the order of the input tokens merely shuffles the attention outputs in the same way; the scores computed between any pair of tokens are unchanged. In other words, vanilla attention has no awareness of token positions. This means a Transformer without positional encoding cannot distinguish “dog bites man” from “man bites dog.”

Positional Encoding is the solution to this problem. This article starts with Sinusoidal encoding, moves through Learned Embedding and relative position encoding, and ultimately provides a deep dive into RoPE (Rotary Position Embedding) — the dominant scheme used by today’s mainstream LLMs.

Why Positional Encoding Is Needed

Permutation Equivariance of Attention

The Self-Attention computation is:

$$\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q = XW_Q$, $K = XW_K$, $V = XW_V$. If we apply a permutation $\pi$ to the input sequence (i.e., shuffle the token order), with permutation matrix $P$:

$$\text{Attn}(PX) = P \cdot \text{Attn}(X)$$

The output simply follows the permutation — the Attention scores between every pair of tokens remain completely unchanged.

[Figure: attention score matrix for the sentence “The cat sat here”. Without positional encoding, attention scores depend only on token content.]

This means that without positional encoding, “The cat sat here” and “sat here The cat” are equivalent to the model. For language understanding, this is clearly unacceptable — we must explicitly inject position information.
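To make this concrete, here is a minimal NumPy check of the equivariance (a sketch with a single unmasked head; shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 8                          # sequence length, model dim
X = rng.normal(size=(L, d))          # token embeddings (no position info)
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def attn(X):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(d)) @ V

P = np.eye(L)[[2, 0, 3, 1]]          # a permutation matrix
# Attn(PX) == P @ Attn(X): the output just follows the shuffle
print(np.allclose(attn(P @ X), P @ attn(X)))   # True
```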

Absolute Positional Encoding

Sinusoidal Encoding (Vaswani et al. 2017)

The original Transformer uses fixed sine/cosine functions to generate positional encodings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

Each dimension corresponds to a wave of different frequency. Low dimensions have high frequency (change rapidly), while high dimensions have low frequency (change slowly). Each position thus receives a unique “frequency fingerprint”:

[Figure: heatmap of sinusoidal encodings (values in [−1, +1]) over positions 0–62 and dimensions 0–63. Low dimensions (left) vary at high frequency, high dimensions (right) at low frequency; each position has a unique “frequency fingerprint”.]
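A minimal NumPy sketch of the formula above (the function name and interface are my own choices):

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d: int, base: float = 10000.0) -> np.ndarray:
    """Return a (seq_len, d) matrix of sinusoidal positional encodings."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d // 2)[None, :]             # (1, d/2) dimension-pair index
    angle = pos / base ** (2 * i / d)          # (seq_len, d/2)
    pe = np.empty((seq_len, d))
    pe[:, 0::2] = np.sin(angle)                # even dims: sine
    pe[:, 1::2] = np.cos(angle)                # odd dims: cosine
    return pe

pe = sinusoidal_pe(128, 64)
# The encoding is simply added to the token embeddings: x = x + pe[:len(x)]
```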

Pros: No trainable parameters; can theoretically extrapolate to lengths unseen during training.

Cons: Extrapolation performance is limited in practice; has been gradually superseded by other approaches.

Learned Embedding

Another simple approach is to directly train a position embedding table $E \in \mathbb{R}^{L_{max} \times d}$, letting the model learn its own representation for each position.

  • Pros: Simple to implement; typically outperforms Sinusoidal
  • Cons: Sequence length is capped at the $L_{max}$ used in training; cannot handle longer inputs

BERT and GPT-2 both adopted this approach.
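For comparison, the learned table is nothing more than an indexable matrix added to the token embeddings; a minimal sketch (a random, untrained table stands in for the learned one):

```python
import numpy as np

L_max, d = 512, 64
rng = np.random.default_rng(0)
pos_table = rng.normal(scale=0.02, size=(L_max, d))   # trainable E in R^{L_max x d}

def add_learned_pe(x):                 # x: (seq_len, d) token embeddings
    seq_len = x.shape[0]
    assert seq_len <= L_max, "cannot index positions beyond L_max"
    return x + pos_table[:seq_len]     # row i is the embedding of position i
```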

Relative Positional Encoding

Shaw et al. 2018 — The Shift from Absolute to Relative

Core observation: in natural language, the relative distance between tokens often matters more than absolute position. The syntactic relationship between “the” and “cat” is the same whether the phrase appears at the beginning or the end of a sentence.

Shaw et al.’s approach adds learnable relative position biases $a_{ij}^K$ to the Attention scores:

$$e_{ij} = \frac{x_i W_Q (x_j W_K + a_{ij}^K)^T}{\sqrt{d_k}}$$

where $a_{ij}^K$ depends only on the distance $i - j$, clipped to the range $[-K, K]$.

Pros: Naturally supports variable-length sequences; small parameter count (only $2K + 1$ bias vectors needed).
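A NumPy sketch of how the clipped relative-position biases enter the attention logits (names and the clipping window are illustrative, and only the key-side bias from the formula above is shown):

```python
import numpy as np

def relative_logits(Q, K_mat, a_K, clip_K):
    """Shaw-style attention logits with learnable relative-position biases.

    Q, K_mat: (L, d_k) projected query/key matrices
    a_K:      (2*clip_K + 1, d_k) bias vectors, one per clipped distance
    """
    L, d_k = Q.shape
    dist = np.arange(L)[:, None] - np.arange(L)[None, :]   # i - j
    dist = np.clip(dist, -clip_K, clip_K) + clip_K          # map to [0, 2K]
    rel = a_K[dist]                                          # (L, L, d_k)
    # e_ij = q_i · (k_j + a_ij) / sqrt(d_k)
    return (Q @ K_mat.T + np.einsum("id,ijd->ij", Q, rel)) / np.sqrt(d_k)
```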

ALiBi (Press et al. 2022) — Extreme Simplification

ALiBi takes a more radical approach: no position embedding at all — instead, it directly subtracts a linear distance penalty from the Attention scores:

$$\text{score}_{ij} = q_i \cdot k_j - m \cdot |i - j|$$

where $m$ is a fixed slope that differs per attention head (set in geometric progression, e.g., $m \in \{2^{-1}, 2^{-2}, \ldots, 2^{-H}\}$).

  • Pros: Zero trainable parameters, excellent extrapolation ability, extremely simple implementation
  • Cons: Can only express a monotonic distance decay, which limits the positional patterns it can represent

BLOOM and MPT adopted ALiBi.
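A sketch of the ALiBi penalty (NumPy; the slope schedule follows the geometric-progression example above rather than the exact per-head formula from the paper):

```python
import numpy as np

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """Return (n_heads, seq_len, seq_len) additive biases: -m_h * |i - j|."""
    slopes = 2.0 ** -np.arange(1, n_heads + 1)             # m in {2^-1, ..., 2^-H}
    dist = np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :])
    return -slopes[:, None, None] * dist                    # broadcast per head

# scores: (n_heads, L, L) raw q·k logits; add the bias before softmax:
# scores = scores + alibi_bias(n_heads, L)
```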

RoPE — Rotary Position Embedding

RoPE (Su et al. 2021) is the most widely used positional encoding scheme today, adopted by LLaMA, Qwen, GPT-NeoX, and many other models.

Core Intuition

RoPE’s core idea is remarkably elegant: treat each pair of adjacent dimensions $(d_{2i}, d_{2i+1})$ as a vector in a 2D plane, then rotate this vector by an angle that depends on the token’s position.

  • Position $m$ corresponds to rotation angle $m\theta_i$
  • Different dimension pairs use different base angles $\theta_i$ (similar to Sinusoidal’s multi-frequency concept)
  • After rotation, the dot product of Q and K for two tokens depends only on the relative distance $m - n$
[Figure: a Q vector’s dimension pair (d₂ᵢ, d₂ᵢ₊₁) viewed as a vector on the 2D plane; at position 0 the rotation angle is 0.]
Dimension-Pair Frequency Decomposition

Each dimension pair $(d_{2i}, d_{2i+1})$ has base angle $\theta_i = 10000^{-2i/d}$. This formula means:

  • Low dimensions (small $i$) = high-frequency rotation: a small change in position causes a large change in angle → captures local relationships
  • High dimensions (large $i$) = low-frequency rotation: position must change significantly for the angle to change noticeably → captures long-range relationships

This shares the same multi-frequency philosophy as Sinusoidal encoding, but implemented through rotation.
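For instance, with $d = 64$ the base angles span several orders of magnitude (a small sketch):

```python
import numpy as np

d = 64
i = np.arange(d // 2)
theta = 10000.0 ** (-2 * i / d)     # θ_i, one per dimension pair
print(theta[0], theta[-1])          # 1.0 (high freq) vs ≈ 1.3e-4 (low freq)
```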

[Figure: heatmap of RoPE rotation angles $m\theta_i$ (mod 2π) per dimension pair, d_model = 64; angles change fast for i = 0 (high frequency) and slowly for i = 31 (low frequency).]

In the heatmap, the left side (low dimensions) shows rapid color changes — high-frequency signals; the right side (high dimensions) shows slow color changes — low-frequency signals. This multi-frequency encoding lets the model perceive both short-range and long-range positional relationships simultaneously.

Mathematical Derivation

For each dimension pair $(2i, 2i+1)$, define the rotation matrix:

$$R_{\theta_i}(m) = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}$$

where $\theta_i = 10000^{-2i/d}$. Apply the rotation to Q and K respectively:

$$\tilde{q}_m = R_{\theta}(m) \, q_m, \quad \tilde{k}_n = R_{\theta}(n) \, k_n$$

Key property — the dot product after rotation depends only on relative position:

$$\tilde{q}_m^T \tilde{k}_n = q_m^T R_{\theta}(m)^T R_{\theta}(n) \, k_n = q_m^T R_{\theta}(n - m) \, k_n$$

This follows from the rotation matrix property $R(\alpha)^T R(\beta) = R(\beta - \alpha)$.
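A quick NumPy check of this property for a single dimension pair (the angle, positions, and vectors are arbitrary):

```python
import numpy as np

def rot(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
theta, m, n = 0.01, 7, 42                           # one frequency, two positions

lhs = (rot(m * theta) @ q) @ (rot(n * theta) @ k)   # dot product after rotation
rhs = q @ (rot((n - m) * theta) @ k)                # depends only on n - m
print(np.allclose(lhs, rhs))                        # True
```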

Complex Number Perspective

RoPE has an equivalent complex-number representation that is more intuitive and efficient: treat each dimension pair $(q_{2i}, q_{2i+1})$ as a complex number $q_{2i} + q_{2i+1} \cdot j$.

Rotation is simply complex multiplication:

$$\tilde{q} = q \cdot e^{im\theta}$$

Key derivation — the dot product after rotation depends only on relative position:

$$\tilde{q}_m \cdot \overline{\tilde{k}_n} = q \cdot \bar{k} \cdot e^{i(m-n)\theta}$$

At the implementation level, this means no matrix multiplication is needed — just element-wise cos/sin operations — which is very efficient.
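A sketch of that element-wise application in NumPy (the function name and interface are mine; real implementations typically also cache the cos/sin tables):

```python
import numpy as np

def apply_rope(x: np.ndarray, pos: np.ndarray, base: float = 10000.0):
    """Rotate (seq_len, d) vectors by position-dependent angles.

    Equivalent to multiplying each dimension pair, viewed as a complex
    number, by e^{i · pos · θ_i}.
    """
    seq_len, d = x.shape
    theta = base ** (-2 * np.arange(d // 2) / d)      # (d/2,) frequencies θ_i
    ang = pos[:, None] * theta[None, :]               # (seq_len, d/2) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]                   # dim pairs (2i, 2i+1)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                # real part
    out[:, 1::2] = x1 * sin + x2 * cos                # imaginary part
    return out

# q, k: (seq_len, d_head); rotate both with their own positions:
# q_rot = apply_rope(q, np.arange(len(q))); k_rot = apply_rope(k, np.arange(len(k)))
```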

[Figure: complex-plane representation; adjacent dimensions (q₂ᵢ, q₂ᵢ₊₁) viewed as the complex number q₂ᵢ + q₂ᵢ₊₁·j.]

Length Extrapolation

RoPE’s positional angles during training are bounded. When inference sequences exceed the training length, the high-frequency components’ angle values go beyond the range seen during training, causing abnormal attention scores and degraded model performance.

The mainstream solutions are:

  • NTK-aware scaling: Modify the frequency base as $b' = b \cdot s^{d/(d-2)}$ (where $s$ is the scaling factor), “compressing” high-frequency components back into the training range; see the sketch after this list
  • YaRN (Yet another RoPE extensioN): A hybrid strategy that applies different scaling factors to different frequency components: high frequencies are kept unchanged (local relationships don’t need extrapolation), while low frequencies are scaled (long-range relationships need adaptation for longer contexts)
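A minimal sketch of the NTK-aware base adjustment (YaRN’s per-frequency interpolation is omitted; names and example numbers are illustrative):

```python
def ntk_scaled_base(base: float, scale: float, d: int) -> float:
    """NTK-aware scaling of the RoPE frequency base: b' = b * s^(d / (d - 2))."""
    return base * scale ** (d / (d - 2))

# Example: a model trained with base 10000 and a 4k context, extended 4x (s = 4)
new_base = ntk_scaled_base(10000.0, 4.0, 128)   # ≈ 10000 * 4^(128/126)
```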

The visualization below shows how different scaling methods map out-of-range angles back into the training interval:

[Figure: maximum angle coverage per dimension pair (i = 0…15) at training length 4096 with no scaling; high-frequency pairs produce angles beyond the training range, which may degrade performance.]

These methods can extend RoPE’s effective context to several times, or even tens of times, the training length, enabling models like LLaMA to handle documents of 100K+ tokens.

Comparison Summary

| Method | Type | Trainable Params | Extrapolation | Compute Cost | Representative Models |
|---|---|---|---|---|---|
| Sinusoidal | Absolute | None | Limited | Very Low | Transformer (original) |
| Learned | Absolute | L_max × d | None | Very Low | BERT, GPT-2 |
| Shaw (Relative) | Relative | 2K+1 biases | Medium | Medium | Transformer-XL |
| ALiBi | Relative | None | Strong | Very Low | BLOOM, MPT |
| RoPE | Relative | None | Medium→Strong | Low | LLaMA, GPT-NeoX, Qwen |

Selection advice:

  • Training a new model from scratch → RoPE (current mainstream choice)
  • Need extreme simplicity + strong extrapolation → ALiBi
  • Legacy model compatibility → Follow the original scheme (Learned / Sinusoidal)