
From Text to Vectors: Tokenization and Word Embeddings


Updated 2026-04-12

Introduction

The first problem in NLP is that computers cannot process text directly. Neural networks require numerical input: matrix multiplication and gradient descent operate on floating-point numbers, not strings. Any NLP system must therefore first solve a fundamental question: how do we convert text into numbers?

This problem has two layers:

  1. Tokenization: splitting a text string into discrete basic units (tokens)
  2. Embedding: mapping each token to a dense numerical vector

These two steps form the input pipeline of all language models, from Word2Vec to GPT-4. This article starts from intuition and progressively dives into both core techniques.

Intuitive Understanding: Characters vs Words vs Subwords

Suppose we want to process the phrase "unbelievably fast". There are three ways to split it:

| Granularity | Segmentation Result | Token Count | Problem |
| --- | --- | --- | --- |
| Character-level | u, n, b, e, l, i, e, v, a, b, l, y, (space), f, a, s, t | 17 | Sequence too long, each character lacks semantics |
| Word-level | unbelievably, fast | 2 | Huge vocabulary (1M+ for English), cannot cover rare words |
| Subword-level | un, believ, ably, fast | 4 | Balances sequence length and semantics |

Modern NLP almost universally uses subword tokenization. The core idea: keep common words intact (e.g., fast), split rare words into meaningful sub-segments (e.g., un + believ + ably).

Byte Pair Encoding (BPE) was introduced to NLP by Sennrich et al. (2016, “Neural Machine Translation of Rare Words with Subword Units”). Its core logic is remarkably simple:

  1. Start at the character level — each character is a token
  2. Count the frequency of all adjacent token pairs
  3. Merge the most frequent pair into a new token
  4. Repeat steps 2-3 until reaching the target vocabulary size

The visualization below shows the step-by-step BPE merge process:

[Interactive demo: BPE merges on the toy corpus "low lowest lower". Step 0 starts from the character-level split (16 tokens); a pair-frequency table (e.g. "lo": 3, "ow": 3) drives each merge, and the merge history updates step by step.]
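
The four merge steps above can be sketched in a few lines of Python. This is a toy illustration only: it learns merges within each word and ignores the cross-word-boundary pairs shown in the demo, whereas production tokenizers (Hugging Face tokenizers, tiktoken) use far more efficient data structures.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count frequencies of adjacent symbol pairs across all words."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, vocab):
    """Merge every standalone occurrence of `pair` into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def train_bpe(corpus, num_merges):
    # Step 1: start at the character level (symbols separated by spaces)
    vocab = Counter(" ".join(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair_counts = get_pair_counts(vocab)     # Step 2: count adjacent pairs
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]  # Step 3: merge the most frequent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)                      # Step 4: repeat until budget is spent
    return merges

# The toy corpus from the visualization above
print(train_bpe("low lowest lower", num_merges=4))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e'), ('lowe', 's')]
```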

The Tokenizer Family

BPE is not the only subword tokenization algorithm. Here are four mainstream approaches:

| Algorithm | Representative Models | Core Idea | Special Markers |
| --- | --- | --- | --- |
| BPE | GPT-2, GPT-4 (tiktoken) | Bottom-up merging of most frequent pairs | No special prefix |
| WordPiece | BERT, DistilBERT | Bottom-up merge, maximize likelihood | ## marks subword continuation |
| Unigram (SentencePiece) | T5, LLaMA, Qwen | Top-down pruning, probabilistic model selects optimal split | ▁ marks word boundary |
| Tiktoken | GPT-3.5, GPT-4 | BPE variant, byte-level | Operates directly on UTF-8 bytes |

Key differences:

  • BPE builds vocabulary through greedy frequency-based merging. Tiktoken, used by the GPT series, is its efficient byte-level implementation.
  • WordPiece also merges, but selects pairs that “maximize the likelihood of the training data” rather than simple frequency.
  • SentencePiece/Unigram takes the opposite approach: starts from a large vocabulary and progressively removes low-probability subwords until the vocabulary shrinks to the target size. It is language-independent — no pre-tokenization needed, directly processing raw strings, making it especially friendly for Chinese, Japanese, and other languages without spaces.

Different tokenizers can produce dramatically different results for the same sentence:

[Interactive demo: tokenizing "The cat sat on the mat." with three tokenizers. GPT-2 BPE (vocab ~50k), BERT WordPiece (vocab ~30k), and SentencePiece (vocab ~32k) each split it into 7 tokens. Note: the ## prefix indicates a WordPiece subword continuation; the ▁ prefix indicates a SentencePiece word boundary.]

Why does vocabulary size matter? A larger vocabulary means common words are more likely to be preserved as whole tokens (fewer tokens), but the embedding layer parameters also grow ($V \times d$, where $V$ is the vocabulary size and $d$ is the embedding dimension). LLaMA’s vocabulary grew from 32,000 to 128,256 in LLaMA-3, specifically to better cover multilingual text.
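
As a back-of-the-envelope check of that $V \times d$ cost, here is the embedding-table parameter count for a 32k versus a 128k vocabulary at an assumed hidden dimension of 4096 (an illustrative figure, not an official model config):

```python
# Embedding-table size is simply V * d parameters.
def embedding_params(vocab_size: int, dim: int) -> int:
    return vocab_size * dim

print(f"{embedding_params(32_000, 4096):,}")   # 131,072,000  (LLaMA-style 32k vocab)
print(f"{embedding_params(128_256, 4096):,}")  # 525,336,576  (Llama-3-style 128k vocab)
```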

From Tokens to Vectors

With tokens in hand, we need to map each token to a numerical vector. The most naive approach is one-hot encoding: with a vocabulary of size $V$, each token is represented by a vector of length $V$ with a 1 at the corresponding position and 0s elsewhere.

one-hot("cat")=[0,0,,1,,0]RV\text{one-hot}(\text{"cat"}) = [0, 0, \ldots, 1, \ldots, 0] \in \mathbb{R}^V

This has two fatal problems:

  1. Curse of dimensionality: GPT-4’s vocabulary is roughly 100,000 — each token becomes a 100,000-dimensional sparse vector
  2. No semantics: The one-hot vectors for cat and dog are orthogonal — cosine similarity is 0, completely masking the fact that both are animals
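
A two-line NumPy check makes the second problem concrete: the one-hot vectors of any two distinct words are orthogonal, so their cosine similarity is exactly zero (the vocabulary indices for "cat" and "dog" below are hypothetical):

```python
import numpy as np

V = 100_000                        # roughly GPT-4-scale vocabulary
cat = np.zeros(V); cat[7] = 1.0    # hypothetical index for "cat"
dog = np.zeros(V); dog[42] = 1.0   # hypothetical index for "dog"

cosine = cat @ dog / (np.linalg.norm(cat) * np.linalg.norm(dog))
print(cosine)  # 0.0 -- any two distinct one-hot vectors are orthogonal
```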

The Distributional Hypothesis (Harris, 1954) offers a better approach, later summarized by Firth as “You shall know a word by the company it keeps.” If cat and dog frequently appear in similar contexts (“The ___ is sleeping”, “I fed my ___”), their vector representations should be similar.

This is the core idea of word embeddings: map each token to a low-dimensional dense vector space (typically 100-300 dimensions), so that semantically similar words are close in vector space.

Word2Vec

Word2Vec was proposed by Mikolov et al. (2013, “Efficient Estimation of Word Representations in Vector Space”) as a groundbreaking method. It has two architectures:

Skip-gram

Given a center word, predict its context words. The training objective maximizes:

$$\max \frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$$

where $T$ is the total number of words in the corpus and $c$ is the context window size. The probability is computed via softmax:

$$p(w_O \mid w_I) = \frac{\exp(\mathbf{v}'_{w_O} \cdot \mathbf{v}_{w_I})}{\sum_{w=1}^{W} \exp(\mathbf{v}'_{w} \cdot \mathbf{v}_{w_I})}$$

where $\mathbf{v}_{w_I}$ is the input (center word) vector and $\mathbf{v}'_{w_O}$ is the output (context word) vector.

The visualization below shows Skip-gram’s sliding window and training process:

[Interactive demo: Skip-gram’s sliding window over "the quick brown fox jumps over the lazy dog". With "brown" as the center word, the training pairs are (brown, the), (brown, quick), (brown, fox), (brown, jumps). The simplified network maps a V-dimensional one-hot input through a d-dimensional hidden layer (the word vector) to a V-dimensional softmax output, via weight matrices W (V × d) and W' (d × V).]
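
The pair-extraction step is easy to sketch in Python. The snippet below slides a window of size $c = 2$ over the same example sentence and prints the training pairs generated when "brown" is the center word, matching the visualization:

```python
sentence = "the quick brown fox jumps over the lazy dog".split()
c = 2  # context window size

pairs = []
for t, center in enumerate(sentence):
    for j in range(-c, c + 1):
        if j != 0 and 0 <= t + j < len(sentence):
            pairs.append((center, sentence[t + j]))

print([p for p in pairs if p[0] == "brown"])
# [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
```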

CBOW (Continuous Bag of Words)

The reverse of Skip-gram: given context words, predict the center word. CBOW averages the context word vectors as input — it trains faster but performs slightly worse on rare words.

Practical Optimization Tricks

The original softmax denominator requires summing over the entire vocabulary ($W$ can be tens to hundreds of thousands), which is computationally expensive. Two common acceleration methods:

  • Negative Sampling: Instead of computing the full softmax, only distinguish between “true context words” and “randomly sampled negatives”
  • Hierarchical Softmax: Organizes the vocabulary as a Huffman tree, reducing $O(W)$ computation to $O(\log W)$
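
For concreteness, here is a minimal PyTorch sketch of the Skip-gram negative-sampling loss for a single (center, context) pair. The vocabulary size, embedding dimension, and number of negatives $k$ are illustrative, and negatives are drawn uniformly rather than from the unigram^(3/4) distribution used in the original paper:

```python
import torch
import torch.nn.functional as F

V, d, k = 10_000, 100, 5                 # illustrative sizes
in_embed = torch.nn.Embedding(V, d)      # center-word vectors  v_w
out_embed = torch.nn.Embedding(V, d)     # context-word vectors v'_w

center = torch.tensor([3])               # one (center, context) training pair
context = torch.tensor([17])
negatives = torch.randint(0, V, (1, k))  # k uniformly sampled negative words

v_c = in_embed(center)                                         # (1, d)
pos_score = (out_embed(context) * v_c).sum(-1)                 # (1,)
neg_score = (out_embed(negatives) * v_c.unsqueeze(1)).sum(-1)  # (1, k)

# Maximize log sigmoid(pos) + sum log sigmoid(-neg); minimize its negation
loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(-1)).mean()
print(loss.item())
```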

After training, the rows of the input-to-hidden weight matrix are the word embeddings we want. In this vector space, semantic relationships are encoded as vector arithmetic:

[Interactive demo: analogy arithmetic in embedding space, e.g. king - man + woman ≈ queen (gender relation), with selectable analogy families: royalty, gender, country-capital, animal, verb, food, size.]

The classic analogy $\text{king} - \text{man} + \text{woman} \approx \text{queen}$ demonstrates that Word2Vec learns linear structure for semantic dimensions like gender and country-capital relations.
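
You can reproduce this analogy with pretrained vectors via gensim, assuming it is installed (the first call downloads the model). The snippet uses the lightweight glove-wiki-gigaword-100 vectors for speed; the same call works with the much larger word2vec-google-news-300 model:

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # 100-dim pretrained vectors
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Typically ranks 'queen' first; exact neighbours and scores depend on the model
```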

GloVe

GloVe (Global Vectors, Pennington et al. 2014) takes a different approach: directly leveraging global word-word co-occurrence statistics.

The core idea: if two words $i, j$ co-occur $X_{ij}$ times in the corpus, their word vector dot product should approximately equal the log of the co-occurrence frequency. The objective function is:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(\mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

where $f(X_{ij})$ is a weighting function — high-frequency co-occurrences should not dominate training, and low-frequency ones should not be ignored.
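
Here is a small sketch of that weighting function and a single term of the objective, using the $x_{\max} = 100$ and $\alpha = 0.75$ values reported in the GloVe paper; the random vectors stand in for trained parameters:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights rare pairs, caps very frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_term(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
    """One term of J: f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    return f(x_ij) * (w_i @ w_j_tilde + b_i + b_j_tilde - np.log(x_ij)) ** 2

rng = np.random.default_rng(0)
d = 50
w_i, w_j = rng.normal(size=d), rng.normal(size=d)  # placeholders for trained vectors
print(glove_term(w_i, w_j, 0.0, 0.0, x_ij=25.0))
```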

GloVe vs Word2Vec:

  • Word2Vec is a local method: trains via sliding windows one at a time
  • GloVe is a global method: first builds the complete co-occurrence matrix, then performs matrix factorization
  • In practice, both produce comparable results, but GloVe training is easier to parallelize

The Limitations of Static Embeddings

Word2Vec and GloVe are both static embeddings: each word has only one fixed vector representation, regardless of context. This leads to a fundamental problem — polysemy.

Consider the word “bank”: in “river bank” it means a riverside, in “bank account” it means a financial institution. But in Word2Vec, both meanings are compressed into a single vector, losing disambiguating information.

[Interactive demo: the polysemous word "bank" in three sentences: "I deposited money in the bank." (financial institution), "We sat on the river bank." (riverside), "The bank approved my loan." (lending). In the embedding-space panel, each contextual "bank" vector lands in a different region, near finance/money/deposit, river/shore/water, and loan/credit/approve respectively. A contextual embedding maps the same word to different vectors in different contexts, near semantically similar words.]

This limitation directly drove the development of contextual embeddings. ELMo (2018) first used bidirectional LSTMs to generate context-dependent vectors for each word. BERT (2018) then took this idea further with Transformer Self-Attention — every token vector at every layer incorporates information from the entire sequence.
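
A quick sanity check of this claim with the transformers library (assuming it and PyTorch are installed, and that bert-base-uncased can be downloaded): extract the last-layer vector of "bank" in two of the sentences above and compare them.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Last-layer hidden state of the 'bank' token in the given sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    return hidden[inputs["input_ids"][0].tolist().index(bank_id)]

v_river = bank_vector("We sat on the river bank.")
v_money = bank_vector("I deposited money in the bank.")
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
# Noticeably below 1.0: the same word gets different vectors in different contexts
```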

| Feature | Static Embeddings (Word2Vec/GloVe) | Contextual Embeddings (BERT/GPT) |
| --- | --- | --- |
| Same word’s vector | Fixed, identical | Varies with context |
| Polysemy handling | All meanings blended into one vector | Different meanings map to different positions |
| Training data usage | Local windows or co-occurrence matrix | Full sequence context |
| Vector dimensions | 100-300 | 768-4096 |
| Parameter count | Millions | Hundreds of millions to trillions |

Summary

This article covered the two core steps of the NLP input pipeline:

  1. Tokenization: Subword tokenization (BPE/WordPiece/SentencePiece) balances vocabulary size and semantic granularity. BPE builds vocabulary through greedy merging and has become the standard for the GPT series.

  2. Word Embedding: From one-hot’s curse of dimensionality to the distributional hypothesis’s dense vectors. Word2Vec learns word vectors from local context using shallow neural networks; GloVe works from global co-occurrence matrices. Both produce striking linear analogy structures.

  3. From Static to Contextual: Static embeddings cannot handle polysemy, driving the contextual embedding revolution from ELMo to BERT to GPT.

In subsequent articles, we will dive into BERT and GPT’s architecture design — understanding how they generate context-dependent token representations, and why the Decoder-only architecture ultimately won.