Sentence Embeddings: From Token-Level to Semantic Retrieval
Updated 2026-04-12
Introduction: From Tokens to Sentences
BERT produces a contextual vector for each token, but many real-world tasks require sentence-level representations — semantic search, text clustering, duplicate detection. How do you compress a token sequence into a single fixed-length vector while preserving semantic information? This is the problem sentence embeddings aim to solve.
The core objective is intuitive: semantically similar sentences should have similar vectors. The standard way to measure “similar” is cosine similarity:

$$\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$
The Problem with Naive Averaging: Anisotropy
The most intuitive approach is to average all token vectors from BERT’s output. However, research has shown that BERT’s raw embedding space suffers from anisotropy — all sentence vectors cluster in a narrow cone-shaped region, causing any two sentences to have high cosine similarity (typically > 0.6), making it impossible to distinguish semantic differences.
The root cause: BERT’s MLM training optimizes token-level prediction and never explicitly learns “semantic distance between sentences.” Sentence vectors from naive mean pooling actually perform worse than simple averages of GloVe static vectors (Reimers & Gurevych, 2019).
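To make the problem concrete, here is a minimal sketch of the naive approach using the Hugging Face transformers library (the bert-base-uncased checkpoint and the example sentences are illustrative choices, not prescriptions): mean-pool the final-layer token vectors while masking out padding, then compare two unrelated sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def mean_pool(sentences):
    """Naive sentence vectors: average BERT's final-layer token embeddings."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_vecs = model(**batch).last_hidden_state      # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)            # ignore padding tokens
    return (token_vecs * mask).sum(1) / mask.sum(1)         # (B, 768)

emb = mean_pool(["The cat sat on the mat.", "Quarterly revenue grew by 12%."])
sim = torch.cosine_similarity(emb[0], emb[1], dim=0)
print(f"unrelated pair, cosine = {sim.item():.2f}")          # typically surprisingly high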
The path forward: We need a dedicated training objective that teaches the model to produce discriminative sentence vectors.
Sentence-BERT: Siamese Network Architecture
Sentence-BERT (SBERT) addresses this problem through a Siamese (twin) network architecture. The core idea: pass two sentences through the same BERT encoder (weight-tied), apply mean pooling to each to get sentence vectors, then train using cosine similarity with a contrastive loss.
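A schematic PyTorch sketch of the Siamese setup, assuming the cosine-similarity regression objective (one of several training objectives described in the SBERT paper; the checkpoint and example pair are placeholders). Both sentences pass through the same encoder and pooling, so the weights are shared by construction:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")     # single, shared encoder

def encode(sentences):
    """Shared BERT encoder + mean pooling -> one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

# One training step on a labeled pair (gold similarity in [0, 1] is a toy value).
sent_a, sent_b, gold = ["A man is playing guitar."], ["Someone plays a guitar."], torch.tensor([0.9])
u, v = encode(sent_a), encode(sent_b)         # both pass through the SAME weights
pred = F.cosine_similarity(u, v)              # predicted similarity
loss = F.mse_loss(pred, gold)                 # regression objective (one SBERT variant)
loss.backward()                               # gradients flow into the shared encoder
```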
SBERT’s key advantage is inference efficiency. The traditional approach feeds both sentences into BERT together (a cross-encoder), so scoring all pairs among N sentences requires N(N-1)/2 full forward passes. SBERT instead pre-computes one vector per sentence, so pairwise comparison reduces to N encoder passes plus cheap dot products (millisecond-level).
For a clustering task with 10,000 sentences: a cross-encoder takes ~65 hours, while SBERT takes ~5 seconds.
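As a sketch of this bi-encoder workflow with the sentence-transformers library (the all-MiniLM-L6-v2 checkpoint is an illustrative choice; any SBERT-style model works the same way): encode every sentence once, then obtain all pairwise similarities from a single matrix product over the normalized vectors.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed checkpoint

corpus = ["A man is playing guitar.",
          "Someone plays a guitar.",
          "The stock market fell sharply today."]

# N forward passes, done once and cached.
emb = model.encode(corpus, normalize_embeddings=True)

# With unit-norm vectors, one matrix product yields all N x N cosine similarities.
sim_matrix = emb @ emb.T
print(sim_matrix.round(2))
```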
Contrastive Learning: InfoNCE Loss
Modern sentence embedding models widely use contrastive learning. The core idea: within a batch, each sentence has one positive pair (a semantically similar sentence), and all other sentences in the batch serve as negatives.
The InfoNCE loss function:

$$\mathcal{L}_i = -\log \frac{\exp\!\left(\mathrm{sim}(h_i, h_i^{+}) / \tau\right)}{\sum_{j=1}^{N} \exp\!\left(\mathrm{sim}(h_i, h_j^{+}) / \tau\right)}$$

Here $\tau$ is the temperature parameter (typically 0.05-0.1), and $h_i$, $h_i^{+}$ form a positive pair. Lower temperature means “sharper” contrast: the model more aggressively separates positives from negatives.
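A compact in-batch InfoNCE sketch in PyTorch, assuming each anchor’s positive sits at the same row index in a second batch of embeddings; every other row then acts as a negative:

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, temperature: float = 0.05):
    """In-batch InfoNCE: anchors[i] and positives[i] form the positive pair,
    positives[j] (j != i) serve as negatives for anchors[i]."""
    a = F.normalize(anchors, dim=-1)        # cosine similarity = dot product of unit vectors
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature          # (B, B) similarity matrix, temperature-scaled
    labels = torch.arange(a.size(0))        # the diagonal holds the positives
    return F.cross_entropy(logits, labels)

# Toy usage with random "sentence embeddings" (batch of 8, dim 768).
loss = info_nce(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```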
A key insight of contrastive learning: larger batch sizes mean more negatives, leading to better representations. This is why modern models like E5 and BGE train with massive batches (65,536+).
Semantic Similarity: Building Intuition
With high-quality sentence embeddings, computing semantic similarity between two sentences becomes simple vector arithmetic. The example below computes cosine similarity for several sentence pairs; note the difference between paraphrases (high scores) and unrelated sentences (low scores):
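A minimal version of such an example with the sentence-transformers library (the checkpoint name and sentences are illustrative; exact scores vary by model):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed checkpoint

pairs = [
    ("How do I reset my password?", "What are the steps to change my password?"),  # paraphrase
    ("How do I reset my password?", "Which plan includes priority support?"),      # related topic
    ("How do I reset my password?", "The recipe calls for two cups of flour."),    # unrelated
]

for a, b in pairs:
    emb = model.encode([a, b], normalize_embeddings=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    print(f"{score:.2f}  {a!r} vs {b!r}")
```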
Characteristics of high-quality embeddings: paraphrase pairs score > 0.85, semantically related but differently phrased pairs 0.6-0.8, completely unrelated pairs < 0.2.
Modern Sentence Embedding Models
After SBERT, the sentence embedding field evolved rapidly:
| Model | Year | Key Innovation | Dimensions |
|---|---|---|---|
| E5 (Microsoft) | 2022 | Weakly-supervised contrastive pre-training + instruction tuning | 1024 |
| BGE (BAAI) | 2023 | Chinese optimization + RetroMAE pre-training | 1024 |
| GTE (Alibaba) | 2023 | Multi-stage training + multi-task learning | 1024 |
| text-embedding-3 (OpenAI) | 2024 | Matryoshka representation learning, variable dimensions | 256-3072 |
Common trends: larger training data (millions to billions of pairs), multi-stage training (pre-train → contrastive learning → instruction tuning), longer context (8192 tokens).
Matryoshka Representation Learning (MRL): Models like OpenAI’s text-embedding-3 support “nested” dimensions: the same model can output 256-, 512-, 1024-, or 3072-dimensional vectors, where the truncated lower-dimensional prefix retains most of the semantic information of the full vector, letting users flexibly trade off quality and cost.
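A sketch of how an MRL-style embedding is typically consumed on the client side, assuming the leading dimensions carry the coarse semantics (which is what MRL training encourages): truncate the vector and re-normalize before computing cosine similarity. The 3072-dimensional random vector here is only a stand-in for a real embedding.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of an MRL embedding and re-normalize,
    so cosine similarity stays meaningful at the reduced size."""
    small = vec[:dim]
    return small / np.linalg.norm(small)

full = np.random.randn(3072)            # stand-in for a full-size MRL embedding
for d in (256, 512, 1024):
    print(d, truncate_embedding(full, d).shape)
```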
Application: Retrieval-Augmented Generation (RAG)
One of the most important applications of sentence embeddings is RAG (Retrieval-Augmented Generation). RAG addresses core LLM limitations: knowledge cutoff dates, hallucinations, and lack of proprietary knowledge.
The core pipeline: pre-encode external documents as vectors stored in a database; when a user asks a question, first retrieve relevant documents, then concatenate retrieved results into the prompt for the LLM to generate an answer.
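A bare-bones sketch of the retrieval half of that pipeline, assuming an in-memory list instead of a real vector database and a placeholder generate() call standing in for the LLM:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model

# 1. Offline: encode the document collection once and store the vectors.
docs = ["Our refund window is 30 days.",
        "Support is available 24/7 via chat.",
        "Enterprise plans include SSO."]
doc_emb = model.encode(docs, normalize_embeddings=True)

# 2. Online: embed the query, retrieve the top-k most similar documents.
query = "How long do I have to request a refund?"
q_emb = model.encode([query], normalize_embeddings=True)[0]
top_k = np.argsort(doc_emb @ q_emb)[::-1][:2]
context = "\n".join(docs[i] for i in top_k)

# 3. Augment the prompt; `generate` stands in for any LLM call.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# answer = generate(prompt)
```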
RAG effectiveness heavily depends on embedding quality — if the retrieval stage fails to find the right documents, even the most powerful LLM cannot help. This is why high-quality sentence embeddings are critical to the entire RAG system.
Summary
Sentence embeddings are the bridge between language understanding and information retrieval. From SBERT’s Siamese architecture to modern contrastive learning methods, the core goal remains: make semantically similar sentences close in vector space.
Key takeaways:
- Naive averaging of BERT outputs performs poorly (anisotropy problem)
- SBERT uses Siamese networks + mean pooling to efficiently produce sentence vectors
- Contrastive learning (InfoNCE) is the mainstream method for training sentence embeddings
- Modern models (E5, BGE) achieve high quality through large-scale data and multi-stage training
- RAG is the most important downstream application of sentence embeddings, combining retrieval with generation