
Sentence Embeddings: From Token-Level to Semantic Retrieval


Updated 2026-04-12

Introduction: From Tokens to Sentences

BERT produces a contextual vector for each token, but many real-world tasks require sentence-level representations — semantic search, text clustering, duplicate detection. How do you compress a token sequence into a single fixed-length vector while preserving semantic information? This is the problem sentence embeddings aim to solve.

The core objective is intuitive: semantically similar sentences should have similar vectors. The standard way to measure “similar” is cosine similarity:

$$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}$$
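In code the formula is a one-liner; here is a minimal NumPy version as a sanity check:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v: dot product over norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])    # same direction as u
print(cosine_similarity(u, v))   # 1.0: parallel vectors
print(cosine_similarity(u, -v))  # -1.0: opposite directions
```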

The Problem with Naive Averaging: Anisotropy

The most intuitive approach is to average all token vectors from BERT’s output. However, research has shown that BERT’s raw embedding space suffers from anisotropy — all sentence vectors cluster in a narrow cone-shaped region, causing any two sentences to have high cosine similarity (typically > 0.6), making it impossible to distinguish semantic differences.

The root cause: BERT’s MLM training optimizes token-level prediction and never explicitly learns “semantic distance between sentences.” Sentence vectors from naive mean pooling actually perform worse than simple averages of GloVe static vectors (Reimers & Gurevych, 2019).
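To see what naive mean pooling looks like concretely, here is a minimal sketch over raw BERT token outputs using the Hugging Face transformers library; the checkpoint name and example sentences are illustrative, and the anisotropy effect is what you would be probing for:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; any BERT-style encoder behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def mean_pool(sentences: list[str]) -> torch.Tensor:
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        token_vecs = model(**batch).last_hidden_state  # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)       # zero out padding tokens
    return (token_vecs * mask).sum(1) / mask.sum(1)    # (B, 768)

emb = mean_pool(["The cat sleeps on the couch", "Stock market surged today"])
sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(f"{sim.item():.2f}")  # often surprisingly high for unrelated sentences
```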

The path forward: We need a dedicated training objective that teaches the model to produce discriminative sentence vectors.

Sentence-BERT: Siamese Network Architecture

Sentence-BERT (SBERT) addresses this problem through a Siamese (twin) network architecture. The core idea: pass two sentences through the same BERT encoder (weight-tied), apply mean pooling to each to get sentence vectors, then train so that the cosine similarity of the two vectors reflects the sentences’ semantic similarity (the SBERT paper uses classification, regression, and triplet objectives).

Figure: Siamese network demo, step 1 (input two sentences). Sentence A (“Deep learning is fun”) and Sentence B (“Machine learning is interesting”) are fed into the shared encoder.

SBERT’s key advantage is inference efficiency: the traditional approach concatenates two sentences and feeds them into BERT together (a cross-encoder), requiring $O(N^2)$ forward passes to compute pairwise similarities for $N$ sentences. SBERT can pre-compute all sentence vectors once, after which pairwise comparison only needs dot products: $O(N)$ encodings plus $O(N^2)$ dot products, which run in milliseconds.

For a clustering task with 10,000 sentences: a cross-encoder takes ~65 hours, while SBERT takes ~5 seconds.
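The efficiency win is easy to see in code: once the $N$ vectors are pre-computed and L2-normalized, all pairwise cosine similarities collapse into a single matrix multiplication. Random vectors stand in for real embeddings in this sketch:

```python
import numpy as np

# Stand-in for model.encode(corpus): N pre-computed sentence embeddings.
N, d = 1_000, 384
embeddings = np.random.randn(N, d).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # L2-normalize

# One matmul yields every pairwise cosine similarity at once.
sim_matrix = embeddings @ embeddings.T  # (N, N)
print(sim_matrix.shape)
```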

Contrastive Learning: InfoNCE Loss

Modern sentence embedding models widely use contrastive learning. The core idea: within a batch, each sentence has one positive pair (a semantically similar sentence), and all other sentences in the batch serve as negatives.

The InfoNCE loss function:

$$\mathcal{L}_i = -\log \frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \ne i]} \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_k) / \tau)}$$

Here $\tau$ is the temperature parameter (typically 0.05-0.1), and $\mathbf{z}_i$, $\mathbf{z}_j$ form a positive pair. Lower temperature means “sharper” contrast: the model more aggressively separates positives from negatives.
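A direct PyTorch translation of this loss might look like the following sketch (SimCLR-style: both views of every other sentence in the batch serve as negatives, matching the $2N$ sum above):

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE over 2N views: z1[i] and z2[i] form a positive
    pair; every other view in the batch is a negative."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / tau                               # cosine sims / temperature
    sim.fill_diagonal_(float("-inf"))                   # exclude k == i terms
    # The positive for view i is view i+N (and vice versa).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

z1 = torch.randn(8, 256)                # stand-in sentence embeddings
z2 = z1 + 0.1 * torch.randn_like(z1)    # noisy "paraphrase" views
print(info_nce(z1, z2))
```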

Figure: interactive demo of in-batch contrastive learning. Positive pairs (S1/S1′, S2/S2′, …) are pulled together while all other pairs in the batch are pushed apart, and the InfoNCE loss falls as training progresses.

A key insight of contrastive learning: larger batch sizes mean more negatives, leading to better representations. This is why modern models like E5 and BGE train with very large batches (tens of thousands of pairs).

Semantic Similarity: Building Intuition

With high-quality sentence embeddings, computing semantic similarity between two sentences becomes simple vector arithmetic. The demo below shows cosine similarity for different sentence pairs — note the difference between paraphrases (high scores) and unrelated sentences (low scores):

Figure: sentence similarity calculator demo, computing cos(θ) = (A·B) / (|A||B|) for each pair:

| Sentence A | Sentence B | Cosine similarity |
| --- | --- | --- |
| The food at this restaurant is great | This eatery serves delicious meals | 0.94 |
| The weather is nice today | It is sunny outside | 0.88 |
| I enjoy running | Exercise is good for health | 0.72 |
| The cat sleeps on the couch | Stock market surged today | 0.08 |
| Deep learning requires lots … | Neural networks need massive … | 0.91 |
| He is writing code | The moon orbits the earth | 0.05 |

Characteristics of high-quality embeddings: paraphrase pairs score > 0.85, semantically related but differently phrased pairs 0.6-0.8, completely unrelated pairs < 0.2.

Modern Sentence Embedding Models

After SBERT, the sentence embedding field evolved rapidly:

| Model | Year | Key Innovation | Dimensions |
| --- | --- | --- | --- |
| E5 (Microsoft) | 2022 | Weakly-supervised contrastive pre-training + instruction tuning | 1024 |
| BGE (BAAI) | 2023 | Chinese optimization + RetroMAE pre-training | 1024 |
| GTE (Alibaba) | 2023 | Multi-stage training + multi-task learning | 1024 |
| text-embedding-3 (OpenAI) | 2024 | Matryoshka representation learning, variable dimensions | 256-3072 |

Common trends: larger training data (millions to billions of pairs), multi-stage training (pre-train → contrastive learning → instruction tuning), longer context (8192 tokens).

Matryoshka Representation Learning (MRL): models like OpenAI’s text-embedding-3 support “nested” dimensions: the same model can output 256/512/1024/3072-dimensional vectors, where lower dimensions retain most of the semantic information of higher dimensions, letting users flexibly trade off quality and cost.
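Mechanically, using a Matryoshka embedding amounts to truncating the vector and re-normalizing. The sketch below uses a random vector purely to show the operation; real MRL-trained models are what make the truncated prefix semantically meaningful:

```python
import numpy as np

# Stand-in for a full 3072-dim embedding from an MRL-trained model.
full = np.random.randn(3072).astype(np.float32)

def truncate(v: np.ndarray, k: int) -> np.ndarray:
    """Keep the first k components, then re-normalize to unit length."""
    short = v[:k]
    return short / np.linalg.norm(short)

for k in (256, 512, 1024):
    print(k, truncate(full, k).shape)
```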

Application: Retrieval-Augmented Generation (RAG)

One of the most important applications of sentence embeddings is RAG (Retrieval-Augmented Generation). RAG addresses core LLM limitations: knowledge cutoff dates, hallucinations, and lack of proprietary knowledge.

The core pipeline: pre-encode external documents as vectors stored in a database; when a user asks a question, first retrieve relevant documents, then concatenate retrieved results into the prompt for the LLM to generate an answer.

Figure: RAG pipeline, step 1 (query encoding). The user query (“What is the attention mechanism in Transformers?”) is passed through the embedding model to produce a query vector q ∈ ℝ⁷⁶⁸.

RAG effectiveness heavily depends on embedding quality — if the retrieval stage fails to find the right documents, even the most powerful LLM cannot help. This is why high-quality sentence embeddings are critical to the entire RAG system.

Summary

Sentence embeddings are the bridge between language understanding and information retrieval. From SBERT’s Siamese architecture to modern contrastive learning methods, the core goal remains: make semantically similar sentences close in vector space.

Key takeaways:

  • Naive averaging of BERT outputs performs poorly (anisotropy problem)
  • SBERT uses Siamese networks + mean pooling to efficiently produce sentence vectors
  • Contrastive learning (InfoNCE) is the mainstream method for training sentence embeddings
  • Modern models (E5, BGE) achieve high quality through large-scale data and multi-stage training
  • RAG is the most important downstream application of sentence embeddings, combining retrieval with generation