
Factorization Machines and LLM Routing: From FM Theory to MF Router


Updated 2026-04-22

The previous article walked through the full RouteLLM MF Router pipeline from training to deployment. This article goes one level deeper: RouteLLM’s Matrix Factorization router is, in essence, a special case of Rendle’s (2010) Factorization Machines (FM). Understanding the full FM framework not only deepens our grasp of why the MF router works, but also reveals a clear extension path—adding more feature dimensions to enhance routing capability.

1 The Problem: Prediction Under Extreme Sparsity

In many real-world scenarios, data is represented as high-dimensional feature vectors where almost all elements are zero. This extreme sparsity arises when we one-hot encode large categorical variables (users, items, words, etc.).

Consider 10,000 users and 5,000 movies. A rating event’s feature vector requires 15,000 dimensions (user one-hot + movie one-hot), but only 2 dimensions are non-zero—99.987% zeros. Worse still: you want to model the interaction between Alice and Star Trek. If Alice has never rated Star Trek, traditional models see zero evidence and give up. FM does not.

Formally, we want to learn $\hat{y}: \mathbb{R}^n \to T$, mapping feature vectors $\mathbf{x} \in \mathbb{R}^n$ to a target domain $T$ (regression: $T = \mathbb{R}$; classification: $T = \{+,-\}$). The feature vectors are highly sparse: the average number of non-zero elements $\bar{m}_D \ll n$.

1.1 Running Example: Movie Ratings

Rendle uses a movie rating system as the running example. Users $U = \{\text{Alice}, \text{Bob}, \text{Charlie}\}$, movies $I = \{\text{Titanic}, \text{Notting Hill}, \text{Star Wars}, \text{Star Trek}\}$:

| User | Movie | Rating |
|------|-------|--------|
| Alice | Titanic | 5 |
| Alice | Notting Hill | 3 |
| Alice | Star Wars | 1 |
| Bob | Star Wars | 4 |
| Bob | Star Trek | 5 |
| Charlie | Titanic | 1 |
| Charlie | Star Wars | 5 |

Each record is converted into a sparse feature vector containing multiple one-hot groups. Total dimension ~18, but only ~6 non-zero values per row.
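
A sketch of one such conversion for the Alice–Titanic record, using feature groups in the spirit of Rendle's example (user one-hot, movie one-hot, normalized "other movies rated by this user", last movie rated); the index layout and helper name are illustrative, not from the paper:

```python
import numpy as np

users = ["Alice", "Bob", "Charlie"]
movies = ["Titanic", "Notting Hill", "Star Wars", "Star Trek"]

def rating_features(user, movie, movies_rated_by_user, last_movie=None):
    """Sparse feature vector: user | movie | other movies (normalized) | last movie."""
    x_user = np.eye(len(users))[users.index(user)]
    x_movie = np.eye(len(movies))[movies.index(movie)]
    x_other = np.zeros(len(movies))
    for m in movies_rated_by_user:                 # implicit-feedback group
        x_other[movies.index(m)] = 1.0 / len(movies_rated_by_user)
    x_last = np.zeros(len(movies))
    if last_movie is not None:
        x_last[movies.index(last_movie)] = 1.0
    return np.concatenate([x_user, x_movie, x_other, x_last])

x = rating_features("Alice", "Titanic",
                    movies_rated_by_user=["Titanic", "Notting Hill", "Star Wars"])
# 15 dims in this simplified sketch; Rendle's full example also adds a time feature.
```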

Core question: Can we predict how Alice would rate Star Trek? The co-occurrence $(x_\text{Alice}, x_\text{Star Trek})$ is zero in the training data. A traditional model (e.g., a polynomial SVM) would estimate $w_{\text{Alice},\text{Star Trek}} = 0$. FM can do better.

2 The FM Model Equation

2.1 Full Equation (degree d=2)

$$\hat{y}(\mathbf{x}) := w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j$$

Parameters: $w_0 \in \mathbb{R}$ (global bias), $\mathbf{w} \in \mathbb{R}^n$ (linear weights), $\mathbf{V} \in \mathbb{R}^{n \times k}$ (factor matrix), where $k$ is the factorization dimension.

Equivalent matrix form:

$$\hat{y}(\mathbf{x}) = w_0 + \mathbf{w}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T \tilde{\mathbf{W}}\, \mathbf{x}$$

where $\tilde{\mathbf{W}} = \mathbf{V}\mathbf{V}^T - \text{diag}(\mathbf{V}\mathbf{V}^T)$ is the low-rank factorized interaction matrix with zeroed diagonal (excluding self-interactions).

*Figure: anatomy of the FM model equation — global bias $w_0$, linear term $\sum_i w_i x_i$, and pairwise interaction term $\sum_i \sum_{j>i} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$. The traditional approach keeps one independent parameter $w_{ij}$ per pair ($\approx n^2/2$ parameters); FM factorizes $\hat{w}_{ij} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle$ ($n \cdot k$ parameters, $k \ll n$). Small $k$: fewer parameters, better generalization, limited expressiveness. Large $k$: more expressive, easier to overfit.*

2.2 What Each Term Means

| Parameter | Shape | Meaning | Example |
|-----------|-------|---------|---------|
| $w_0$ | scalar | Global bias/intercept | "Average movie rating = 3.5" |
| $w_i$ | $\mathbb{R}^n$ | First-order weight: feature $i$'s individual effect | $w_\text{Alice} = +0.3$ |
| $\mathbf{v}_i$ | $\mathbb{R}^k$ (row $i$ of $\mathbf{V}$) | Latent factor vector for feature $i$ | $\mathbf{v}_\text{Alice} = [0.8, -0.3, 0.1]$ |
| $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ | scalar (derived) | Interaction strength between features $i$ and $j$ | $\langle \mathbf{v}_\text{Alice}, \mathbf{v}_\text{SW} \rangle = -1.2$ |

The key innovation: factorized interactions. Traditional approaches learn an independent parameter $w_{ij}$ for each pair, requiring $n^2/2$ parameters. FM factorizes $\hat{w}_{ij} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle$, needing only $n \cdot k$ parameters. When $k \ll n$, this dramatically reduces parameters and improves generalization.

2.3 The Hyperparameter $k$

$k$ is the factorization dimension, controlling the trade-off between expressiveness and generalization. For any positive definite interaction matrix $\mathbf{W}$, a sufficiently large $k$ guarantees a $\mathbf{V}$ exists such that $\mathbf{W} = \mathbf{V}\mathbf{V}^T$. But in sparse settings, a small $k$ (typically 8–64) generalizes better by forcing the model to discover shared latent structure.
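
To make the model equation concrete, here is a minimal NumPy sketch of the degree-2 FM prediction in its naive, pair-by-pair form (parameter names `w0`, `w`, `V` are illustrative, not from any particular library); this is the $O(kn^2)$ version that Section 4 speeds up:

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Degree-2 FM prediction via a naive O(k n^2) pairwise loop.

    x  : (n,)   feature vector (mostly zeros)
    w0 : scalar global bias
    w  : (n,)   first-order weights
    V  : (n, k) latent factor matrix
    """
    y = w0 + w @ x
    nz = np.nonzero(x)[0]                 # only non-zero features contribute
    for a, i in enumerate(nz):
        for j in nz[a + 1:]:
            y += (V[i] @ V[j]) * x[i] * x[j]
    return y
```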

3 Why Factorization Works—Alice and Star Trek

This is the most important insight in Rendle’s paper. Let’s trace the reasoning chain:

Problem: What is the interaction $\hat{w}_{\text{Alice},\text{ST}}$ between Alice and Star Trek? They never co-occur in the training data, so a direct model would set $w_{\text{Alice},\text{ST}} = 0$.

FM factorizes $\hat{w}_{\text{Alice},\text{ST}} = \langle \mathbf{v}_\text{Alice}, \mathbf{v}_\text{ST} \rangle$ and leverages indirect evidence:

  1. Bob and Charlie both like Star Wars: training forces $\mathbf{v}_B$ and $\mathbf{v}_C$ to align similarly with $\mathbf{v}_\text{SW}$.
  2. Alice and Charlie disagree on Titanic: Alice gives 5, Charlie gives 1, so $\mathbf{v}_A$ and $\mathbf{v}_C$ differ on the romance dimension.
  3. Star Trek ≈ Star Wars (from Bob's view): Bob rates both highly (4 and 5), so $\mathbf{v}_\text{ST} \approx \mathbf{v}_\text{SW}$.
  4. Alice dislikes Star Wars (rating 1), and $\mathbf{v}_\text{ST} \approx \mathbf{v}_\text{SW}$, so $\langle \mathbf{v}_A, \mathbf{v}_\text{ST} \rangle$ will also be low → Prediction: Alice will rate Star Trek low.
*Figure: FM's indirect reasoning about Alice and Star Trek, shown as a user–movie graph with the observed ratings (high ≥ 4, low ≤ 2) and the unknown Alice → Star Trek edge, walked through in the four steps above.*

FM doesn’t need direct evidence for each (user, item) pair. By factorizing interactions into latent vectors, information propagates through shared factors. If A interacts with X, and B interacts with both X and Y, then A’s latent vector indirectly carries information about Y. This is why FM works under sparsity where SVM fails.

4 The O(kn) Computation Trick

Naive computation of the interaction term requires $O(kn^2)$ work—iterating over all $\binom{n}{2}$ pairs. Rendle showed an elegant algebraic identity that reduces this to $O(kn)$.

4.1 Derivation

Step 1: Rewrite the $i < j$ sum as half of the full double sum minus the diagonal:

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2} \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j - \sum_{i=1}^{n} \langle \mathbf{v}_i, \mathbf{v}_i \rangle x_i^2 \right)$$

Step 2: Expand the dot product, processing each factor dimension $f$ separately:

$$= \frac{1}{2}\left( \sum_{i}\sum_{j}\sum_{f} v_{i,f} v_{j,f} x_i x_j - \sum_{i}\sum_{f} v_{i,f}^2 x_i^2 \right)$$

Step 3 (key): The double sum separates into a product of single sums:

$$\sum_{i}\sum_{j} v_{i,f} v_{j,f} x_i x_j = \left(\sum_{i} v_{i,f} x_i\right)^2$$

This works because $v_{i,f} x_i$ depends only on $i$ and $v_{j,f} x_j$ depends only on $j$.

Final result:

$$\boxed{\text{Interaction term} = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{i,f} x_i\right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2\right]}$$

Complexity drops from $O(kn^2)$ to $O(kn)$. Under sparsity, only non-zero $x_i$ need computation: $O(k\bar{m}_D)$.

In matrix notation: define $\mathbf{s} = \mathbf{V}^T\mathbf{x} \in \mathbb{R}^k$ and $\mathbf{r} = (\mathbf{V} \odot \mathbf{V})^T(\mathbf{x} \odot \mathbf{x}) \in \mathbb{R}^k$; then

$$\text{Interaction term} = \frac{1}{2}\left(\|\mathbf{s}\|^2 - \mathbf{1}^T\mathbf{r}\right)$$

Geometric intuition: for each latent dimension $f$, compute the weighted sum $s_f = \sum_i v_{i,f} x_i$. Then $s_f^2$ contains all pairwise products (including self-interactions); subtracting the self-interactions $\sum_i v_{i,f}^2 x_i^2$ leaves the pure cross-interactions. It is the same idea as $(a+b+c)^2 = a^2 + b^2 + c^2 + 2(ab+ac+bc)$.
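
A vectorized NumPy sketch of the efficient form (same illustrative parameter names as the naive version above); the whole prediction reduces to two matrix–vector products and never touches feature pairs:

```python
import numpy as np

def fm_predict_fast(x, w0, w, V):
    """Degree-2 FM prediction in O(k n) via Rendle's identity."""
    s = V.T @ x                           # s_f = sum_i v_{i,f} x_i
    r = (V * V).T @ (x * x)               # r_f = sum_i v_{i,f}^2 x_i^2
    interaction = 0.5 * np.sum(s * s - r)
    return w0 + w @ x + interaction
```

For sparse inputs one would iterate only over the non-zero indices of `x`, giving the $O(k\bar{m}_D)$ cost noted above.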

4.2 A Concrete Example

Let's use $k=2$ to compute the FM prediction for Alice rating Titanic, assuming the trained parameters:

| Variable | $w_i$ | $v_{i,1}$ | $v_{i,2}$ |
|----------|-------|-----------|-----------|
| Alice | +0.3 | 0.8 | 1.2 |
| Titanic | +0.5 | 0.9 | 1.0 |

$w_0 = 3.5$; only $x_\text{Alice} = 1$ and $x_\text{Titanic} = 1$ are non-zero.

Linear part: $w_0 + w_A + w_{TI} = 3.5 + 0.3 + 0.5 = 4.3$

Interaction part (using the efficient formula):

  • $f=1$: $s_1 = 0.8 + 0.9 = 1.7$, $s_1^2 = 2.89$, self-interaction $= 0.64 + 0.81 = 1.45$
  • $f=2$: $s_2 = 1.2 + 1.0 = 2.2$, $s_2^2 = 4.84$, self-interaction $= 1.44 + 1.00 = 2.44$

Interaction $= \frac{1}{2}[(2.89 - 1.45) + (4.84 - 2.44)] = \frac{1}{2}[1.44 + 2.40] = 1.92$

Verification via the direct dot product: $\langle \mathbf{v}_A, \mathbf{v}_{TI}\rangle = 0.8 \times 0.9 + 1.2 \times 1.0 = 0.72 + 1.20 = 1.92$

$\hat{y} = 4.3 + 1.92 = 6.22$, clipped to the maximum rating of 5.
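
The same numbers can be reproduced with the `fm_predict_fast` sketch above (indices 0 and 1 stand for Alice and Titanic; these are just the illustrative example values, not trained parameters):

```python
import numpy as np

w0 = 3.5
w = np.array([0.3, 0.5])                  # w_Alice, w_Titanic
V = np.array([[0.8, 1.2],                 # v_Alice
              [0.9, 1.0]])                # v_Titanic
x = np.array([1.0, 1.0])                  # both one-hot bits active

print(fm_predict_fast(x, w0, w, V))       # 6.22, clipped to 5 in practice
```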

5 Gradients and Training

5.1 FM Gradients

$$\frac{\partial}{\partial \theta}\hat{y}(\mathbf{x}) = \begin{cases} 1, & \text{if } \theta = w_0 \\ x_i, & \text{if } \theta = w_i \\ x_i \sum_{j=1}^{n} v_{j,f} x_j - v_{i,f} x_i^2, & \text{if } \theta = v_{i,f} \end{cases}$$

Key observation: in the gradient for $v_{i,f}$, the sum $\sum_{j} v_{j,f} x_j$ is independent of $i$ and was already computed during the forward pass (it is just $s_f$). So each gradient element costs $O(1)$, and the full update costs $O(kn)$, or $O(k\bar{m}_D)$ under sparsity.

Matrix form:

$$\frac{\partial \hat{y}}{\partial \mathbf{V}} = \mathbf{x}\,\mathbf{s}^T - \operatorname{diag}(\mathbf{x}\odot\mathbf{x})\,\mathbf{V}$$

where $\mathbf{s} = \mathbf{V}^T\mathbf{x}$ is the cached vector from the forward pass.

5.2 SGD Training Loop

For each training sample $(\mathbf{x}, y)$:

  1. Forward pass: compute $\hat{y}(\mathbf{x})$ in $O(k\bar{m}_D)$, caching $s_f = \sum_i v_{i,f} x_i$
  2. Compute loss (MSE for regression, logit/hinge for classification)
  3. Backward pass: for each non-zero $x_i$, update $w_i$ and all $v_{i,f}$ using the cached $s_f$
  4. Apply L2 regularization: $\theta \leftarrow \theta - \eta(\nabla_\theta L + \lambda \theta)$

All operations per sample are $O(k\bar{m}_D)$—the same cost as computing the prediction itself.
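
A minimal SGD step for squared-error regression, following the loop above (a sketch only; learning rate, regularization constant, and the function name `sgd_step` are illustrative choices):

```python
import numpy as np

def sgd_step(x, y, w0, w, V, lr=0.01, lam=1e-4):
    """One SGD update on a single sample (x, y) with squared-error loss."""
    s = V.T @ x                                   # cached forward-pass vector
    y_hat = w0 + w @ x + 0.5 * np.sum(s * s - (V * V).T @ (x * x))
    g = y_hat - y                                 # dL/dy_hat for 0.5*(y_hat - y)^2

    w0 -= lr * g
    nz = np.nonzero(x)[0]                         # sparsity: touch non-zeros only
    for i in nz:
        w[i] -= lr * (g * x[i] + lam * w[i])
        # grad wrt v_{i,f}: x_i * s_f - v_{i,f} * x_i^2, using the cached s
        V[i] -= lr * (g * (x[i] * s - V[i] * x[i] ** 2) + lam * V[i])
    return w0, w, V
```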

6 FM vs SVM

Polynomial SVM (degree $d=2$) and FM both model all pairwise interactions. The difference is how the interaction weights are parameterized:

| Property | Polynomial SVM | FM |
|----------|----------------|----|
| Interaction weight for $(i,j)$ | $w_{i,j}$—independent free parameter | $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$—factorized |
| Number of interaction params | $n(n-1)/2$ | $n \cdot k$ |
| Estimation requirement | Needs direct $(i,j)$ co-occurrence | Indirect evidence suffices |
| Under sparsity | Unobserved pairs → $w_{i,j} = 0$ | Can generalize to unobserved pairs |
| Optimization | Dual form (support vectors) | Primal form (direct SGD) |
| Prediction time | Depends on support vector count | $O(kn)$—fixed |

On Netflix data (100M+ ratings), SVM starts degrading around $k \approx 20$ (overfitting) while FM continues to improve. At $k = 128$, FM achieves RMSE ~0.906 vs SVM ~0.981.

7 FM Subsumes MF, SVD++, PITF, FPMC

One of Rendle’s key contributions: FM can simulate many specialized models simply by choosing the right feature vector.

7.1 MF is FM with Only Two One-Hot Feature Groups

If the feature vector $\mathbf{x}$ contains only the user one-hot and the item one-hot (just 2 non-zero elements):

$$n = |U \cup I|, \quad x_j = \begin{cases} 1 & \text{if } j = u \text{ or } j = i \\ 0 & \text{otherwise}\end{cases}$$

Substituting into the FM equation, the linear terms collapse to $w_u + w_i$ and the interaction term to $\langle \mathbf{v}_u, \mathbf{v}_i \rangle$:

$$\hat{y}(\mathbf{x}) = w_0 + w_u + w_i + \langle \mathbf{v}_u, \mathbf{v}_i \rangle$$

This is exactly standard biased MF: global bias + user bias + item bias + user-item latent interaction.
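
A small sketch of this reduction: build the two-group one-hot feature vector for a (user, item) pair and feed it to the generic `fm_predict_fast` from Section 4 (the index layout and helper name are illustrative):

```python
import numpy as np

def mf_feature_vector(user_idx, item_idx, n_users, n_items):
    """User one-hot followed by item one-hot: the 'MF special case' of FM."""
    x = np.zeros(n_users + n_items)
    x[user_idx] = 1.0
    x[n_users + item_idx] = 1.0
    return x

# With exactly two non-zero entries, the generic FM prediction reduces to
#   w0 + w_u + w_i + <v_u, v_i>,  i.e. standard biased matrix factorization.
x = mf_feature_vector(user_idx=0, item_idx=2, n_users=3, n_items=4)
```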

Similarly, different feature groups simulate:

  • SVD++: add implicit feedback features
  • PITF: add tag one-hot
  • FPMC: add basket features

FM is not “just another model”—it is a unifying framework. You don’t need to derive new models for each task; just change the feature vector. This is precisely why FM is suitable for extending LLM routers.

8 RouteLLM MF Router: An FM Perspective

Now back to LLM routing. The previous article covered the MF Router implementation in detail; here we re-examine it through the FM lens.

8.1 Scoring Function Recap

The RouteLLM MF router defines a latent scoring function:

$$\delta(M, q) = \mathbf{w}_2^T \left(\mathbf{v}_m \odot \left(\mathbf{W}_1^T \mathbf{v}_q + \mathbf{b}\right)\right)$$

where $\mathbf{v}_m \in \mathbb{R}^{d_m}$ is a learnable model embedding, $\mathbf{v}_q \in \mathbb{R}^{1536}$ is the query embedding (from text-embedding-3-small), $\mathbf{W}_1$ is the projection matrix, $\mathbf{w}_2$ is the output weight, and $\odot$ denotes the Hadamard product.

Win probability via score difference: $P_\theta(\text{win}_s \mid q) = \sigma(\delta(M_s, q) - \delta(M_w, q))$.

8.2 Layer-by-Layer Interpretation

Query projection: $\mathbf{h}_q = \mathbf{W}_1^T \mathbf{v}_q + \mathbf{b}$ projects the 1536-dim general text embedding into a $d_m$-dim routing-relevant space. Intuitively, $\mathbf{W}_1$ learns to extract routing-relevant features—amplifying dimensions that distinguish "hard math" from "casual chat" while suppressing irrelevant semantic detail.

Hadamard product: $\mathbf{h}_q \odot \mathbf{v}_m$ performs per-dimension "supply × demand" matching. Each dimension $v_{m,f}$ represents a model capability axis; each dimension $h_{q,f}$ represents the query's demand for that capability. High values occur where model strength meets query need.

Linear readout: $\delta = \sum_f w_{2,f} \, h_{q,f} \, v_{m,f}$ is a weighted aggregation—learning which capability–demand matches matter most for predicting response quality.
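
A PyTorch-style sketch of this scoring function and the pairwise win probability (module and parameter names are mine, and shapes follow the description above rather than the actual RouteLLM source):

```python
import torch
import torch.nn as nn

class MFRouterScore(nn.Module):
    """delta(M, q) = w2^T (v_m ⊙ (W1^T v_q + b)), as described above."""
    def __init__(self, n_models: int, d_query: int = 1536, d_m: int = 128):
        super().__init__()
        self.model_emb = nn.Embedding(n_models, d_m)    # v_m
        self.proj = nn.Linear(d_query, d_m)             # W1, b
        self.w2 = nn.Linear(d_m, 1, bias=False)         # w2

    def forward(self, model_idx, query_emb):
        h_q = self.proj(query_emb)                      # (B, d_m)
        v_m = self.model_emb(model_idx)                 # (B, d_m)
        return self.w2(v_m * h_q).squeeze(-1)           # (B,)

# Bradley–Terry-style win probability of the strong model over the weak one:
#   P(win_s | q) = sigmoid(delta(M_s, q) - delta(M_w, q))
```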

8.3 Relationship to Standard MF

Standard MF scoring is a pure dot product of two latent vectors: $\delta_\text{MF}(u, i) = \langle \mathbf{v}_u, \mathbf{v}_i \rangle$.

RouteLLM’s scoring is extended MF:

  • The model embedding $\mathbf{v}_m$ plays the "user" vector role
  • The projected query $\mathbf{W}_1^T\mathbf{v}_q + \mathbf{b}$ plays the "item" vector role
  • The weight $\mathbf{w}_2$ adds non-uniform dimension weighting

Both use only two feature groups: model ID and query. From the FM perspective, this is an FM with just two fields.

8.4 Limitations

The MF router only sees model identity and query embedding. It cannot explicitly reason about:

  • Task type (math, coding, creative writing)
  • Complexity level (simple vs multi-step reasoning)
  • Prompt length
  • Language
  • Domain knowledge requirements

All such information must be implicitly captured by the query embedding. The router cannot model explicit interactions like “GPT-4 is especially strong on high-complexity math tasks” because task_type and complexity are not in the input features.

9 Bridging: From MF to FM Router

With both FM theory and RouteLLM MF understood, a clear extension path emerges.

9.1 RouteLLM MF = FM with Only Two Feature Groups

*Figure: MF → FM, from two feature groups to multi-dimensional routing features. The RouteLLM MF router uses only two groups (model ID, 1536-d query embedding) and compresses everything else into a single projection of the query embedding; the extended FM router makes each group—model ID, query embedding, task type, complexity, prompt length, language, …—an explicit first-class feature. The algorithm stays the same; only the feature vector changes.*

9.2 Formula Comparison

RouteLLM MF (2 feature groups):

$$\hat{y} = w_0 + w_m + w_q + \langle \mathbf{v}_m, \text{proj}(\mathbf{v}_q)\rangle$$

FM Router (N feature groups):

$$\hat{y} = w_0 + \sum_i w_i x_i + \sum_{i}\sum_{j>i} \langle \mathbf{v}_i, \mathbf{v}_j\rangle\, x_i x_j$$

The feature vector $\mathbf{x}$ can include: model ID (one-hot or embedding), query embedding (dense), task type (one-hot), complexity (ordinal or binned), prompt length (normalized), language (one-hot), and more. FM automatically models all second-order interactions in $O(kn)$ time.
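
A sketch of how such a routing feature vector might be assembled before being fed to the same generic FM predictor (the field list, vocabularies, and helper names are hypothetical, not part of RouteLLM):

```python
import numpy as np

TASK_TYPES = ["math", "coding", "creative", "chat"]      # hypothetical vocabulary
LANGUAGES = ["en", "zh", "other"]

def one_hot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

def router_features(model_idx, n_models, query_emb, task_idx,
                    complexity, prompt_len, lang_idx):
    """Concatenate feature groups into a single FM input vector x."""
    return np.concatenate([
        one_hot(model_idx, n_models),                    # model ID
        query_emb,                                       # dense query embedding
        one_hot(task_idx, len(TASK_TYPES)),              # task type
        [complexity / 5.0],                              # complexity, scaled
        [min(prompt_len, 4096) / 4096.0],                # prompt length, normalized
        one_hot(lang_idx, len(LANGUAGES)),               # language
    ])
```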

9.3 What FM Can Do That MF Cannot

| Interaction | MF (RouteLLM) | FM (Extended) |
|-------------|---------------|---------------|
| Model × Query | Explicit | Explicit |
| Model × Task Type | Implicit (via embedding) | Explicit |
| Model × Complexity | Implicit | Explicit |
| Task Type × Complexity | None | Explicit |
| Model × Language | Implicit | Explicit |

Why do these interactions matter? "GPT-4 excels at math" is a known pattern that MF can only learn implicitly through query embeddings. Patterns like "INT4 quantization preserves knowledge but degrades reasoning" can be modeled directly through quantization × task-type interactions (a degree-2 FM captures the pairwise effects; a higher-order FM would capture the full model × quantization × task-type interaction). FM makes each feature group an explicit, interpretable first-class citizen.

9.4 Key Takeaway

FM’s original vision was “just specify the input data and the same algorithm simulates multiple models.” The extension from MF to FM router perfectly embodies this: the algorithm stays the same; only the feature vector changes.

10 Summary

| Stage | Content | Core Formula |
|-------|---------|--------------|
| FM theory (Rendle 2010) | General factorized interaction framework, $O(kn)$ | $\hat{y} = w_0 + \sum_i w_i x_i + \frac{1}{2}\sum_f \left[\left(\sum_i v_{i,f}x_i\right)^2 - \sum_i v_{i,f}^2 x_i^2\right]$ |
| MF router (RouteLLM) | FM with only (model, query) features | $\delta(M,q) = \mathbf{w}_2^T\left(\mathbf{v}_m \odot (\mathbf{W}_1^T \mathbf{v}_q + \mathbf{b})\right)$ |
| FM router (extension) | Add task_type, complexity, and more feature dims | Same FM formula, richer $\mathbf{x}$ |

FM provides a theoretically grounded, efficient framework for extending LLM routing. By moving from “model + query” to multi-dimensional features including task type, complexity, and language, we can make routing decisions more fine-grained and interpretable without changing the core algorithm.