
Factorization Machines and LLM Routing: From FM Theory to MF Router


Updated 2026-04-22

The previous article walked through the full RouteLLM MF Router pipeline from training to deployment. This article goes one level deeper: RouteLLM’s Matrix Factorization router is, in essence, a special case of Rendle’s (2010) Factorization Machines (FM). Understanding the full FM framework not only deepens our grasp of why the MF router works, but also reveals a clear extension path—adding more feature dimensions to enhance routing capability.

1 The Problem: Prediction Under Extreme Sparsity

In many real-world scenarios, data is represented as high-dimensional feature vectors where almost all elements are zero. This extreme sparsity arises when we one-hot encode large categorical variables (users, items, words, etc.).

Consider 10,000 users and 5,000 movies. A rating event’s feature vector requires 15,000 dimensions (user one-hot + movie one-hot), but only 2 dimensions are non-zero—99.987% zeros. Worse still: you want to model the interaction between Alice and Star Trek. If Alice has never rated Star Trek, traditional models see zero evidence and give up. FM does not.

Formally, we want to learn $\hat{y}: \mathbb{R}^n \to T$, mapping feature vectors $\mathbf{x} \in \mathbb{R}^n$ to a target domain $T$ (regression: $T = \mathbb{R}$; classification: $T = \{+,-\}$). The feature vectors are highly sparse: the average number of non-zero elements $\bar{m}_D \ll n$.

1.1 Running Example: Movie Ratings

Rendle uses a movie rating system as the running example. Users $U = \{\text{Alice}, \text{Bob}, \text{Charlie}\}$, movies $I = \{\text{Titanic}, \text{Notting Hill}, \text{Star Wars}, \text{Star Trek}\}$:

| User | Movie | Rating |
|------|-------|--------|
| Alice | Titanic | 5 |
| Alice | Notting Hill | 3 |
| Alice | Star Wars | 1 |
| Bob | Star Wars | 4 |
| Bob | Star Trek | 5 |
| Charlie | Titanic | 1 |
| Charlie | Star Wars | 5 |

Each record is converted into a sparse feature vector containing multiple one-hot groups. Total dimension ~18, but only ~6 non-zero values per row.
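
A sketch of one such conversion for the Alice–Titanic record, using feature groups in the spirit of Rendle's example (user one-hot, movie one-hot, normalized "other movies rated by this user", last movie rated); the index layout and helper name are illustrative, not from the paper:

```python
import numpy as np

users = ["Alice", "Bob", "Charlie"]
movies = ["Titanic", "Notting Hill", "Star Wars", "Star Trek"]

def rating_features(user, movie, movies_rated_by_user, last_movie=None):
    """Sparse feature vector: user | movie | other movies (normalized) | last movie."""
    x_user = np.eye(len(users))[users.index(user)]
    x_movie = np.eye(len(movies))[movies.index(movie)]
    x_other = np.zeros(len(movies))
    for m in movies_rated_by_user:                 # implicit-feedback group
        x_other[movies.index(m)] = 1.0 / len(movies_rated_by_user)
    x_last = np.zeros(len(movies))
    if last_movie is not None:
        x_last[movies.index(last_movie)] = 1.0
    return np.concatenate([x_user, x_movie, x_other, x_last])

x = rating_features("Alice", "Titanic",
                    movies_rated_by_user=["Titanic", "Notting Hill", "Star Wars"])
# 15 dims in this simplified sketch; Rendle's full example also adds a time feature.
```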

Core question: Can we predict how Alice would rate Star Trek? The co-occurrence $(x_\text{Alice}, x_\text{Star Trek})$ is zero in the training data. A traditional model (e.g., a polynomial SVM) would estimate $w_{\text{Alice},\text{Star Trek}} = 0$. FM can do better.

2 The FM Model Equation

2.1 Full Equation (degree d=2)

$$\hat{y}(\mathbf{x}) := w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j$$

Parameters: $w_0 \in \mathbb{R}$ (global bias), $\mathbf{w} \in \mathbb{R}^n$ (linear weights), $\mathbf{V} \in \mathbb{R}^{n \times k}$ (factor matrix), where $k$ is the factorization dimension.

Equivalent matrix form:

$$\hat{y}(\mathbf{x}) = w_0 + \mathbf{w}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T \tilde{\mathbf{W}}\, \mathbf{x}$$

where $\tilde{\mathbf{W}} = \mathbf{V}\mathbf{V}^T - \text{diag}(\mathbf{V}\mathbf{V}^T)$ is the low-rank factorized interaction matrix with zeroed diagonal (excluding self-interactions).

*Figure: anatomy of the FM model equation — global bias $w_0$, linear term $\sum_i w_i x_i$, and pairwise interaction term $\sum_i \sum_{j>i} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$. The traditional approach keeps one independent parameter $w_{ij}$ per pair ($\approx n^2/2$ parameters); FM factorizes $\hat{w}_{ij} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle$ ($n \cdot k$ parameters, $k \ll n$). Small $k$: fewer parameters, better generalization, limited expressiveness. Large $k$: more expressive, easier to overfit.*

2.2 What Each Term Means

| Parameter | Shape | Meaning | Example |
|-----------|-------|---------|---------|
| $w_0$ | scalar | Global bias/intercept | "Average movie rating = 3.5" |
| $w_i$ | $\mathbb{R}^n$ | First-order weight: feature $i$'s individual effect | $w_\text{Alice} = +0.3$ |
| $\mathbf{v}_i$ | $\mathbb{R}^k$ (row $i$ of $\mathbf{V}$) | Latent factor vector for feature $i$ | $\mathbf{v}_\text{Alice} = [0.8, -0.3, 0.1]$ |
| $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ | scalar (derived) | Interaction strength between features $i$ and $j$ | $\langle \mathbf{v}_\text{Alice}, \mathbf{v}_\text{SW} \rangle = -1.2$ |

The key innovation: factorized interactions. Traditional approaches learn an independent parameter $w_{ij}$ for each pair, requiring $n^2/2$ parameters. FM factorizes $\hat{w}_{ij} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle$, needing only $n \cdot k$ parameters. When $k \ll n$, this dramatically reduces parameters and improves generalization.

2.3 The Hyperparameter $k$

$k$ is the factorization dimension, controlling the trade-off between expressiveness and generalization. For any positive definite interaction matrix $\mathbf{W}$, a sufficiently large $k$ guarantees a $\mathbf{V}$ exists such that $\mathbf{W} = \mathbf{V}\mathbf{V}^T$. But in sparse settings, a small $k$ (typically 8–64) generalizes better by forcing the model to discover shared latent structure.
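
To make the model equation concrete, here is a minimal NumPy sketch of the degree-2 FM prediction in its naive, pair-by-pair form (parameter names `w0`, `w`, `V` are illustrative, not from any particular library); this is the $O(kn^2)$ version that Section 4 speeds up:

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Degree-2 FM prediction via a naive O(k n^2) pairwise loop.

    x  : (n,)   feature vector (mostly zeros)
    w0 : scalar global bias
    w  : (n,)   first-order weights
    V  : (n, k) latent factor matrix
    """
    y = w0 + w @ x
    nz = np.nonzero(x)[0]                 # only non-zero features contribute
    for a, i in enumerate(nz):
        for j in nz[a + 1:]:
            y += (V[i] @ V[j]) * x[i] * x[j]
    return y
```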

3 Why Factorization Works—Alice and Star Trek

This is the most important insight in Rendle’s paper. Let’s trace the reasoning chain:

Problem: What is the interaction $\hat{w}_{\text{Alice},\text{ST}}$ between Alice and Star Trek? They never co-occur in the training data, so a direct model would set $w_{\text{Alice},\text{ST}} = 0$.

FM factorizes $\hat{w}_{\text{Alice},\text{ST}} = \langle \mathbf{v}_\text{Alice}, \mathbf{v}_\text{ST} \rangle$ and leverages indirect evidence:

  1. Bob and Charlie both like Star Wars: training forces $\mathbf{v}_B$ and $\mathbf{v}_C$ to align similarly with $\mathbf{v}_\text{SW}$.
  2. Alice and Charlie disagree on Titanic: Alice gives 5, Charlie gives 1, so $\mathbf{v}_A$ and $\mathbf{v}_C$ differ on the romance dimension.
  3. Star Trek ≈ Star Wars (from Bob's view): Bob rates both highly (4 and 5), so $\mathbf{v}_\text{ST} \approx \mathbf{v}_\text{SW}$.
  4. Alice dislikes Star Wars (rating 1), and $\mathbf{v}_\text{ST} \approx \mathbf{v}_\text{SW}$, so $\langle \mathbf{v}_A, \mathbf{v}_\text{ST} \rangle$ will also be low → Prediction: Alice will rate Star Trek low.
*Figure: FM's indirect reasoning about Alice and Star Trek, shown as a user–movie graph with the observed ratings (high ≥ 4, low ≤ 2) and the unknown Alice → Star Trek edge, walked through in the four steps above.*

FM doesn’t need direct evidence for each (user, item) pair. By factorizing interactions into latent vectors, information propagates through shared factors. If A interacts with X, and B interacts with both X and Y, then A’s latent vector indirectly carries information about Y. This is why FM works under sparsity where SVM fails.

4 The O(kn) Computation Trick

Naive computation of the interaction term requires $O(kn^2)$ work—iterating over all $\binom{n}{2}$ pairs. Rendle showed an elegant algebraic identity that reduces this to $O(kn)$.

4.1 Derivation

Step 1: Rewrite the $i < j$ sum as half of the full double sum minus the diagonal:

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2} \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j - \sum_{i=1}^{n} \langle \mathbf{v}_i, \mathbf{v}_i \rangle x_i^2 \right)$$

Step 2: Expand the dot product, processing each factor dimension $f$ separately:

$$= \frac{1}{2}\left( \sum_{i}\sum_{j}\sum_{f} v_{i,f} v_{j,f} x_i x_j - \sum_{i}\sum_{f} v_{i,f}^2 x_i^2 \right)$$

Step 3 (key): The double sum separates into a product of single sums:

$$\sum_{i}\sum_{j} v_{i,f} v_{j,f} x_i x_j = \left(\sum_{i} v_{i,f} x_i\right)^2$$

This works because $v_{i,f} x_i$ depends only on $i$ and $v_{j,f} x_j$ depends only on $j$.

Final result:

$$\boxed{\text{Interaction term} = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{i,f} x_i\right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2\right]}$$

Complexity drops from $O(kn^2)$ to $O(kn)$. Under sparsity, only non-zero $x_i$ need computation: $O(k\bar{m}_D)$.

In matrix notation: define $\mathbf{s} = \mathbf{V}^T\mathbf{x} \in \mathbb{R}^k$ and $\mathbf{r} = (\mathbf{V} \odot \mathbf{V})^T(\mathbf{x} \odot \mathbf{x}) \in \mathbb{R}^k$; then

$$\text{Interaction term} = \frac{1}{2}\left(\|\mathbf{s}\|^2 - \mathbf{1}^T\mathbf{r}\right)$$

Geometric intuition: for each latent dimension $f$, compute the weighted sum $s_f = \sum_i v_{i,f} x_i$. Then $s_f^2$ contains all pairwise products (including self-interactions); subtracting the self-interactions $\sum_i v_{i,f}^2 x_i^2$ leaves the pure cross-interactions. It is the same idea as $(a+b+c)^2 = a^2 + b^2 + c^2 + 2(ab+ac+bc)$.
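
A vectorized NumPy sketch of the efficient form (same illustrative parameter names as the naive version above); the whole prediction reduces to two matrix–vector products and never touches feature pairs:

```python
import numpy as np

def fm_predict_fast(x, w0, w, V):
    """Degree-2 FM prediction in O(k n) via Rendle's identity."""
    s = V.T @ x                           # s_f = sum_i v_{i,f} x_i
    r = (V * V).T @ (x * x)               # r_f = sum_i v_{i,f}^2 x_i^2
    interaction = 0.5 * np.sum(s * s - r)
    return w0 + w @ x + interaction
```

For sparse inputs one would iterate only over the non-zero indices of `x`, giving the $O(k\bar{m}_D)$ cost noted above.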

4.2 A Concrete Example

Let's use $k=2$ to compute the FM prediction for Alice rating Titanic, assuming the trained parameters:

| Variable | $w_i$ | $v_{i,1}$ | $v_{i,2}$ |
|----------|-------|-----------|-----------|
| Alice | +0.3 | 0.8 | 1.2 |
| Titanic | +0.5 | 0.9 | 1.0 |

$w_0 = 3.5$; only $x_\text{Alice} = 1$ and $x_\text{Titanic} = 1$ are non-zero.

Linear part: $w_0 + w_A + w_{TI} = 3.5 + 0.3 + 0.5 = 4.3$

Interaction part (using the efficient formula):

  • $f=1$: $s_1 = 0.8 + 0.9 = 1.7$, $s_1^2 = 2.89$, self-interaction $= 0.64 + 0.81 = 1.45$
  • $f=2$: $s_2 = 1.2 + 1.0 = 2.2$, $s_2^2 = 4.84$, self-interaction $= 1.44 + 1.00 = 2.44$

Interaction $= \frac{1}{2}[(2.89 - 1.45) + (4.84 - 2.44)] = \frac{1}{2}[1.44 + 2.40] = 1.92$

Verification via the direct dot product: $\langle \mathbf{v}_A, \mathbf{v}_{TI}\rangle = 0.8 \times 0.9 + 1.2 \times 1.0 = 0.72 + 1.20 = 1.92$

$\hat{y} = 4.3 + 1.92 = 6.22$, clipped to the maximum rating of 5.
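
The same numbers can be reproduced with the `fm_predict_fast` sketch above (indices 0 and 1 stand for Alice and Titanic; these are just the illustrative example values, not trained parameters):

```python
import numpy as np

w0 = 3.5
w = np.array([0.3, 0.5])                  # w_Alice, w_Titanic
V = np.array([[0.8, 1.2],                 # v_Alice
              [0.9, 1.0]])                # v_Titanic
x = np.array([1.0, 1.0])                  # both one-hot bits active

print(fm_predict_fast(x, w0, w, V))       # 6.22, clipped to 5 in practice
```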

5 Gradients and Training

5.1 FM Gradients

$$\frac{\partial}{\partial \theta}\hat{y}(\mathbf{x}) = \begin{cases} 1, & \text{if } \theta = w_0 \\ x_i, & \text{if } \theta = w_i \\ x_i \sum_{j=1}^{n} v_{j,f} x_j - v_{i,f} x_i^2, & \text{if } \theta = v_{i,f} \end{cases}$$

Key observation: in the gradient for $v_{i,f}$, the sum $\sum_{j} v_{j,f} x_j$ is independent of $i$ and was already computed during the forward pass (it is just $s_f$). So each gradient element costs $O(1)$, and the full update costs $O(kn)$, or $O(k\bar{m}_D)$ under sparsity.

Matrix form:

$$\frac{\partial \hat{y}}{\partial \mathbf{V}} = \mathbf{x}\,\mathbf{s}^T - \operatorname{diag}(\mathbf{x}\odot\mathbf{x})\,\mathbf{V}$$

where $\mathbf{s} = \mathbf{V}^T\mathbf{x}$ is the cached vector from the forward pass.

5.2 SGD Training Loop

For each training sample $(\mathbf{x}, y)$:

  1. Forward pass: compute $\hat{y}(\mathbf{x})$ in $O(k\bar{m}_D)$, caching $s_f = \sum_i v_{i,f} x_i$
  2. Compute loss (MSE for regression, logit/hinge for classification)
  3. Backward pass: for each non-zero $x_i$, update $w_i$ and all $v_{i,f}$ using the cached $s_f$
  4. Apply L2 regularization: $\theta \leftarrow \theta - \eta(\nabla_\theta L + \lambda \theta)$

All operations per sample are $O(k\bar{m}_D)$—the same cost as computing the prediction itself.
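
A minimal SGD step for squared-error regression, following the loop above (a sketch only; learning rate, regularization constant, and the function name `sgd_step` are illustrative choices):

```python
import numpy as np

def sgd_step(x, y, w0, w, V, lr=0.01, lam=1e-4):
    """One SGD update on a single sample (x, y) with squared-error loss."""
    s = V.T @ x                                   # cached forward-pass vector
    y_hat = w0 + w @ x + 0.5 * np.sum(s * s - (V * V).T @ (x * x))
    g = y_hat - y                                 # dL/dy_hat for 0.5*(y_hat - y)^2

    w0 -= lr * g
    nz = np.nonzero(x)[0]                         # sparsity: touch non-zeros only
    for i in nz:
        w[i] -= lr * (g * x[i] + lam * w[i])
        # grad wrt v_{i,f}: x_i * s_f - v_{i,f} * x_i^2, using the cached s
        V[i] -= lr * (g * (x[i] * s - V[i] * x[i] ** 2) + lam * V[i])
    return w0, w, V
```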

6 FM vs SVM

Polynomial SVM (degree $d=2$) and FM both model all pairwise interactions. The difference is how the interaction weights are parameterized:

| Property | Polynomial SVM | FM |
|----------|----------------|----|
| Interaction weight for $(i,j)$ | $w_{i,j}$—independent free parameter | $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$—factorized |
| Number of interaction params | $n(n-1)/2$ | $n \cdot k$ |
| Estimation requirement | Needs direct $(i,j)$ co-occurrence | Indirect evidence suffices |
| Under sparsity | Unobserved pairs → $w_{i,j} = 0$ | Can generalize to unobserved pairs |
| Optimization | Dual form (support vectors) | Primal form (direct SGD) |
| Prediction time | Depends on support vector count | $O(kn)$—fixed |

On Netflix data (100M+ ratings), SVM starts degrading around $k \approx 20$ (overfitting) while FM continues to improve. At $k = 128$, FM achieves RMSE ~0.906 vs SVM ~0.981.

7 FM Subsumes MF, SVD++, PITF, FPMC

One of Rendle’s key contributions: FM can simulate many specialized models simply by choosing the right feature vector.

7.1 MF is FM with Only Two One-Hot Feature Groups

If the feature vector $\mathbf{x}$ contains only the user one-hot and the item one-hot (just 2 non-zero elements):

$$n = |U \cup I|, \quad x_j = \begin{cases} 1 & \text{if } j = u \text{ or } j = i \\ 0 & \text{otherwise}\end{cases}$$

Substituting into the FM equation, the linear terms collapse to $w_u + w_i$ and the interaction term to $\langle \mathbf{v}_u, \mathbf{v}_i \rangle$:

$$\hat{y}(\mathbf{x}) = w_0 + w_u + w_i + \langle \mathbf{v}_u, \mathbf{v}_i \rangle$$

This is exactly standard biased MF: global bias + user bias + item bias + user-item latent interaction.
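
A small sketch of this reduction: build the two-group one-hot feature vector for a (user, item) pair and feed it to the generic `fm_predict_fast` from Section 4 (the index layout and helper name are illustrative):

```python
import numpy as np

def mf_feature_vector(user_idx, item_idx, n_users, n_items):
    """User one-hot followed by item one-hot: the 'MF special case' of FM."""
    x = np.zeros(n_users + n_items)
    x[user_idx] = 1.0
    x[n_users + item_idx] = 1.0
    return x

# With exactly two non-zero entries, the generic FM prediction reduces to
#   w0 + w_u + w_i + <v_u, v_i>,  i.e. standard biased matrix factorization.
x = mf_feature_vector(user_idx=0, item_idx=2, n_users=3, n_items=4)
```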

Similarly, different feature groups simulate:

  • SVD++: add implicit feedback features
  • PITF: add tag one-hot
  • FPMC: add basket features

FM is not “just another model”—it is a unifying framework. You don’t need to derive new models for each task; just change the feature vector. This is precisely why FM is suitable for extending LLM routers.

8 RouteLLM MF Router: An FM Perspective

Now back to LLM routing. The previous article covered the MF Router implementation in detail; here we re-examine it through the FM lens.

8.1 Scoring Function Recap

The RouteLLM MF router defines a latent scoring function:

$$\delta(M, q) = \mathbf{w}_2^T \left(\mathbf{v}_m \odot \left(\mathbf{W}_1^T \mathbf{v}_q + \mathbf{b}\right)\right)$$

where $\mathbf{v}_m \in \mathbb{R}^{d_m}$ is a learnable model embedding, $\mathbf{v}_q \in \mathbb{R}^{1536}$ is the query embedding (from text-embedding-3-small), $\mathbf{W}_1$ is the projection matrix, $\mathbf{w}_2$ is the output weight, and $\odot$ denotes the Hadamard product.

Win probability via score difference: $P_\theta(\text{win}_s \mid q) = \sigma(\delta(M_s, q) - \delta(M_w, q))$.

8.2 Layer-by-Layer Interpretation

Query projection: $\mathbf{h}_q = \mathbf{W}_1^T \mathbf{v}_q + \mathbf{b}$ projects the 1536-dim general text embedding into a $d_m$-dim routing-relevant space. Intuitively, $\mathbf{W}_1$ learns to extract routing-relevant features—amplifying dimensions that distinguish "hard math" from "casual chat" while suppressing irrelevant semantic detail.

Hadamard product: $\mathbf{h}_q \odot \mathbf{v}_m$ performs per-dimension "supply × demand" matching. Each dimension $v_{m,f}$ represents a model capability axis; each dimension $h_{q,f}$ represents the query's demand for that capability. High values occur where model strength meets query need.

Linear readout: $\delta = \sum_f w_{2,f} \, h_{q,f} \, v_{m,f}$ is a weighted aggregation—learning which capability–demand matches matter most for predicting response quality.
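
A PyTorch-style sketch of this scoring function and the pairwise win probability (module and parameter names are mine, and shapes follow the description above rather than the actual RouteLLM source):

```python
import torch
import torch.nn as nn

class MFRouterScore(nn.Module):
    """delta(M, q) = w2^T (v_m ⊙ (W1^T v_q + b)), as described above."""
    def __init__(self, n_models: int, d_query: int = 1536, d_m: int = 128):
        super().__init__()
        self.model_emb = nn.Embedding(n_models, d_m)    # v_m
        self.proj = nn.Linear(d_query, d_m)             # W1, b
        self.w2 = nn.Linear(d_m, 1, bias=False)         # w2

    def forward(self, model_idx, query_emb):
        h_q = self.proj(query_emb)                      # (B, d_m)
        v_m = self.model_emb(model_idx)                 # (B, d_m)
        return self.w2(v_m * h_q).squeeze(-1)           # (B,)

# Bradley–Terry-style win probability of the strong model over the weak one:
#   P(win_s | q) = sigmoid(delta(M_s, q) - delta(M_w, q))
```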

8.3 Relationship to Standard MF

Standard MF scoring is a pure dot product of two latent vectors: $\delta_\text{MF}(u, i) = \langle \mathbf{v}_u, \mathbf{v}_i \rangle$.

RouteLLM’s scoring is extended MF:

  • The model embedding $\mathbf{v}_m$ plays the "user" vector role
  • The projected query $\mathbf{W}_1^T\mathbf{v}_q + \mathbf{b}$ plays the "item" vector role
  • The weight $\mathbf{w}_2$ adds non-uniform dimension weighting

Both use only two feature groups: model ID and query. From the FM perspective, this is an FM with just two fields.

8.4 Limitations

The MF router only sees model identity and query embedding. It cannot explicitly reason about:

  • Task type (math, coding, creative writing)
  • Complexity level (simple vs multi-step reasoning)
  • Prompt length
  • Language
  • Domain knowledge requirements

All such information must be implicitly captured by the query embedding. The router cannot model explicit interactions like “GPT-4 is especially strong on high-complexity math tasks” because task_type and complexity are not in the input features.

9 Bridging: From MF to FM Router

With both FM theory and RouteLLM MF understood, a clear extension path emerges.

9.1 RouteLLM MF = FM with Only Two Feature Groups

*Figure: MF → FM, from two feature groups to multi-dimensional routing features. The RouteLLM MF router uses only two groups (model ID, 1536-d query embedding) and compresses everything else into a single projection of the query embedding; the extended FM router makes each group—model ID, query embedding, task type, complexity, prompt length, language, …—an explicit first-class feature. The algorithm stays the same; only the feature vector changes.*

9.2 Formula Comparison

RouteLLM MF (2 feature groups):

$$\hat{y} = w_0 + w_m + w_q + \langle \mathbf{v}_m, \text{proj}(\mathbf{v}_q)\rangle$$

FM Router (N feature groups):

$$\hat{y} = w_0 + \sum_i w_i x_i + \sum_{i}\sum_{j>i} \langle \mathbf{v}_i, \mathbf{v}_j\rangle\, x_i x_j$$

The feature vector $\mathbf{x}$ can include: model ID (one-hot or embedding), query embedding (dense), task type (one-hot), complexity (ordinal or binned), prompt length (normalized), language (one-hot), and more. FM automatically models all second-order interactions in $O(kn)$ time.
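
A sketch of how such a routing feature vector might be assembled before being fed to the same generic FM predictor (the field list, vocabularies, and helper names are hypothetical, not part of RouteLLM):

```python
import numpy as np

TASK_TYPES = ["math", "coding", "creative", "chat"]      # hypothetical vocabulary
LANGUAGES = ["en", "zh", "other"]

def one_hot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

def router_features(model_idx, n_models, query_emb, task_idx,
                    complexity, prompt_len, lang_idx):
    """Concatenate feature groups into a single FM input vector x."""
    return np.concatenate([
        one_hot(model_idx, n_models),                    # model ID
        query_emb,                                       # dense query embedding
        one_hot(task_idx, len(TASK_TYPES)),              # task type
        [complexity / 5.0],                              # complexity, scaled
        [min(prompt_len, 4096) / 4096.0],                # prompt length, normalized
        one_hot(lang_idx, len(LANGUAGES)),               # language
    ])
```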

9.3 What FM Can Do That MF Cannot

| Interaction | MF (RouteLLM) | FM (Extended) |
|-------------|---------------|---------------|
| Model × Query | Explicit | Explicit |
| Model × Task Type | Implicit (via embedding) | Explicit |
| Model × Complexity | Implicit | Explicit |
| Task Type × Complexity | None | Explicit |
| Model × Language | Implicit | Explicit |

Why do these interactions matter? "GPT-4 excels at math" is a known pattern that MF can only learn implicitly through query embeddings. Patterns like "INT4 quantization preserves knowledge but degrades reasoning" can be modeled directly through quantization × task-type interactions (a degree-2 FM captures the pairwise effects; a higher-order FM would capture the full model × quantization × task-type interaction). FM makes each feature group an explicit, interpretable first-class citizen.

9.4 Key Takeaway

FM’s original vision was “just specify the input data and the same algorithm simulates multiple models.” The extension from MF to FM router perfectly embodies this: the algorithm stays the same; only the feature vector changes.

10 Summary

| Stage | Content | Core Formula |
|-------|---------|--------------|
| FM theory (Rendle 2010) | General factorized interaction framework, $O(kn)$ | $\hat{y} = w_0 + \sum_i w_i x_i + \frac{1}{2}\sum_f \left[\left(\sum_i v_{i,f}x_i\right)^2 - \sum_i v_{i,f}^2 x_i^2\right]$ |
| MF router (RouteLLM) | FM with only (model, query) features | $\delta(M,q) = \mathbf{w}_2^T\left(\mathbf{v}_m \odot (\mathbf{W}_1^T \mathbf{v}_q + \mathbf{b})\right)$ |
| FM router (extension) | Add task_type, complexity, and more feature dims | Same FM formula, richer $\mathbf{x}$ |

FM provides a theoretically grounded, efficient framework for extending LLM routing. By moving from “model + query” to multi-dimensional features including task type, complexity, and language, we can make routing decisions more fine-grained and interpretable without changing the core algorithm.