Content on this site is AI-generated and may contain errors. If you find issues, please report at GitHub Issues.

Multi-Model Collaboration: From Picking One to Using Many


Updated 2026-04-06

Important distinction: MoE (Mixture of Experts) is an expert routing mechanism within a single model, while MoA (Mixture of Agents) is collaboration between multiple complete LLM models — the granularity is entirely different.

The previous articles all discussed “how to choose the single best model,” but there is another approach: why choose at all? Can we use multiple models simultaneously and synthesize their answers? This is the core idea behind Mixture of Agents (MoA).

Selection vs Synthesis

The traditional paradigm of Model Routing is Selection: given a request, the router picks the single most suitable model. This approach is efficient and cost-controllable, but may miss the strengths of other models.

The Synthesis paradigm is fundamentally different: call multiple models, collect all their answers, and merge them using some strategy. This approach costs more but leverages the complementarity between models, reducing single-model hallucination and bias.

[Figure: Two Philosophies, Select One (Routing) vs Synthesize Many (MoA). In Routing, a query passes through a router to exactly one model (e.g., GPT-4, Claude, or Llama), producing one answer. Routing assumes a single "best model" exists: lowest cost (one model call) and lowest latency (single inference), but quality is capped by the router's accuracy and the chosen model's ceiling.]

Selection pursues efficiency; synthesis pursues quality. Real-world systems often seek a balance between the two: use selection for simple requests and synthesis for critical ones.

Council Mode: Parallel Generation, Centralized Synthesis

Council Mode (2026) is a representative architecture for multi-model collaboration. Its workflow consists of two phases:

  1. Parallel Generation Phase: Multiple LLMs (called council members) independently generate answers, unaware of each other’s outputs.
  2. Synthesis Phase: A synthesizer model collects all answers, analyzes their consensus and divergence, and produces the final answer.
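The two phases above can be sketched in a few lines. This is a minimal illustration, not a real implementation: the council-member functions and the synthesizer below are hypothetical stubs standing in for actual LLM API calls.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical council members -- in a real system each would call a
# different LLM provider. They never see each other's outputs.
def member_a(query: str) -> str:
    return f"A's answer to: {query}"

def member_b(query: str) -> str:
    return f"B's answer to: {query}"

def member_c(query: str) -> str:
    return f"C's answer to: {query}"

def synthesizer(query: str, answers: list[str]) -> str:
    # A real synthesizer would be another LLM call that weighs
    # consensus and divergence; this stub just reports what it saw.
    return f"Synthesis of {len(answers)} answers for: {query}"

def council(query: str) -> str:
    members = [member_a, member_b, member_c]
    # Phase 1: parallel, independent generation.
    with ThreadPoolExecutor(max_workers=len(members)) as pool:
        answers = list(pool.map(lambda m: m(query), members))
    # Phase 2: centralized synthesis.
    return synthesizer(query, answers)

print(council("What causes tides?"))
```

Because phase 1 fans out in parallel, its wall-clock time is bounded by the slowest member rather than the sum of all members.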

Experiments show that Council Mode can reduce the hallucination rate by 35.9%. When council members’ answers are highly consistent, the synthesizer has higher confidence; when answers diverge significantly, the synthesizer flags uncertainty.

[Figure: Council Mode, multi-LLM parallel synthesis. Three models answer in parallel (e.g., GPT-4o at 92% quality, Claude 3.5 at 90%, Gemini 1.5 at 88%); a synthesis layer merges them into a single answer at 96% quality, with a 35.9% reduction in hallucinations. Cost: 3 parallel models means roughly 3x a single model's cost plus the synthesis call. Latency: the max of the 3 model latencies plus synthesis time, roughly 1.2x the slowest model. Note again that MoA is multi-LLM collaboration, while MoE is intra-model expert routing.]

Three common synthesis strategies:

  • Merge: Extract the core information from each answer and construct a comprehensive answer incorporating all perspectives.
  • Majority Vote: Select the answer that appears most frequently — suitable for classification or multiple-choice tasks.
  • Best-of-N: Have the synthesizer score all answers and select the highest-quality one.
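Of the three strategies, Merge requires an LLM call, but Majority Vote and Best-of-N can be expressed directly. A minimal sketch, in which the length-based judge is a deliberately simplistic stand-in for a real judge model:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    # Most frequent answer wins; Counter breaks ties by first appearance.
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers: list[str], judge) -> str:
    # The judge assigns each answer a score; keep the top scorer.
    return max(answers, key=judge)

votes = ["Paris", "Paris", "Lyon"]
print(majority_vote(votes))                 # the 2-vote answer wins
print(best_of_n(votes, judge=len))          # toy judge: prefer longer text
```

Majority Vote only makes sense when answers are directly comparable (classification, multiple choice); for free-form text, Best-of-N with a judge model is the more natural fit.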

Hierarchical MoA: From Flat to Pyramid

Simple Council Mode uses a flat structure where all models are equal. More complex systems use a hierarchical structure, building decision trees or pyramids.

A hierarchical multi-agent approach runs a mini-council at each decision node, where each layer’s output becomes the next layer’s input. This enables handling complex tasks that require multi-step reasoning.

Pyramid MoA is an architecture proposed by Together AI, with layers that progressively narrow:

  • Layer 1: 5 general-purpose models generate initial answers.
  • Layer 2: 3 models synthesize Layer 1’s outputs.
  • Layer 3: 1 strongest model produces the final answer.

This design leverages decision-theoretic routing: early layers quickly filter out obviously wrong answers, while later layers focus on refinement. The system can decide whether to early-stop based on intermediate layer consensus, saving computational costs.
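The narrowing layers and consensus-based early stop can be sketched as follows. The layer widths, the agreement metric, and the 0.8 threshold are all illustrative assumptions, and `stub` stands in for real model calls:

```python
def agreement(answers: list[str]) -> float:
    """Fraction of answers matching the most common answer."""
    top = max(set(answers), key=answers.count)
    return answers.count(top) / len(answers)

def pyramid_moa(query: str, layers, threshold: float = 0.8) -> str:
    """layers: lists of model functions with narrowing widths, e.g. 5 -> 3 -> 1.
    Each model takes (query, prior_answers) and returns an answer."""
    prior: list[str] = []
    for models in layers:
        prior = [m(query, prior) for m in models]
        # Early stop: if this layer already agrees, skip the deeper layers.
        if agreement(prior) >= threshold:
            break
    return max(set(prior), key=prior.count)

def stub(ans: str):
    # Hypothetical model that ignores its inputs and returns a fixed answer.
    return lambda query, prior: ans

# All five Layer-1 models agree, so the pyramid terminates early.
layers = [[stub("42")] * 5, [stub("42")] * 3, [stub("42")]]
print(pyramid_moa("q", layers))
```

In practice the early-stop check is where the cost savings come from: a unanimous first layer means the two deeper (and typically more expensive) layers are never invoked.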

[Figure: Hierarchical MoA architectures. Flat MoA runs all models in parallel in a single layer and aggregates: simple, but with limited quality gain. Hierarchical MoA refines over multiple layers (e.g., Layer 1: LLM-A through LLM-D; Layer 2: two aggregators; Layer 3: a final aggregator), each layer aggregating and passing its output to the next for further improvement. Pyramid MoA uses decreasing layer widths (5 to 3 to 1), with a router deciding when the answer is "good enough" for early termination. HieraMAS combines intra-node LLM mixing with inter-node communication.]

The advantage of hierarchical structure lies in flexibility: different layers can use models of different sizes (fast models at the bottom, high-quality models at the top), finding the optimal configuration between cost and quality.

Ensemble and Voting

The core of Ensemble Learning is diversity: if all models make the same mistakes, voting won’t help. An ideal ensemble should include:

  • Models with different architectures (e.g., GPT-4, Claude, Gemini).
  • Models of different sizes (large models excel at reasoning, small models at speed).
  • Models trained on different data (to reduce shared biases).

Three common voting mechanisms:

  1. Majority Voting: Each model gets one vote; the answer with the most votes wins. Simple but ignores differences in model quality.
  2. Weighted Voting: Assign each model a weight $w_i$ based on its historical accuracy; an answer's score is $\sum_i w_i \cdot \mathbb{1}[\text{model}_i = \text{answer}]$.
  3. Best-of-N Selection: Use a judge model to score all answers and select the highest scorer.
[Figure: Ensemble voting methods (Majority, Weighted, Best-of-N). Example: GPT-4o and Claude both vote A, Gemini votes B; majority voting gives A the win, 2 votes to 1.]
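Weighted voting implements the score formula above directly: each model contributes its weight to the answer it chose, and the highest-scoring answer wins. A minimal sketch, with illustrative historical-accuracy weights:

```python
from collections import defaultdict

def weighted_vote(votes: list[tuple[str, float]]) -> str:
    """votes: (answer, weight) pairs, one per model.
    score(answer) = sum of weights of the models that chose it."""
    scores: dict[str, float] = defaultdict(float)
    for answer, weight in votes:
        scores[answer] += weight
    return max(scores, key=scores.get)

# One historically accurate model outvotes two weaker ones:
# A scores 0.9, B scores 0.4 + 0.4 = 0.8, so A wins even though
# a plain majority vote would have picked B.
print(weighted_vote([("A", 0.9), ("B", 0.4), ("B", 0.4)]))
```

This is exactly the failure mode of unweighted majority voting: it treats a weak model's vote as equal to a strong model's.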

Ensemble is most effective when models have divergent strengths: Model A excels at math, Model B at creative writing, Model C at factual queries. Through ensemble, the system can automatically leverage each model’s strengths across different tasks.

Cost and Diminishing Returns

The biggest challenge of multi-model collaboration is diminishing returns. Experimental data shows:

As the number of participating models increases, quality improvement shows rapidly diminishing marginal returns while costs grow linearly. The first few models typically bring the most significant quality gains, with each additional model contributing progressively less. For most applications, 2-3 models is the cost-effectiveness sweet spot.

[Figure: Model count vs. cost and quality. Quality improvement diminishes while cost scales linearly: going from 1 to 3 models adds about 6% quality at 3x cost, but going from 3 to 10 adds only about 2% at a further 3.3x cost. The sweet spot sits at 2-3 models.]

Another hidden cost is latency. When calling multiple models in parallel, total latency is determined by the slowest model. If called sequentially, latencies accumulate. For real-time applications, this can be a dealbreaker.
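A back-of-envelope comparison makes the gap concrete. The per-model latencies below are hypothetical:

```python
# Hypothetical per-model response latencies in milliseconds.
latencies_ms = [800, 1200, 950]

# Parallel fan-out: total latency is bounded by the slowest model.
parallel_ms = max(latencies_ms)

# Sequential calls: every model's latency accumulates.
sequential_ms = sum(latencies_ms)

print(parallel_ms, sequential_ms)
```

Even in the parallel case, adding a synthesis call on top pushes end-to-end latency above the slowest member, which matches the rough "slowest model x 1.2" estimate for Council Mode.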

Real-world systems need to weigh trade-offs based on the scenario:

  • High-value tasks (e.g., medical diagnosis, legal consultation): prioritize quality, can accept an ensemble of 5-10 models.
  • Medium tasks (e.g., content generation, code review): Council Mode with 2-3 models.
  • Low-value tasks (e.g., simple Q&A, format conversion): single-model routing is sufficient.

Summary

This article completes the final stop on the Model Routing learning path. Starting from the simplest classifier routing, we journeyed through cascade routing, hybrid strategies, online learning, and finally arrived at multi-model collaboration — from “pick the best one” to “use many together.”

The core advantages of multi-model collaboration are robustness and complementarity, at the cost of expense and complexity. As model capabilities improve and costs decline, this field is shifting from a “luxury” to a “standard configuration.”

The future trend is adaptive orchestration: systems dynamically decide — based on request characteristics, historical performance, and budget constraints — whether to use a single model or multiple models, parallel or sequential, flat or hierarchical. Model Routing is no longer a static configuration but a real-time intelligent decision.

The next decade of LLM applications is not about “which model is the strongest,” but about “how to combine the strengths of all models.”