Content on this site is AI-generated and may contain errors. If you find issues, please report at GitHub Issues.

Multi-Model Collaboration: From Picking One to Using Many


Updated 2026-04-06

Important distinction: MoE (Mixture of Experts) is an expert routing mechanism within a single model, while MoA (Mixture of Agents) is collaboration between multiple complete LLM models — the granularity is entirely different.

The previous articles all discussed “how to choose the single best model,” but there is another approach: why choose at all? Can we use multiple models simultaneously and synthesize their answers? This is the core idea behind Mixture of Agents (MoA).

Selection vs Synthesis

The traditional paradigm of Model Routing is Selection: given a request, the router picks the single most suitable model. This approach is efficient and cost-controllable, but may miss the strengths of other models.

The Synthesis paradigm is fundamentally different: call multiple models, collect all their answers, and merge them using some strategy. This approach costs more but leverages the complementarity between models, reducing single-model hallucination and bias.

[Figure: Two Philosophies, Select One (Routing) vs Synthesize Many (MoA). In Routing, a query passes through a router to exactly one model (e.g., GPT-4, Claude, or Llama), producing one answer. Routing assumes a single "best model" exists: lowest cost (one model call) and lowest latency (single inference), but quality is capped by the router's accuracy and the chosen model's ceiling.]

Selection pursues efficiency; synthesis pursues quality. Real-world systems often seek a balance between the two: use selection for simple requests and synthesis for critical ones.

Council Mode: Parallel Generation, Centralized Synthesis

Council Mode (2026) is a representative architecture for multi-model collaboration. Its workflow consists of two phases:

  1. Parallel Generation Phase: Multiple LLMs (called council members) independently generate answers, unaware of each other’s outputs.
  2. Synthesis Phase: A synthesizer model collects all answers, analyzes their consensus and divergence, and produces the final answer.
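The two phases above can be sketched in a few lines. This is a minimal illustration, not a real implementation: the council-member functions and the synthesizer below are hypothetical stubs standing in for actual LLM API calls.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical council members -- in a real system each would call a
# different LLM provider. They never see each other's outputs.
def member_a(query: str) -> str:
    return f"A's answer to: {query}"

def member_b(query: str) -> str:
    return f"B's answer to: {query}"

def member_c(query: str) -> str:
    return f"C's answer to: {query}"

def synthesizer(query: str, answers: list[str]) -> str:
    # A real synthesizer would be another LLM call that weighs
    # consensus and divergence; this stub just reports what it saw.
    return f"Synthesis of {len(answers)} answers for: {query}"

def council(query: str) -> str:
    members = [member_a, member_b, member_c]
    # Phase 1: parallel, independent generation.
    with ThreadPoolExecutor(max_workers=len(members)) as pool:
        answers = list(pool.map(lambda m: m(query), members))
    # Phase 2: centralized synthesis.
    return synthesizer(query, answers)

print(council("What causes tides?"))
```

Because phase 1 fans out in parallel, its wall-clock time is bounded by the slowest member rather than the sum of all members.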

Experiments show that Council Mode can reduce the hallucination rate by 35.9%. When council members’ answers are highly consistent, the synthesizer has higher confidence; when answers diverge significantly, the synthesizer flags uncertainty.

[Figure: Council Mode, multi-LLM parallel synthesis. Three models answer in parallel (e.g., GPT-4o at 92% quality, Claude 3.5 at 90%, Gemini 1.5 at 88%); a synthesis layer merges them into a single answer at 96% quality, with a 35.9% reduction in hallucinations. Cost: 3 parallel models means roughly 3x a single model's cost plus the synthesis call. Latency: the max of the 3 model latencies plus synthesis time, roughly 1.2x the slowest model. Note again that MoA is multi-LLM collaboration, while MoE is intra-model expert routing.]

Three common synthesis strategies:

  • Merge: Extract the core information from each answer and construct a comprehensive answer incorporating all perspectives.
  • Majority Vote: Select the answer that appears most frequently — suitable for classification or multiple-choice tasks.
  • Best-of-N: Have the synthesizer score all answers and select the highest-quality one.
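Of the three strategies, Merge requires an LLM call, but Majority Vote and Best-of-N can be expressed directly. A minimal sketch, in which the length-based judge is a deliberately simplistic stand-in for a real judge model:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    # Most frequent answer wins; Counter breaks ties by first appearance.
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers: list[str], judge) -> str:
    # The judge assigns each answer a score; keep the top scorer.
    return max(answers, key=judge)

votes = ["Paris", "Paris", "Lyon"]
print(majority_vote(votes))                 # the 2-vote answer wins
print(best_of_n(votes, judge=len))          # toy judge: prefer longer text
```

Majority Vote only makes sense when answers are directly comparable (classification, multiple choice); for free-form text, Best-of-N with a judge model is the more natural fit.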

Hierarchical MoA: From Flat to Pyramid

Simple Council Mode uses a flat structure where all models are equal. More complex systems use a hierarchical structure, building decision trees or pyramids.

A hierarchical multi-agent approach runs a mini-council at each decision node, where each layer’s output becomes the next layer’s input. This enables handling complex tasks that require multi-step reasoning.

Pyramid MoA is an architecture proposed by Together AI, with layers that progressively narrow:

  • Layer 1: 5 general-purpose models generate initial answers.
  • Layer 2: 3 models synthesize Layer 1’s outputs.
  • Layer 3: 1 strongest model produces the final answer.

This design leverages decision-theoretic routing: early layers quickly filter out obviously wrong answers, while later layers focus on refinement. The system can decide whether to early-stop based on intermediate layer consensus, saving computational costs.
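The narrowing layers and consensus-based early stop can be sketched as follows. The layer widths, the agreement metric, and the 0.8 threshold are all illustrative assumptions, and `stub` stands in for real model calls:

```python
def agreement(answers: list[str]) -> float:
    """Fraction of answers matching the most common answer."""
    top = max(set(answers), key=answers.count)
    return answers.count(top) / len(answers)

def pyramid_moa(query: str, layers, threshold: float = 0.8) -> str:
    """layers: lists of model functions with narrowing widths, e.g. 5 -> 3 -> 1.
    Each model takes (query, prior_answers) and returns an answer."""
    prior: list[str] = []
    for models in layers:
        prior = [m(query, prior) for m in models]
        # Early stop: if this layer already agrees, skip the deeper layers.
        if agreement(prior) >= threshold:
            break
    return max(set(prior), key=prior.count)

def stub(ans: str):
    # Hypothetical model that ignores its inputs and returns a fixed answer.
    return lambda query, prior: ans

# All five Layer-1 models agree, so the pyramid terminates early.
layers = [[stub("42")] * 5, [stub("42")] * 3, [stub("42")]]
print(pyramid_moa("q", layers))
```

In practice the early-stop check is where the cost savings come from: a unanimous first layer means the two deeper (and typically more expensive) layers are never invoked.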

[Figure: Hierarchical MoA architectures. Flat MoA runs all models in parallel in a single layer and aggregates: simple, but with limited quality gain. Hierarchical MoA refines over multiple layers (e.g., Layer 1: LLM-A through LLM-D; Layer 2: two aggregators; Layer 3: a final aggregator), each layer aggregating and passing its output to the next for further improvement. Pyramid MoA uses decreasing layer widths (5 to 3 to 1), with a router deciding when the answer is "good enough" for early termination. HieraMAS combines intra-node LLM mixing with inter-node communication.]

The advantage of hierarchical structure lies in flexibility: different layers can use models of different sizes (fast models at the bottom, high-quality models at the top), finding the optimal configuration between cost and quality.

Ensemble and Voting

The core of Ensemble Learning is diversity: if all models make the same mistakes, voting won’t help. An ideal ensemble should include:

  • Models with different architectures (e.g., GPT-4, Claude, Gemini).
  • Models of different sizes (large models excel at reasoning, small models at speed).
  • Models trained on different data (to reduce shared biases).

Three common voting mechanisms:

  1. Majority Voting: Each model gets one vote; the answer with the most votes wins. Simple but ignores differences in model quality.
  2. Weighted Voting: Assign each model a weight $w_i$ based on its historical accuracy; an answer's score is $\sum_i w_i \cdot \mathbb{1}[\text{model}_i = \text{answer}]$.
  3. Best-of-N Selection: Use a judge model to score all answers and select the highest scorer.
[Figure: Ensemble voting methods (Majority, Weighted, Best-of-N). Example: GPT-4o and Claude both vote A, Gemini votes B; majority voting gives A the win, 2 votes to 1.]
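Weighted voting implements the score formula above directly: each model contributes its weight to the answer it chose, and the highest-scoring answer wins. A minimal sketch, with illustrative historical-accuracy weights:

```python
from collections import defaultdict

def weighted_vote(votes: list[tuple[str, float]]) -> str:
    """votes: (answer, weight) pairs, one per model.
    score(answer) = sum of weights of the models that chose it."""
    scores: dict[str, float] = defaultdict(float)
    for answer, weight in votes:
        scores[answer] += weight
    return max(scores, key=scores.get)

# One historically accurate model outvotes two weaker ones:
# A scores 0.9, B scores 0.4 + 0.4 = 0.8, so A wins even though
# a plain majority vote would have picked B.
print(weighted_vote([("A", 0.9), ("B", 0.4), ("B", 0.4)]))
```

This is exactly the failure mode of unweighted majority voting: it treats a weak model's vote as equal to a strong model's.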

Ensemble is most effective when models have divergent strengths: Model A excels at math, Model B at creative writing, Model C at factual queries. Through ensemble, the system can automatically leverage each model’s strengths across different tasks.

Cost and Diminishing Returns

The biggest challenge of multi-model collaboration is diminishing returns. Experimental data shows:

As the number of participating models increases, quality improvement shows rapidly diminishing marginal returns while costs grow linearly. The first few models typically bring the most significant quality gains, with each additional model contributing progressively less. For most applications, 2-3 models is the cost-effectiveness sweet spot.

[Figure: Model count vs. cost and quality. Quality improvement diminishes while cost scales linearly: going from 1 to 3 models adds about 6% quality at 3x cost, but going from 3 to 10 adds only about 2% at a further 3.3x cost. The sweet spot sits at 2-3 models.]

Another hidden cost is latency. When calling multiple models in parallel, total latency is determined by the slowest model. If called sequentially, latencies accumulate. For real-time applications, this can be a dealbreaker.
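A back-of-envelope comparison makes the gap concrete. The per-model latencies below are hypothetical:

```python
# Hypothetical per-model response latencies in milliseconds.
latencies_ms = [800, 1200, 950]

# Parallel fan-out: total latency is bounded by the slowest model.
parallel_ms = max(latencies_ms)

# Sequential calls: every model's latency accumulates.
sequential_ms = sum(latencies_ms)

print(parallel_ms, sequential_ms)
```

Even in the parallel case, adding a synthesis call on top pushes end-to-end latency above the slowest member, which matches the rough "slowest model x 1.2" estimate for Council Mode.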

Real-world systems need to weigh trade-offs based on the scenario:

  • High-value tasks (e.g., medical diagnosis, legal consultation): prioritize quality, can accept an ensemble of 5-10 models.
  • Medium tasks (e.g., content generation, code review): Council Mode with 2-3 models.
  • Low-value tasks (e.g., simple Q&A, format conversion): single-model routing is sufficient.

Summary

This article completes the final stop on the Model Routing learning path. Starting from the simplest classifier routing, we journeyed through cascade routing, hybrid strategies, online learning, and finally arrived at multi-model collaboration — from “pick the best one” to “use many together.”

The core advantages of multi-model collaboration are robustness and complementarity, at the cost of expense and complexity. As model capabilities improve and costs decline, this field is shifting from a “luxury” to a “standard configuration.”

The future trend is adaptive orchestration: systems dynamically decide — based on request characteristics, historical performance, and budget constraints — whether to use a single model or multiple models, parallel or sequential, flat or hierarchical. Model Routing is no longer a static configuration but a real-time intelligent decision.

The next decade of LLM applications is not about “which model is the strongest,” but about “how to combine the strengths of all models.”