Interpreting Leaderboards and Model Selection
Updated 2026-04-14
Opening: Dozens of Models on the Leaderboard — Which One Should I Pick?
Open any LLM leaderboard and you will see dozens of models with dozens of score columns: MMLU 92.3%, HumanEval 91.7%, Arena ELO 1320… And the rankings differ across leaderboards. Model A ranks third on Chatbot Arena but drops to tenth on the Open LLM Leaderboard.
This makes practical model selection difficult: Which leaderboard should I consult for my specific use case? Which scores actually matter? And once I pick a model, how do I verify it truly works well on my task?
This article is the final piece in the LLM Evaluation and Benchmark Deep Dive learning path. The preceding six articles covered evaluation methodology, the design principles of various benchmarks, and their limitations. Now we bring all that knowledge to bear on one practical problem — model selection.
Comparing Major Leaderboards
There are currently four highly influential LLM leaderboards, each with a distinct positioning, methodology, and set of applicable scenarios.
Chatbot Arena (Arena)
Chatbot Arena (now rebranded as Arena) was launched by LMSYS in April 2023 and is currently the most widely recognized leaderboard for overall model capability.
- Method: Real users submit the same question to two anonymous models, blind-evaluate which answer is better, and votes are converted into ELO ratings
- Scale: Millions of cumulative human votes covering 100+ models
- Categories: Supports viewing rankings by Overall, Coding, Hard Prompts, Math, Creative Writing, and more
- Strength: Directly reflects real user preferences and is resistant to gaming — each battle prompt comes from a random user question
- ELO range: Top models typically score in the 1250-1500+ range, enabling precise ranking of inter-model differences
Open LLM Leaderboard V2
The Open LLM Leaderboard, maintained by Hugging Face, focuses on open-source / open-weight models.
- Method: Automated evaluation using lm-evaluation-harness; V2 uses six benchmarks: MMLU-Pro, GPQA, MuSR, MATH Lvl 5, BBH, and IFEval (a minimal run sketch follows this list)
- Scale: Tracks thousands of open-source models with over 13,000 community likes
- Strength: Fully reproducible — anyone can obtain the same results using the same configuration; Apache 2.0 licensed
- Limitation: Covers only open-source models; static benchmarks carry data contamination risk
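If you want to reproduce this style of evaluation locally, the harness exposes a simple Python entry point. Below is a minimal sketch: the model and task names are illustrative (the leaderboard pins specific task configs), and it assumes `pip install lm-eval` plus a GPU.

```python
# Minimal lm-evaluation-harness run in the spirit of the Open LLM Leaderboard.
# Model and task names are illustrative; check your installed harness version
# for the exact task identifiers the leaderboard pins.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct,dtype=bfloat16",
    tasks=["ifeval", "bbh"],  # two of the six V2 benchmarks
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```

Because the configuration is fully specified, two people running this on the same hardware should get the same numbers, which is the reproducibility guarantee the leaderboard is built on.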
LiveBench
LiveBench is a dynamic benchmark specifically designed to combat data contamination.
- Method: New questions are drawn monthly from the latest math competitions, academic papers, and news events, covering six categories: math, code, reasoning, language comprehension, instruction following, and data analysis
- Strength: Questions are sourced from information published after model training cutoff dates, making them naturally immune to data contamination; fully automated scoring with no LLM-as-Judge bias
- Use case: Verifying a model’s “true capability” — if a model scores 95% on MMLU but is mediocre on LiveBench, data contamination is likely
Artificial Analysis
Artificial Analysis is the only leaderboard that covers all three dimensions at once: quality, performance, and price.
- Dimensions: Intelligence Index (quality), Price (per million tokens), Output Speed (tokens/s), Latency / TTFT (time to first token), Context Window
- Scale: Tracks 475+ models (including roughly 216 open-weight models) from 20+ providers
- Use case: When you need to make quality-cost trade-offs, this is the most practical reference — it can directly tell you “how much intelligence you get per dollar”
Quick Reference Table for the Four Major Leaderboards
| Leaderboard | Core Method | Coverage | Best For |
|---|---|---|---|
| Chatbot Arena | ELO blind evaluation | Closed-source + Open-source | Overall conversational ability ranking |
| Open LLM Leaderboard | lm-eval automation | Open-source only | Open-source model selection |
| LiveBench | Dynamic question generation | Closed-source + Open-source | Verifying true capability, anti-contamination |
| Artificial Analysis | Multi-dimensional benchmarking | Closed-source + Open-source | Quality-price-speed trade-offs |
Leaderboard Pitfalls
Leaderboards are useful references, but blindly trusting leaderboard scores is the most common mistake in model selection. Here are several pitfalls you must watch out for.
Pitfall 1: Large Ranking Discrepancies Across Leaderboards
The same model can rank very differently across leaderboards. The reason is straightforward: they measure different things. Chatbot Arena measures user preference (influenced by prompt distribution), the Open LLM Leaderboard measures academic benchmark scores (susceptible to data contamination), and Artificial Analysis also factors in speed and price dimensions.
Countermeasure: Clarify which dimensions you care about and consult the corresponding leaderboard. Do not expect a single leaderboard to answer all questions.
Pitfall 2: Benchmark Gaming
Some model teams optimize specifically for certain benchmarks, using tactics such as:
- Mixing benchmark data into instruction tuning
- Fine-tuning prompt templates to match evaluation formats
- Selectively reporting favorable scores
Countermeasure: Cross-validate — if a model posts an outsized score on one benchmark but is mediocre on similar benchmarks, be skeptical. Dynamic benchmarks like LiveBench are naturally resistant to this.
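The cross-check is mechanical enough to script. A tiny sketch, with all scores and the gap threshold purely illustrative:

```python
# Flag models whose static-benchmark score far exceeds their score on a
# comparable dynamic benchmark. Scores and threshold are illustrative only.
def contamination_suspects(scores: dict[str, dict[str, float]],
                           gap_threshold: float = 15.0) -> list[str]:
    """scores: model -> {'static': ..., 'dynamic': ...} on comparable suites."""
    return [model for model, s in scores.items()
            if s["static"] - s["dynamic"] > gap_threshold]

scores = {
    "model-a": {"static": 95.0, "dynamic": 62.0},  # large gap: investigate
    "model-b": {"static": 88.0, "dynamic": 81.0},  # consistent: plausible
}
print(contamination_suspects(scores))  # -> ['model-a']
```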
Pitfall 3: Metrics Disconnected from Real-World Experience
High benchmark scores do not equal great real-world experience. A model scoring 95% on MMLU may perform poorly on your specific Chinese legal Q&A scenario. Reasons include:
- The task distribution covered by the benchmark differs from your use case
- Benchmarks typically test short text, while your scenario may require long contexts
- Language distribution imbalance — most benchmarks are primarily in English
Countermeasure: Use benchmarks as a screening tool; the final selection must be validated on your own data through a mini evaluation.
Pitfall 4: Sample Bias in Arena
Chatbot Arena voters are predominantly from technical communities (developers, researchers), and their questions skew toward technical topics and English. This means Arena rankings have limited reference value for “general Chinese conversation” or “customer service for non-technical users.”
Countermeasure: Check Arena’s category-specific rankings (e.g., Hard Prompts, Math, Coding) and find the sub-leaderboard most relevant to your scenario.
Scenario-Based Selection Framework: The Four-Step Method
To deal with leaderboard complexity, we propose a practical four-step selection framework:
Step 1: Define Your Task Type
What is your core task? Different tasks have different “gold-standard benchmarks”:
| Task Type | Core Benchmark | Why This One |
|---|---|---|
| General conversation | Chatbot Arena + MT-Bench | Most directly reflects conversational experience |
| Code generation | SWE-bench + LiveCodeBench | Real project-level issues + dynamic anti-contamination |
| Reasoning / Math | MMLU-Pro + GPQA Diamond + MATH | Coverage from undergraduate to PhD level |
| Agent / Tool calling | BFCL v3 + WebArena | Function calling + end-to-end agent capability |
Step 2: Determine Your Constraints
Technical constraints often matter more than model capability in determining your choice:
- Latency requirements: Real-time (<1s TTFT) -> exclude large models or opt for streaming; interactive (1-10s) -> most models are viable
- Deployment mode: Cloud API, local deployment, or hybrid?
- Hardware limitations: GPU memory for local deployment determines available model sizes and quantization precision
- Budget: APIs charge per token, and pricing across models can differ by up to 100x (a quick cost sketch follows this list)
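Back-of-the-envelope math makes the 100x figure concrete. In the sketch below, all per-token prices are hypothetical placeholders; check Artificial Analysis or the provider's pricing page for current numbers.

```python
# Rough monthly API cost estimate. All prices are hypothetical placeholders.
def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Estimated monthly spend in USD for one model at a steady request rate."""
    daily = requests_per_day * (input_tokens * usd_per_m_input +
                                output_tokens * usd_per_m_output) / 1_000_000
    return daily * 30

# 10,000 requests/day, ~800 input and ~400 output tokens per request:
print(monthly_cost(10_000, 800, 400, 0.15, 0.60))   # lightweight: ~$108/month
print(monthly_cost(10_000, 800, 400, 15.0, 75.0))   # flagship: ~$12,600/month
```

The same workload can differ by two orders of magnitude in monthly spend, which is why the budget constraint often decides the choice before capability does.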
Step 3: Choose a Benchmark Combination
Based on the results of the first two steps, select 3-5 of the most relevant benchmarks. Core principles:
- Include at least one dynamic benchmark (LiveBench or LiveCodeBench) to guard against data contamination
- Include at least one benchmark directly corresponding to your task type
- If cost matters, incorporate Artificial Analysis cost-effectiveness data
Step 4: Mini Evaluation
Leaderboard scores are for “screening”; the final decision must be validated on your own data. A mini evaluation does not need to be large-scale:
- Prepare 50-100 representative samples — covering your typical input distribution
- Design evaluation criteria — define what constitutes a “good answer” for your scenario
- A/B compare candidate models — using LLM-as-Judge or human blind evaluation
- Record latency and cost — measure in your actual deployment environment
Rule of thumb: 50 samples are enough to distinguish clear capability gaps; if two models are hard to tell apart on 50 samples, they are genuinely close for your scenario — pick the cheaper or faster one.
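A minimal sketch of the A/B step with an LLM judge. The `call_model` and `call_judge` helpers are hypothetical placeholders for whatever API client you use; the order randomization counters the position bias LLM judges are known to exhibit.

```python
# Minimal A/B mini evaluation with an LLM judge.
# `call_model` and `call_judge` are hypothetical stand-ins for your API client.
import random
from collections import Counter

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer A: {a}
Answer B: {b}
Reply with exactly one of: A, B, TIE."""

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your API client here")

def call_judge(prompt: str) -> str:
    raise NotImplementedError("wire up your judge model here")

def mini_eval(samples: list[str], model_x: str, model_y: str) -> Counter:
    tally = Counter()
    for question in samples:
        answers = {m: call_model(m, question) for m in (model_x, model_y)}
        # Randomize which model appears as "A" to counter position bias.
        first, second = random.sample([model_x, model_y], k=2)
        verdict = call_judge(JUDGE_PROMPT.format(
            question=question, a=answers[first], b=answers[second]))
        verdict = verdict.strip().upper()
        if verdict == "A":
            tally[first] += 1
        elif verdict == "B":
            tally[second] += 1
        else:
            tally["tie"] += 1
    return tally
```

Log latency and token counts inside the same loop so step 4 comes for free from a single run.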
Deep Dive: Chatbot Arena Evaluation Mechanism
Among all leaderboards, Chatbot Arena has the greatest influence on industry model selection. Here we examine its inner workings and known limitations in detail.
Blind Evaluation Mechanism
The core of Chatbot Arena is the Anonymous Battle:
- A user submits a prompt
- The system randomly selects two models (the user does not know which)
- Both models respond simultaneously
- The user chooses: A is better / B is better / Tie / Both Bad
- After voting, model identities are revealed
This mechanism ensures evaluation fairness — users cannot favor a model due to brand preference.
ELO Rating System
Voting results are converted to ELO scores via the Bradley-Terry model. The core idea is:
- Defeating a high-rated model earns more points
- Losing to a low-rated model costs more points
- After sufficiently many battles, the ELO stabilizes
Chatbot Arena uses bootstrap methods to compute confidence intervals; typically, thousands of votes are needed to establish statistically significant ranking differences among top models.
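The mechanics are easy to demonstrate on toy data. Arena today fits a Bradley-Terry model rather than running sequential updates, but the classic online Elo update below captures the intuition in the list above, and the bootstrap loop mirrors, in greatly simplified form, how the published confidence intervals are obtained.

```python
# Toy Elo ratings from pairwise battle records, with a bootstrap interval.
# A simplified illustration, not the actual Arena pipeline.
import random
from collections import defaultdict

def compute_elo(battles, k=4, base=1000.0):
    """battles: list of (model_a, model_b, winner), winner in {'a','b','tie'}."""
    rating = defaultdict(lambda: base)
    for a, b, winner in battles:
        expected_a = 1 / (1 + 10 ** ((rating[b] - rating[a]) / 400))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # The update is proportional to surprise: upsets move ratings more.
        rating[a] += k * (score_a - expected_a)
        rating[b] -= k * (score_a - expected_a)
    return dict(rating)

def bootstrap_interval(battles, model, rounds=200):
    """Approximate 95% interval for one model's rating via resampling."""
    estimates = sorted(
        compute_elo(random.choices(battles, k=len(battles))).get(model, 1000.0)
        for _ in range(rounds))
    return estimates[int(0.025 * rounds)], estimates[int(0.975 * rounds)]
```

The width of that interval is why adjacent leaderboard positions are often statistical ties: until enough votes accumulate, neighboring models' intervals overlap.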
Category Rankings
Arena provides multiple sub-leaderboards to help you find the ranking most relevant to your scenario:
- Overall: Aggregate ranking across all votes
- Hard Prompts: Only counts prompts tagged as “difficult,” reflecting a model’s ability to handle complex requests
- Coding: Rankings for code-related prompts
- Math: Rankings for mathematical reasoning prompts
- Creative Writing: Rankings for creative writing prompts
- Longer Query: Performance on long-prompt scenarios
Known Limitations
- User population bias: Voters are predominantly English-speaking technical users, with weaker coverage of Chinese and other languages
- Prompt distribution bias: Technical and creative prompts are overrepresented; everyday conversational prompts are underrepresented
- Length preference: Research shows users tend to prefer longer answers — even when content quality is the same. Arena is continually refining debiasing methods
- Formatting preference: Answers using Markdown formatting (lists, bold text) are more likely to receive votes
- Cannot decompose capabilities: A single ELO score compresses information across all dimensions, making it unsuitable for fine-grained capability assessment
Practical advice: When consulting Arena rankings, prioritize the category-specific ranking that matches your scenario rather than Overall. If your scenario involves Chinese conversation, Arena rankings have limited reference value — running your own mini evaluation is recommended.
Small Model Selection: Local Deployment Scenarios
For local deployment scenarios, model selection is more constrained by hardware and requires more careful evaluation.
Major Small Model Families
| Model Family | Representative Sizes | Characteristics | Suitable Scenarios |
|---|---|---|---|
| Qwen2.5 | 0.5B / 3B / 7B / 14B / 72B | Strong bilingual (Chinese-English) performance; excellent at code and math | Chinese-first scenarios |
| Llama 3.1/3.3 | 8B / 70B | One of the strongest open-source English models; richest community ecosystem | English-first + rich tooling |
| Gemma 2 | 2B / 9B / 27B | High training data quality; good inference efficiency | General tasks + resource-constrained environments |
| Phi-4 | 14B | Data quality driven; strong reasoning and math capabilities | Reasoning-intensive tasks |
| Mistral / Mixtral | 7B / 8x7B | Architectural innovations (sliding window attention, MoE) | Long context + high throughput |
Choosing a Quantization Scheme
Quantization is nearly mandatory for local deployment. The trade-offs for different quantization levels:
| Quantization Level | VRAM Requirement (7B) | Typical Accuracy Loss | Suitable Scenarios |
|---|---|---|---|
| FP16 | ~14 GB | Baseline | First choice when VRAM is sufficient |
| INT8 | ~7 GB | <1% | Best balance of performance and accuracy |
| INT4 (GPTQ/AWQ) | ~4 GB | 1-5% | Primary option when VRAM is limited |
| INT4 (GGUF Q4_K_M) | ~4.5 GB | 2-6% | CPU + llama.cpp deployment |
| INT3/INT2 | ~2-3 GB | 5-15% | Extremely resource-constrained, experimental only |
Key insight: Quantization affects different tasks differently — code and math tasks are most sensitive, while conversational tasks are relatively tolerant. For detailed quantization accuracy evaluation methods, see Impact of Optimization on Accuracy.
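The VRAM column follows from simple arithmetic: weight memory is roughly parameter count times bytes per weight, with the KV cache and activations adding overhead on top. The sketch below shows the estimate plus one way to load a 4-bit model via transformers with bitsandbytes; the model name is illustrative and a CUDA GPU is assumed.

```python
# Rough VRAM estimate for model weights (KV cache and activations add more).
def weight_vram_gb(params_billions: float, bits: int) -> float:
    return params_billions * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B @ {bits}-bit: ~{weight_vram_gb(7, bits):.1f} GB")
# 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB (weights only)

# Loading a model in 4-bit NF4 via transformers + bitsandbytes.
# Assumes `pip install transformers bitsandbytes` and a CUDA GPU;
# the model name is illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```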
A Note on Intel GPUs
If you are using an Intel Arc GPU or integrated GPU (iGPU), you can perform INT4 quantized inference through the OpenVINO toolchain. Suitable model sizes are 7B-13B. For specific deployment optimization details, refer to the intel-igpu-inference path.
Interactive Decision Tool: Find Your Model
With the theory covered, use the interactive tool below to put it into action. Answer a few questions about your scenario and get personalized benchmark combination and model range recommendations.
[Interactive widget: a three-step model selection decision tree. Answer a few questions, starting with your core task, to get a personalized recommendation.]
Tip: This decision tree provides an initial direction. After receiving a recommendation, be sure to follow the four-step method above and run a mini evaluation to confirm your final choice.
From “Pick One Model” to “Dynamically Select Models”
By this point, you know how to select a suitable model based on leaderboards and scenario constraints. But in real production environments, a more powerful strategy is — don’t pick just one; use them all.
This is the idea behind Model Routing:
- Simple requests (e.g., “translate this sentence”) -> route to a small model or lightweight API (e.g., GPT-4o-mini) for fast response and low cost
- Complex requests (e.g., “analyze the security vulnerabilities in this code and provide a fix”) -> route to a large model or flagship API (e.g., Claude Opus) to ensure quality
- Routing strategies can be based on classifiers, cascade verification, or even online RL learning
Benchmark data plays a key role in routing systems:
- Each model’s scores across different benchmarks form a capability profile
- Capability profile + task classifier = the data foundation for routing rules (sketched after this list)
- Artificial Analysis cost-effectiveness data is used directly for cost optimization
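A minimal sketch of such a router. Every model name, profile score, price, and the keyword classifier below are hypothetical placeholders; production routers typically replace the classifier with a trained model or a cascade.

```python
# Toy model router: classify the request, then pick the cheapest model whose
# capability-profile score clears a per-task bar. All values are hypothetical.
CAPABILITY_PROFILE = {
    # model:      task -> benchmark-derived score (0-100), plus $/1M tokens
    "small-fast": {"chat": 70, "code": 55, "math": 50, "price": 0.3},
    "mid-tier":   {"chat": 82, "code": 74, "math": 72, "price": 3.0},
    "flagship":   {"chat": 90, "code": 88, "math": 87, "price": 20.0},
}
TASK_BAR = {"chat": 65, "code": 80, "math": 80}  # minimum acceptable score

def classify(prompt: str) -> str:
    """Stand-in for a trained task classifier."""
    lowered = prompt.lower()
    if any(w in lowered for w in ("code", "bug", "function", "vulnerab")):
        return "code"
    if any(w in lowered for w in ("prove", "solve", "equation", "integral")):
        return "math"
    return "chat"

def route(prompt: str) -> str:
    task = classify(prompt)
    eligible = [(p["price"], name) for name, p in CAPABILITY_PROFILE.items()
                if p[task] >= TASK_BAR[task]]
    return min(eligible)[1]  # cheapest model that clears the bar

print(route("translate this sentence"))                            # small-fast
print(route("analyze the security vulnerabilities in this code"))  # flagship
```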
This is the evolution from static selection to dynamic selection. If you chose “hybrid deployment” in the decision tree above, model routing is your next stop.
Continue reading -> the Model Routing: Intelligent Model Selection and Hybrid Inference path, to evolve from “pick one model” to “let the system automatically choose the optimal model.”
Recommended Learning Resources
Leaderboard Direct Links
- Chatbot Arena: The most authoritative overall capability ranking, based on real user anonymous battles
- Open LLM Leaderboard V2: The standard reference for open-source models, powered by lm-eval-harness
- LiveBench: Dynamic benchmark, naturally resistant to data contamination
- Artificial Analysis: The only leaderboard covering quality, speed, and price simultaneously
Further Reading
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023) — The design paper for Chatbot Arena, detailing the ELO rating system and LLM-as-Judge methodology
- From Crowdsourced Data to High-Quality Benchmarks: Arena Hard and BenchBuilder Pipeline (Li et al., 2024) — How Arena-Hard distills high-discrimination evaluation sets from Arena data
Path Summary
This article is the final piece in the LLM Evaluation and Benchmark Deep Dive learning path. A review of the full path:
- Benchmark Landscape and Evaluation Methodology — Establishing a global classification framework
- Knowledge and Reasoning Benchmarks Deep Dive — MMLU-Pro, GPQA, MATH, and more
- Code Benchmarks Deep Dive — HumanEval, SWE-bench, LiveCodeBench
- Agent Benchmarks Deep Dive — BFCL, WebArena, GAIA
- LLM Benchmark Standard Set — The standard evaluation suite in technical reports
- Impact of Optimization on Accuracy — Quantization accuracy evaluation methodology
- Interpreting Leaderboards and Model Selection (this article) — From leaderboards to practical model selection