Interpreting Leaderboards and Model Selection
Updated 2026-04-14
Opening: Dozens of Models on the Leaderboard — Which One Should I Pick?
Open any LLM leaderboard and you will see dozens of models with dozens of score columns: MMLU 92.3%, HumanEval 91.7%, Arena ELO 1320… And the rankings differ across leaderboards. Model A ranks third on Chatbot Arena but drops to tenth on the Open LLM Leaderboard.
This makes practical model selection difficult: Which leaderboard should I consult for my specific use case? Which scores actually matter? And once I pick a model, how do I verify it truly works well on my task?
This article is the final piece in the LLM Evaluation and Benchmark Deep Dive learning path. The preceding six articles covered evaluation methodology, the design principles of various benchmarks, and their limitations. Now we bring all that knowledge to bear on one practical problem — model selection.
Comparing Major Leaderboards
There are currently four highly influential LLM leaderboards, each with a distinct positioning, methodology, and set of applicable scenarios.
Chatbot Arena (Arena)
Chatbot Arena (now rebranded as Arena) was launched by LMSYS in April 2023 and is currently the most widely recognized leaderboard for overall model capability.
- Method: Real users submit the same question to two anonymous models, blind-evaluate which answer is better, and votes are converted into ELO ratings
- Scale: Millions of cumulative human votes covering 100+ models
- Categories: Supports viewing rankings by Overall, Coding, Hard Prompts, Math, Creative Writing, and more
- Strength: Directly reflects real user preferences and is resistant to gaming — each battle prompt comes from a random user question
- ELO range: Top models typically score in the 1250-1500+ range, enabling precise ranking of inter-model differences
Open LLM Leaderboard V2
The Open LLM Leaderboard, maintained by Hugging Face, focuses on open-source / open-weight models.
- Method: Automated evaluation using lm-evaluation-harness; V2 uses six benchmarks: MMLU-Pro, GPQA, MuSR, MATH Lvl 5, BBH, and IFEval (a minimal run sketch follows this list)
- Scale: Tracks thousands of open-source models with over 13,000 community likes
- Strength: Fully reproducible — anyone can obtain the same results using the same configuration; Apache 2.0 licensed
- Limitation: Covers only open-source models; static benchmarks carry data contamination risk
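If you want to reproduce this style of evaluation locally, the harness exposes a simple Python entry point. Below is a minimal sketch: the model and task names are illustrative (the leaderboard pins specific task configs), and it assumes `pip install lm-eval` plus a GPU.

```python
# Minimal lm-evaluation-harness run in the spirit of the Open LLM Leaderboard.
# Model and task names are illustrative; check your installed harness version
# for the exact task identifiers the leaderboard pins.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct,dtype=bfloat16",
    tasks=["ifeval", "bbh"],  # two of the six V2 benchmarks
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```

Because the configuration is fully specified, two people running this on the same hardware should get the same numbers, which is the reproducibility guarantee the leaderboard is built on.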
LiveBench
LiveBench is a dynamic benchmark specifically designed to combat data contamination.
- Method: New questions are drawn monthly from the latest math competitions, academic papers, and news events, covering six categories: math, code, reasoning, language comprehension, instruction following, and data analysis
- Strength: Questions are sourced from information published after model training cutoff dates, making them naturally immune to data contamination; fully automated scoring with no LLM-as-Judge bias
- Use case: Verifying a model’s “true capability” — if a model scores 95% on MMLU but is mediocre on LiveBench, data contamination is likely
Artificial Analysis
Artificial Analysis is the only leaderboard that covers all three dimensions at once: quality, performance, and price.
- Dimensions: Intelligence Index (quality), Price (per million tokens), Output Speed (tokens/s), Latency / TTFT (time to first token), Context Window
- Scale: Tracks 475+ models (including roughly 216 open-weight models) from 20+ providers
- Use case: When you need to make quality-cost trade-offs, this is the most practical reference — it can directly tell you “how much intelligence you get per dollar”
Quick Reference Table for the Four Major Leaderboards
| Leaderboard | Core Method | Coverage | Best For |
|---|---|---|---|
| Chatbot Arena | ELO blind evaluation | Closed-source + Open-source | Overall conversational ability ranking |
| Open LLM Leaderboard | lm-eval automation | Open-source only | Open-source model selection |
| LiveBench | Dynamic question generation | Closed-source + Open-source | Verifying true capability, anti-contamination |
| Artificial Analysis | Multi-dimensional benchmarking | Closed-source + Open-source | Quality-price-speed trade-offs |
Leaderboard Pitfalls
Leaderboards are useful references, but blindly trusting leaderboard scores is the most common mistake in model selection. Here are several pitfalls you must watch out for.
Pitfall 1: Large Ranking Discrepancies Across Leaderboards
The same model can rank very differently across leaderboards. The reason is straightforward: they measure different things. Chatbot Arena measures user preference (influenced by prompt distribution), the Open LLM Leaderboard measures academic benchmark scores (susceptible to data contamination), and Artificial Analysis also factors in speed and price dimensions.
Countermeasure: Clarify which dimensions you care about and consult the corresponding leaderboard. Do not expect a single leaderboard to answer all questions.
Pitfall 2: Benchmark Gaming
Some model teams optimize specifically for certain benchmarks, using tactics such as:
- Mixing benchmark data into instruction tuning
- Fine-tuning prompt templates to match evaluation formats
- Selectively reporting favorable scores
Countermeasure: Cross-validate — if a model posts an outsized score on one benchmark but is mediocre on similar benchmarks, be skeptical. Dynamic benchmarks like LiveBench are naturally resistant to this.
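The cross-check is mechanical enough to script. A tiny sketch, with all scores and the gap threshold purely illustrative:

```python
# Flag models whose static-benchmark score far exceeds their score on a
# comparable dynamic benchmark. Scores and threshold are illustrative only.
def contamination_suspects(scores: dict[str, dict[str, float]],
                           gap_threshold: float = 15.0) -> list[str]:
    """scores: model -> {'static': ..., 'dynamic': ...} on comparable suites."""
    return [model for model, s in scores.items()
            if s["static"] - s["dynamic"] > gap_threshold]

scores = {
    "model-a": {"static": 95.0, "dynamic": 62.0},  # large gap: investigate
    "model-b": {"static": 88.0, "dynamic": 81.0},  # consistent: plausible
}
print(contamination_suspects(scores))  # -> ['model-a']
```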
Pitfall 3: Metrics Disconnected from Real-World Experience
High benchmark scores do not equal great real-world experience. A model scoring 95% on MMLU may perform poorly on your specific Chinese legal Q&A scenario. Reasons include:
- The task distribution covered by the benchmark differs from your use case
- Benchmarks typically test short text, while your scenario may require long contexts
- Language distribution imbalance — most benchmarks are primarily in English
Countermeasure: Use benchmarks as a screening tool; the final selection must be validated on your own data through a mini evaluation.
Pitfall 4: Sample Bias in Arena
Chatbot Arena voters are predominantly from technical communities (developers, researchers), and their questions skew toward technical topics and English. This means Arena rankings have limited reference value for “general Chinese conversation” or “customer service for non-technical users.”
Countermeasure: Check Arena’s category-specific rankings (e.g., Hard Prompts, Math, Coding) and find the sub-leaderboard most relevant to your scenario.
Scenario-Based Selection Framework: The Four-Step Method
To deal with leaderboard complexity, we propose a practical four-step selection framework:
Step 1: Define Your Task Type
What is your core task? Different tasks have different “gold-standard benchmarks”:
| Task Type | Core Benchmark | Why This One |
|---|---|---|
| General conversation | Chatbot Arena + MT-Bench | Most directly reflects conversational experience |
| Code generation | SWE-bench + LiveCodeBench | Real project-level issues + dynamic anti-contamination |
| Reasoning / Math | MMLU-Pro + GPQA Diamond + MATH | Coverage from undergraduate to PhD level |
| Agent / Tool calling | BFCL v3 + WebArena | Function calling + end-to-end agent capability |
Step 2: Determine Your Constraints
Technical constraints often matter more than model capability in determining your choice:
- Latency requirements: Real-time (<1s TTFT) -> exclude large models or opt for streaming; interactive (1-10s) -> most models are viable
- Deployment mode: Cloud API, local deployment, or hybrid?
- Hardware limitations: GPU memory for local deployment determines available model sizes and quantization precision
- Budget: APIs charge per token, and pricing across models can differ by up to 100x (a quick cost sketch follows this list)
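Back-of-the-envelope math makes the 100x figure concrete. In the sketch below, all per-token prices are hypothetical placeholders; check Artificial Analysis or the provider's pricing page for current numbers.

```python
# Rough monthly API cost estimate. All prices are hypothetical placeholders.
def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Estimated monthly spend in USD for one model at a steady request rate."""
    daily = requests_per_day * (input_tokens * usd_per_m_input +
                                output_tokens * usd_per_m_output) / 1_000_000
    return daily * 30

# 10,000 requests/day, ~800 input and ~400 output tokens per request:
print(monthly_cost(10_000, 800, 400, 0.15, 0.60))   # lightweight: ~$108/month
print(monthly_cost(10_000, 800, 400, 15.0, 75.0))   # flagship: ~$12,600/month
```

The same workload can differ by two orders of magnitude in monthly spend, which is why the budget constraint often decides the choice before capability does.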
Step 3: Choose a Benchmark Combination
Based on the results of the first two steps, select 3-5 of the most relevant benchmarks. Core principles:
- Include at least one dynamic benchmark (LiveBench or LiveCodeBench) to guard against data contamination
- Include at least one benchmark directly corresponding to your task type
- If cost matters, incorporate Artificial Analysis cost-effectiveness data
Step 4: Mini Evaluation
Leaderboard scores are for “screening”; the final decision must be validated on your own data. A mini evaluation does not need to be large-scale:
- Prepare 50-100 representative samples — covering your typical input distribution
- Design evaluation criteria — define what constitutes a “good answer” for your scenario
- A/B compare candidate models — using LLM-as-Judge or human blind evaluation
- Record latency and cost — measure in your actual deployment environment
Rule of thumb: 50 samples are enough to distinguish clear capability gaps; if two models are hard to tell apart on 50 samples, they are genuinely close for your scenario — pick the cheaper or faster one.
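A minimal sketch of the A/B step with an LLM judge. The `call_model` and `call_judge` helpers are hypothetical placeholders for whatever API client you use; the order randomization counters the position bias LLM judges are known to exhibit.

```python
# Minimal A/B mini evaluation with an LLM judge.
# `call_model` and `call_judge` are hypothetical stand-ins for your API client.
import random
from collections import Counter

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer A: {a}
Answer B: {b}
Reply with exactly one of: A, B, TIE."""

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your API client here")

def call_judge(prompt: str) -> str:
    raise NotImplementedError("wire up your judge model here")

def mini_eval(samples: list[str], model_x: str, model_y: str) -> Counter:
    tally = Counter()
    for question in samples:
        answers = {m: call_model(m, question) for m in (model_x, model_y)}
        # Randomize which model appears as "A" to counter position bias.
        first, second = random.sample([model_x, model_y], k=2)
        verdict = call_judge(JUDGE_PROMPT.format(
            question=question, a=answers[first], b=answers[second]))
        verdict = verdict.strip().upper()
        if verdict == "A":
            tally[first] += 1
        elif verdict == "B":
            tally[second] += 1
        else:
            tally["tie"] += 1
    return tally
```

Log latency and token counts inside the same loop so step 4 comes for free from a single run.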
Deep Dive: Chatbot Arena Evaluation Mechanism
Among all leaderboards, Chatbot Arena has the greatest influence on industry model selection. Here we examine its inner workings and known limitations in detail.
Blind Evaluation Mechanism
The core of Chatbot Arena is the Anonymous Battle:
- A user submits a prompt
- The system randomly selects two models (the user does not know which)
- Both models respond simultaneously
- The user chooses: A is better / B is better / Tie / Both Bad
- After voting, model identities are revealed
This mechanism ensures evaluation fairness — users cannot favor a model due to brand preference.
ELO Rating System
Voting results are converted to ELO scores via the Bradley-Terry model. The core idea is:
- Defeating a high-rated model earns more points
- Losing to a low-rated model costs more points
- After sufficiently many battles, the ELO stabilizes
Chatbot Arena uses bootstrap methods to compute confidence intervals; typically, thousands of votes are needed to establish statistically significant ranking differences among top models.
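The mechanics are easy to demonstrate on toy data. Arena today fits a Bradley-Terry model rather than running sequential updates, but the classic online Elo update below captures the intuition in the list above, and the bootstrap loop mirrors, in greatly simplified form, how the published confidence intervals are obtained.

```python
# Toy Elo ratings from pairwise battle records, with a bootstrap interval.
# A simplified illustration, not the actual Arena pipeline.
import random
from collections import defaultdict

def compute_elo(battles, k=4, base=1000.0):
    """battles: list of (model_a, model_b, winner), winner in {'a','b','tie'}."""
    rating = defaultdict(lambda: base)
    for a, b, winner in battles:
        expected_a = 1 / (1 + 10 ** ((rating[b] - rating[a]) / 400))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # The update is proportional to surprise: upsets move ratings more.
        rating[a] += k * (score_a - expected_a)
        rating[b] -= k * (score_a - expected_a)
    return dict(rating)

def bootstrap_interval(battles, model, rounds=200):
    """Approximate 95% interval for one model's rating via resampling."""
    estimates = sorted(
        compute_elo(random.choices(battles, k=len(battles))).get(model, 1000.0)
        for _ in range(rounds))
    return estimates[int(0.025 * rounds)], estimates[int(0.975 * rounds)]
```

The width of that interval is why adjacent leaderboard positions are often statistical ties: until enough votes accumulate, neighboring models' intervals overlap.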
Category Rankings
Arena provides multiple sub-leaderboards to help you find the ranking most relevant to your scenario:
- Overall: Aggregate ranking across all votes
- Hard Prompts: Only counts prompts tagged as “difficult,” reflecting a model’s ability to handle complex requests
- Coding: Rankings for code-related prompts
- Math: Rankings for mathematical reasoning prompts
- Creative Writing: Rankings for creative writing prompts
- Longer Query: Performance on long-prompt scenarios
Known Limitations
- User population bias: Voters are predominantly English-speaking technical users, with weaker coverage of Chinese and other languages
- Prompt distribution bias: Technical and creative prompts are overrepresented; everyday conversational prompts are underrepresented
- Length preference: Research shows users tend to prefer longer answers — even when content quality is the same. Arena is continually refining debiasing methods
- Formatting preference: Answers using Markdown formatting (lists, bold text) are more likely to receive votes
- Cannot decompose capabilities: A single ELO score compresses information across all dimensions, making it unsuitable for fine-grained capability assessment
Practical advice: When consulting Arena rankings, prioritize the category-specific ranking that matches your scenario rather than Overall. If your scenario involves Chinese conversation, Arena rankings have limited reference value — running your own mini evaluation is recommended.
Small Model Selection: Local Deployment Scenarios
For local deployment scenarios, model selection is more constrained by hardware and requires more careful evaluation.
Major Small Model Families
| Model Family | Representative Sizes | Characteristics | Suitable Scenarios |
|---|---|---|---|
| Qwen2.5 | 0.5B / 3B / 7B / 14B / 72B | Strong bilingual (Chinese-English) performance; excellent at code and math | Chinese-first scenarios |
| Llama 3.1/3.3 | 8B / 70B | One of the strongest open-source English models; richest community ecosystem | English-first + rich tooling |
| Gemma 2 | 2B / 9B / 27B | High training data quality; good inference efficiency | General tasks + resource-constrained environments |
| Phi-4 | 14B | Data quality driven; strong reasoning and math capabilities | Reasoning-intensive tasks |
| Mistral / Mixtral | 7B / 8x7B | Architectural innovations (sliding window attention, MoE) | Long context + high throughput |
Choosing a Quantization Scheme
Quantization is nearly mandatory for local deployment. The trade-offs for different quantization levels:
| Quantization Level | VRAM Requirement (7B) | Typical Accuracy Loss | Suitable Scenarios |
|---|---|---|---|
| FP16 | ~14 GB | Baseline | First choice when VRAM is sufficient |
| INT8 | ~7 GB | <1% | Best balance of performance and accuracy |
| INT4 (GPTQ/AWQ) | ~4 GB | 1-5% | Primary option when VRAM is limited |
| INT4 (GGUF Q4_K_M) | ~4.5 GB | 2-6% | CPU + llama.cpp deployment |
| INT3/INT2 | ~2-3 GB | 5-15% | Extremely resource-constrained, experimental only |
Key insight: Quantization affects different tasks differently — code and math tasks are most sensitive, while conversational tasks are relatively tolerant. For detailed quantization accuracy evaluation methods, see Impact of Optimization on Accuracy.
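The VRAM column follows from simple arithmetic: weight memory is roughly parameter count times bytes per weight, with the KV cache and activations adding overhead on top. The sketch below shows the estimate plus one way to load a 4-bit model via transformers with bitsandbytes; the model name is illustrative and a CUDA GPU is assumed.

```python
# Rough VRAM estimate for model weights (KV cache and activations add more).
def weight_vram_gb(params_billions: float, bits: int) -> float:
    return params_billions * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B @ {bits}-bit: ~{weight_vram_gb(7, bits):.1f} GB")
# 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB (weights only)

# Loading a model in 4-bit NF4 via transformers + bitsandbytes.
# Assumes `pip install transformers bitsandbytes` and a CUDA GPU;
# the model name is illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```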
A Note on Intel GPUs
If you are using an Intel Arc GPU or integrated GPU (iGPU), you can perform INT4 quantized inference through the OpenVINO toolchain. Suitable model sizes are 7B-13B. For specific deployment optimization details, refer to the intel-igpu-inference path.
Interactive Decision Tool: Find Your Model
With the theory covered, use the interactive tool below to put it into action. Answer a few questions about your scenario and get personalized benchmark combination and model range recommendations.
[Interactive widget: a three-step model selection decision tree. Answer a few questions, starting with your core task, to get a personalized recommendation.]
Tip: This decision tree provides an initial direction. After receiving a recommendation, be sure to follow the four-step method above and run a mini evaluation to confirm your final choice.
From “Pick One Model” to “Dynamically Select Models”
By this point, you know how to select a suitable model based on leaderboards and scenario constraints. But in real production environments, a more powerful strategy is — don’t pick just one; use them all.
This is the idea behind Model Routing:
- Simple requests (e.g., “translate this sentence”) -> route to a small model or lightweight API (e.g., GPT-4o-mini) for fast response and low cost
- Complex requests (e.g., “analyze the security vulnerabilities in this code and provide a fix”) -> route to a large model or flagship API (e.g., Claude Opus) to ensure quality
- Routing strategies can be based on classifiers, cascade verification, or even online RL learning
Benchmark data plays a key role in routing systems:
- Each model’s scores across different benchmarks form a capability profile
- Capability profile + task classifier = the data foundation for routing rules (sketched after this list)
- Artificial Analysis cost-effectiveness data is used directly for cost optimization
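A minimal sketch of such a router. Every model name, profile score, price, and the keyword classifier below are hypothetical placeholders; production routers typically replace the classifier with a trained model or a cascade.

```python
# Toy model router: classify the request, then pick the cheapest model whose
# capability-profile score clears a per-task bar. All values are hypothetical.
CAPABILITY_PROFILE = {
    # model:      task -> benchmark-derived score (0-100), plus $/1M tokens
    "small-fast": {"chat": 70, "code": 55, "math": 50, "price": 0.3},
    "mid-tier":   {"chat": 82, "code": 74, "math": 72, "price": 3.0},
    "flagship":   {"chat": 90, "code": 88, "math": 87, "price": 20.0},
}
TASK_BAR = {"chat": 65, "code": 80, "math": 80}  # minimum acceptable score

def classify(prompt: str) -> str:
    """Stand-in for a trained task classifier."""
    lowered = prompt.lower()
    if any(w in lowered for w in ("code", "bug", "function", "vulnerab")):
        return "code"
    if any(w in lowered for w in ("prove", "solve", "equation", "integral")):
        return "math"
    return "chat"

def route(prompt: str) -> str:
    task = classify(prompt)
    eligible = [(p["price"], name) for name, p in CAPABILITY_PROFILE.items()
                if p[task] >= TASK_BAR[task]]
    return min(eligible)[1]  # cheapest model that clears the bar

print(route("translate this sentence"))                            # small-fast
print(route("analyze the security vulnerabilities in this code"))  # flagship
```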
This is the evolution from static selection to dynamic selection. If you chose “hybrid deployment” in the decision tree above, model routing is your next stop.
Continue reading -> the Model Routing: Intelligent Model Selection and Hybrid Inference path, to evolve from “pick one model” to “let the system automatically choose the optimal model.”
Recommended Learning Resources
Leaderboard Direct Links
- Chatbot Arena: The most authoritative overall capability ranking, based on real user anonymous battles
- Open LLM Leaderboard V2: The standard reference for open-source models, powered by lm-eval-harness
- LiveBench: Dynamic benchmark, naturally resistant to data contamination
- Artificial Analysis: The only leaderboard covering quality, speed, and price simultaneously
Further Reading
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023) — The design paper for Chatbot Arena, detailing the ELO rating system and LLM-as-Judge methodology
- From Crowdsourced Data to High-Quality Benchmarks: Arena Hard and BenchBuilder Pipeline (Li et al., 2024) — How Arena-Hard distills high-discrimination evaluation sets from Arena data
Path Summary
This article is the final piece in the LLM Evaluation and Benchmark Deep Dive learning path. A review of the full path:
- Benchmark Landscape and Evaluation Methodology — Establishing a global classification framework
- Knowledge and Reasoning Benchmarks Deep Dive — MMLU-Pro, GPQA, MATH, and more
- Code Benchmarks Deep Dive — HumanEval, SWE-bench, LiveCodeBench
- Agent Benchmarks Deep Dive — BFCL, WebArena, GAIA
- LLM Benchmark Standard Set — The standard evaluation suite in technical reports
- Impact of Optimization on Accuracy — Quantization accuracy evaluation methodology
- Interpreting Leaderboards and Model Selection (this article) — From leaderboards to practical model selection