Content on this site is AI-generated and may contain errors. If you find issues, please report them via GitHub Issues.

Interpreting Leaderboards and Model Selection

Updated 2026-04-14

Opening: Dozens of Models on the Leaderboard — Which One Should I Pick?

Open any LLM leaderboard and you will see dozens of models with dozens of score columns: MMLU 92.3%, HumanEval 91.7%, Arena ELO 1320… And the rankings differ across leaderboards. Model A ranks third on Chatbot Arena but drops to tenth on the Open LLM Leaderboard.

This makes practical model selection difficult: Which leaderboard should I consult for my specific use case? Which scores actually matter? And once I pick a model, how do I verify it truly works well on my task?

This article is the final piece in the LLM Evaluation and Benchmark learning path. The preceding six articles covered evaluation methodology, the design principles of various benchmarks, and their limitations. Now we bring all that knowledge to bear on one practical problem — model selection.

Comparing Major Leaderboards

There are currently four highly influential LLM leaderboards, each with a distinct positioning, methodology, and set of applicable scenarios.

Chatbot Arena (Arena)

Chatbot Arena (now rebranded as Arena) was launched by LMSYS in April 2023 and is currently the most widely recognized leaderboard for overall model capability.

  • Method: Real users submit the same question to two anonymous models, blind-evaluate which answer is better, and votes are converted into ELO ratings
  • Scale: Millions of cumulative human votes covering 100+ models
  • Categories: Supports viewing rankings by Overall, Coding, Hard Prompts, Math, Creative Writing, and more
  • Strength: Directly reflects real user preferences and is resistant to gaming — each battle prompt comes from a random user question
  • ELO range: Top models typically score between 1250-1500+, enabling precise ranking of inter-model differences

Open LLM Leaderboard V2

The Open LLM Leaderboard, maintained by Hugging Face, focuses on open-source / open-weight models.

  • Method: Automated evaluation using lm-evaluation-harness; V2 uses 6 benchmarks: MMLU-Pro, GPQA, MuSR, MATH Lvl 5, BBH, and IFEval
  • Scale: Tracks thousands of open-source models with over 13,000 community likes
  • Strength: Fully reproducible — anyone can obtain the same results using the same configuration; Apache 2.0 licensed
  • Limitation: Covers only open-source models; static benchmarks carry data contamination risk

LiveBench

LiveBench is a dynamic benchmark specifically designed to combat data contamination.

  • Method: New questions are drawn monthly from the latest math competitions, academic papers, and news events, covering six categories: math, code, reasoning, language comprehension, instruction following, and data analysis
  • Strength: Questions are sourced from information published after model training cutoff dates, making them naturally immune to data contamination; fully automated scoring with no LLM-as-Judge bias
  • Use case: Verifying a model’s “true capability” — if a model scores 95% on MMLU but delivers only mediocre results on LiveBench, data contamination is likely

Artificial Analysis

Artificial Analysis is the only major leaderboard that covers all three dimensions at once: quality, performance, and price.

  • Dimensions: Intelligence Index (quality), Price (per million tokens), Output Speed (tokens/s), Latency / TTFT (time to first token), Context Window
  • Scale: Tracks 475+ models (including roughly 216 open-weight models) from 20+ providers
  • Use case: When you need to make quality-cost trade-offs, this is the most practical reference — it can directly tell you “how much intelligence you get per dollar”

Quick Reference Table for the Four Major Leaderboards

| Leaderboard | Core Method | Coverage | Best For |
| --- | --- | --- | --- |
| Chatbot Arena | ELO blind evaluation | Closed-source + open-source | Overall conversational ability ranking |
| Open LLM Leaderboard | lm-eval automation | Open-source only | Open-source model selection |
| LiveBench | Dynamic question generation | Closed-source + open-source | Verifying true capability, anti-contamination |
| Artificial Analysis | Multi-dimensional benchmarking | Closed-source + open-source | Quality-price-speed trade-offs |

Leaderboard Pitfalls

Leaderboards are useful references, but blindly trusting leaderboard scores is the most common mistake in model selection. Here are several pitfalls you must watch out for.

Pitfall 1: Large Ranking Discrepancies Across Leaderboards

The same model can rank very differently across leaderboards. The reason is straightforward: they measure different things. Chatbot Arena measures user preference (influenced by prompt distribution), the Open LLM Leaderboard measures academic benchmark scores (susceptible to data contamination), and Artificial Analysis also factors in speed and price dimensions.

Countermeasure: Clarify which dimensions you care about and consult the corresponding leaderboard. Do not expect a single leaderboard to answer all questions.

Pitfall 2: Benchmark Gaming

Some model teams optimize specifically for certain benchmarks, including:

  • Mixing benchmark data into instruction tuning
  • Fine-tuning prompt templates to match evaluation formats
  • Selectively reporting favorable scores

Countermeasure: Cross-validate — if a model drastically outperforms on one benchmark but is mediocre on similar benchmarks, be skeptical. Dynamic benchmarks like LiveBench are naturally resistant to this.

Pitfall 3: Metrics Disconnected from Real-World Experience

High benchmark scores do not equal great real-world experience. A model scoring 95% on MMLU may perform poorly on your specific Chinese legal Q&A scenario. Reasons include:

  • The task distribution covered by the benchmark differs from your use case
  • Benchmarks typically test short text, while your scenario may require long contexts
  • Language distribution imbalance — most benchmarks are primarily in English

Countermeasure: Use benchmarks as a screening tool; the final selection must be validated on your own data through a mini evaluation.

Pitfall 4: Sample Bias in Arena

Chatbot Arena voters are predominantly from technical communities (developers, researchers), and their questions skew toward technical topics and English. This means Arena rankings have limited reference value for “general Chinese conversation” or “customer service for non-technical users.”

Countermeasure: Check Arena’s category-specific rankings (e.g., Hard Prompts, Math, Coding) and find the sub-leaderboard most relevant to your scenario.

Scenario-Based Selection Framework: The Four-Step Method

To deal with leaderboard complexity, we propose a practical four-step selection framework:

Step 1: Define Your Task Type

What is your core task? Different tasks have different “gold-standard benchmarks”:

| Task Type | Core Benchmark | Why This One |
| --- | --- | --- |
| General conversation | Chatbot Arena + MT-Bench | Most directly reflects conversational experience |
| Code generation | SWE-bench + LiveCodeBench | Real project-level issues + dynamic anti-contamination |
| Reasoning / Math | MMLU-Pro + GPQA Diamond + MATH | Coverage from undergraduate to PhD level |
| Agent / Tool calling | BFCL v3 + WebArena | Function calling + end-to-end agent capability |

Step 2: Determine Your Constraints

Technical constraints often matter more than model capability in determining your choice:

  • Latency requirements: Real-time (<1s TTFT) -> exclude large models or opt for streaming; interactive (1-10s) -> most models are viable
  • Deployment mode: Cloud API, local deployment, or hybrid?
  • Hardware limitations: GPU memory for local deployment determines available model sizes and quantization precision
  • Budget: APIs charge per token, and pricing across models can differ by up to 100x
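The budget constraint is easy to quantify before you ever run a model. The sketch below estimates monthly API spend from traffic assumptions; the per-million-token prices are illustrative placeholders, not current list prices for any provider.

```python
# Rough monthly API cost screen for candidate models. The per-million-token
# prices used below are illustrative placeholders, not current list prices.

def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Estimated USD per month, assuming 30 days of steady traffic."""
    per_request = (input_tokens * price_in_per_m
                   + output_tokens * price_out_per_m) / 1_000_000
    return per_request * requests_per_day * 30

# Example: 10k requests/day, 1k input + 500 output tokens per request
flagship = monthly_cost(10_000, 1_000, 500, 3.00, 15.00)  # hypothetical flagship pricing
small = monthly_cost(10_000, 1_000, 500, 0.15, 0.60)      # hypothetical small-model pricing
print(f"flagship: ${flagship:,.0f}/mo  small: ${small:,.0f}/mo")
```

Running this kind of estimate for each candidate makes the price gaps mentioned above concrete: at the placeholder prices here, the two models differ by well over an order of magnitude per month.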

Step 3: Choose a Benchmark Combination

Based on the results of the first two steps, select 3-5 of the most relevant benchmarks. Core principles:

  • Include at least one dynamic benchmark (LiveBench or LiveCodeBench) to guard against data contamination
  • Include at least one benchmark directly corresponding to your task type
  • If cost matters, incorporate Artificial Analysis cost-effectiveness data

Step 4: Mini Evaluation

Leaderboard scores are for “screening”; the final decision must be validated on your own data. A mini evaluation does not need to be large-scale:

  1. Prepare 50-100 representative samples — covering your typical input distribution
  2. Design evaluation criteria — define what constitutes a “good answer” for your scenario
  3. A/B compare candidate models — using LLM-as-Judge or human blind evaluation
  4. Record latency and cost — measure in your actual deployment environment

Rule of thumb: 50 samples are enough to distinguish clear capability gaps; if two models are hard to tell apart on 50 samples, they are genuinely close for your scenario — pick the cheaper or faster one.
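The A/B comparison in step 3 can be sketched in a few lines. This is a minimal harness, not a full evaluation framework: `model_a`, `model_b`, and `judge` are hypothetical callables standing in for your API calls and your LLM-as-Judge (or human) verdict function.

```python
import random
from collections import Counter

def blind_ab(samples, model_a, model_b, judge):
    """A/B-compare two models with per-sample position randomization.

    judge(prompt, left_answer, right_answer) returns 0 (left wins),
    1 (right wins), or "tie" — it never sees which model produced which side.
    """
    tally = Counter()
    for prompt in samples:
        answers = {"A": model_a(prompt), "B": model_b(prompt)}
        order = ["A", "B"]
        random.shuffle(order)  # randomize which model appears on which side
        verdict = judge(prompt, answers[order[0]], answers[order[1]])
        if verdict == "tie":
            tally["tie"] += 1
        else:
            tally[order[verdict]] += 1  # map the judged side back to the model
    return tally
```

Randomizing the A/B position per sample matters because LLM judges (and humans) exhibit position bias; without the shuffle, whichever model is always shown first gets a systematic edge.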

Deep Dive: Chatbot Arena Evaluation Mechanism

Among all leaderboards, Chatbot Arena has the greatest influence on industry model selection. Here we examine its inner workings and known limitations in detail.

Blind Evaluation Mechanism

The core of Chatbot Arena is the Anonymous Battle:

  1. A user submits a prompt
  2. The system randomly selects two models (the user does not know which)
  3. Both models respond simultaneously
  4. The user chooses: A is better / B is better / Tie / Both Bad
  5. After voting, model identities are revealed

This mechanism ensures evaluation fairness — users cannot favor a model due to brand preference.

ELO Rating System

Voting results are converted to ELO scores via the Bradley-Terry model. The core idea is:

  • Defeating a high-rated model earns more points
  • Losing to a low-rated model costs more points
  • After sufficiently many battles, the ELO stabilizes

Chatbot Arena uses bootstrap methods to compute confidence intervals; typically, thousands of votes are needed to establish statistically significant ranking differences among top models.
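The update intuition described above can be written down directly. Note this is the classic online ELO form for illustration; as stated above, Arena actually fits a Bradley-Terry model over all votes jointly and reports bootstrap confidence intervals.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One online ELO update after a single battle.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    This online form only illustrates the intuition; Arena fits a
    Bradley-Terry model over all votes at once.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)  # zero-sum: A gains what B loses
    return r_a + delta, r_b - delta

# Beating a higher-rated model moves ratings more than beating a lower one:
underdog_win = elo_update(1250.0, 1350.0, 1.0)
favorite_win = elo_update(1350.0, 1250.0, 1.0)
```

Because each single vote moves ratings only a little, and votes are noisy, thousands of battles per model are needed before the bootstrap confidence intervals separate neighboring models on the leaderboard.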

Category Rankings

Arena provides multiple sub-leaderboards to help you find the ranking most relevant to your scenario:

  • Overall: Aggregate ranking across all votes
  • Hard Prompts: Only counts prompts tagged as “difficult,” reflecting a model’s ability to handle complex requests
  • Coding: Rankings for code-related prompts
  • Math: Rankings for mathematical reasoning prompts
  • Creative Writing: Rankings for creative writing prompts
  • Longer Query: Performance on long-prompt scenarios

Known Limitations

  1. User population bias: Voters are predominantly English-speaking technical users, with weaker coverage of Chinese and other languages
  2. Prompt distribution bias: Technical and creative prompts are overrepresented; everyday conversational prompts are underrepresented
  3. Length preference: Research shows users tend to prefer longer answers — even when content quality is the same. Arena is continually refining debiasing methods
  4. Formatting preference: Answers using Markdown formatting (lists, bold text) are more likely to receive votes
  5. Cannot decompose capabilities: A single ELO score compresses information across all dimensions, making it unsuitable for fine-grained capability assessment

Practical advice: When consulting Arena rankings, prioritize the category-specific ranking that matches your scenario rather than Overall. If your scenario involves Chinese conversation, Arena rankings have limited reference value — running your own mini evaluation is recommended.

Small Model Selection: Local Deployment Scenarios

For local deployment scenarios, model selection is more constrained by hardware and requires more careful evaluation.

Major Small Model Families

| Model Family | Representative Sizes | Characteristics | Suitable Scenarios |
| --- | --- | --- | --- |
| Qwen2.5 | 0.5B / 3B / 7B / 14B / 72B | Strong bilingual (Chinese-English) performance; excellent at code and math | Chinese-first scenarios |
| Llama 3.1/3.3 | 8B / 70B | One of the strongest open-source English models; richest community ecosystem | English-first + rich tooling |
| Gemma 2 | 2B / 9B / 27B | High training data quality; good inference efficiency | General tasks + resource-constrained environments |
| Phi-4 | 14B | Data quality driven; strong reasoning and math capabilities | Reasoning-intensive tasks |
| Mistral / Mixtral | 7B / 8x7B | Architectural innovations (sliding window attention, MoE) | Long context + high throughput |

Choosing a Quantization Scheme

Quantization is nearly mandatory for local deployment. The trade-offs for different quantization levels:

| Quantization Level | VRAM Requirement (7B) | Typical Accuracy Loss | Suitable Scenarios |
| --- | --- | --- | --- |
| FP16 | ~14 GB | Baseline | First choice when VRAM is sufficient |
| INT8 | ~7 GB | <1% | Best balance of performance and accuracy |
| INT4 (GPTQ/AWQ) | ~4 GB | 1-5% | Primary option when VRAM is limited |
| INT4 (GGUF Q4_K_M) | ~4.5 GB | 2-6% | CPU + llama.cpp deployment |
| INT3/INT2 | ~2-3 GB | 5-15% | Extremely resource-constrained, experimental only |

Key insight: Quantization affects different tasks differently — code and math tasks are most sensitive, while conversational tasks are relatively tolerant. For detailed quantization accuracy evaluation methods, see Impact of Optimization on Accuracy.
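The VRAM figures above follow from simple arithmetic, which you can reuse for any model size. This is a weights-only back-of-envelope estimate; KV cache and activation memory are ignored, so add roughly 10-30% headroom in practice.

```python
# Back-of-envelope VRAM estimate for model weights only. KV cache and
# activation memory are ignored, so add ~10-30% headroom in practice.

BITS_PER_PARAM = {"FP16": 16, "INT8": 8, "INT4": 4, "INT3": 3, "INT2": 2}

def weight_vram_gb(params_billions: float, quant: str) -> float:
    """Approximate weight memory in GB (1B params at 8 bits ~= 1 GB)."""
    return params_billions * BITS_PER_PARAM[quant] / 8

# A 7B model: ~14 GB at FP16, ~3.5 GB at INT4 (before quantization overhead
# such as scales and zero-points, which is why real INT4 files run ~4 GB)
fp16_gb = weight_vram_gb(7, "FP16")
int4_gb = weight_vram_gb(7, "INT4")
```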

A Note on Intel GPUs

If you are using an Intel Arc GPU or integrated GPU (iGPU), you can perform INT4 quantized inference through the OpenVINO toolchain. Suitable model sizes are 7B-13B. For specific deployment optimization details, refer to the intel-igpu-inference path.

Interactive Decision Tool: Find Your Model

With the theory covered, use the interactive tool below to turn it into action. Answer a few questions about your scenario and get personalized benchmark combination and model range recommendations.

Model Selection Decision Tree

Answer a few questions to get personalized recommendations

Step 1 of 3

What is your core task?

Tip: This decision tree provides an initial direction. After receiving a recommendation, be sure to follow the four-step method above and run a mini evaluation to confirm your final choice.

From “Pick One Model” to “Dynamically Select Models”

By this point, you know how to select a suitable model based on leaderboards and scenario constraints. But in real production environments, a more powerful strategy is — don’t pick just one; use them all.

This is the idea behind Model Routing:

  1. Simple requests (e.g., “translate this sentence”) -> route to a small model or lightweight API (e.g., GPT-4o-mini) for fast response and low cost
  2. Complex requests (e.g., “analyze the security vulnerabilities in this code and provide a fix”) -> route to a large model or flagship API (e.g., Claude Opus) to ensure quality
  3. Routing strategies can be based on classifiers, cascade verification, or even online RL learning
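The routing idea above can be sketched with a toy rule-based router. The model names, length threshold, and keyword heuristics here are all illustrative placeholders; as noted above, production routers typically use a trained classifier or cascade verification instead of keyword rules.

```python
# Toy rule-based router. Model names, the length threshold, and the keyword
# list are illustrative placeholders, not recommendations; real routers
# usually replace these rules with a trained task classifier.

HARD_HINTS = ("analyze", "prove", "debug", "vulnerability", "step by step")

def route(prompt: str) -> str:
    """Send complex-looking requests to the flagship, the rest to a small model."""
    looks_hard = (len(prompt) > 400
                  or any(hint in prompt.lower() for hint in HARD_HINTS))
    return "flagship-model" if looks_hard else "small-model"
```

In a production system, the keyword rules would be replaced by a classifier trained against each model's capability profile, so that routing decisions track measured benchmark strengths rather than surface features of the prompt.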

Benchmark data plays a key role in routing systems:

  • Each model’s scores across different benchmarks form a capability profile
  • Capability profile + task classifier = the data foundation for routing rules
  • Artificial Analysis cost-effectiveness data is used directly for cost optimization

This is the evolution from static selection to dynamic selection. If you chose “hybrid deployment” in the decision tree above, model routing is your next stop.

Continue reading -> the Model Routing: Intelligent Model Selection and Hybrid Inference path, to evolve from “pick one model” to “let the system automatically choose the optimal model.”

Key Takeaways
  • Chatbot Arena: The most authoritative overall capability ranking, based on real user anonymous battles
  • Open LLM Leaderboard V2: The standard reference for open-source models, powered by lm-eval-harness
  • LiveBench: Dynamic benchmark, naturally resistant to data contamination
  • Artificial Analysis: The only leaderboard covering quality, speed, and price simultaneously

Further Reading

Path Summary

This article is the final piece in the LLM Evaluation and Benchmark Deep Dive learning path. A review of the full path:

  1. Benchmark Landscape and Evaluation Methodology — Establishing a global classification framework
  2. Knowledge and Reasoning Benchmarks Deep Dive — MMLU-Pro, GPQA, MATH, and more
  3. Code Benchmarks Deep Dive — HumanEval, SWE-bench, LiveCodeBench
  4. Agent Benchmarks Deep Dive — BFCL, WebArena, GAIA
  5. LLM Benchmark Standard Set — The standard evaluation suite in technical reports
  6. Impact of Optimization on Accuracy — Quantization accuracy evaluation methodology
  7. Interpreting Leaderboards and Model Selection (this article) — From leaderboards to practical model selection