Agent & Tool Use Benchmarks
Updated 2026-04-14
Introduction: How Do We Systematically Evaluate API Calling, Browser Operation, and Multi-Step Task Completion?
When OpenAI announced that GPT-4 supports function calling, LLMs leaped from “conversational assistants” to “tool users.” When Anthropic enabled Claude to operate computers and Google let Gemini invoke search and code executors, models further evolved into “autonomous agents.”
But this raises the question: when a model claims it can use tools, how do we verify that objectively? How often does it call the right API? On a multi-step task, does it go off track by step three? When a tool returns an error, can it recover autonomously?
In the previous article, we dove deep into code evaluation — from HumanEval to the evolution of SWE-bench. Now we enter Agent and Tool Use evaluation — the dimension of LLM evaluation closest to “real-world application scenarios.” This article will answer:
- What levels can Agent capabilities be broken into? What evaluation approach corresponds to each level?
- How does BFCL (Berkeley Function Calling Leaderboard) systematically evaluate function calling?
- Why is GAIA designed to be “easy for humans, hard for AI”?
- From WebArena to tau-bench, how does Agent evaluation in interactive environments work?
The Hierarchy of Agent Capabilities
Agent capability is not a single dimension but rather a progressively layered system. Understanding this hierarchy is critical for choosing the right benchmark:
Level 1: Single Function Calling
The most basic capability — the model receives a set of function definitions (JSON Schema) and, based on the user’s request, selects the correct function and fills in the parameters. A minimal example follows the list below.
- Core requirements: Correct parameter types, reasonable values, ability to handle enum constraints
- Representative benchmark: BFCL (Simple, Multiple categories)
- Largely solved: Mainstream closed-source models achieve accuracy above 85% at this level
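To make this concrete, here is a minimal sketch of what a Level 1 test item looks like. The schema follows the common JSON Schema tools format; the `get_flight_price` function and the toy validity check are hypothetical illustrations, not any benchmark’s actual harness.

```python
# A Level 1 item: one function definition plus one user request.
# The model must emit a call with the right name and well-typed arguments.
tool = {
    "name": "get_flight_price",  # hypothetical function for illustration
    "parameters": {
        "type": "object",
        "properties": {
            "origin":      {"type": "string"},
            "destination": {"type": "string"},
            "cabin":       {"type": "string", "enum": ["economy", "business"]},
        },
        "required": ["origin", "destination"],
    },
}

user_request = "How much is a business-class ticket from Tokyo to Paris?"

# The structured call the model is expected to produce:
expected_call = {
    "name": "get_flight_price",
    "arguments": {"origin": "Tokyo", "destination": "Paris", "cabin": "business"},
}

def is_well_formed(call: dict, tool: dict) -> bool:
    """Toy check: correct name, required params present, enums respected."""
    props = tool["parameters"]["properties"]
    args = call["arguments"]
    return (
        call["name"] == tool["name"]
        and all(k in args for k in tool["parameters"]["required"])
        and all(k in props for k in args)
        and all(v in props[k]["enum"] for k, v in args.items() if "enum" in props[k])
    )

assert is_well_formed(expected_call, tool)
```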
Level 2: Multi-Turn Tool Use
The model needs to use tools across multiple turns, process tool return values, and decide the next action based on results.
- Core requirements: State tracking, result parsing, conditional branching decisions
- Representative benchmarks: BFCL Multi-Turn, tau-bench
- Significant gap: Even GPT-4o shows notably lower success rates on multi-turn tasks compared to single calls
Level 3: Autonomous Planning + Execution + Error Recovery
The highest level — the model must autonomously decompose complex tasks, formulate execution plans, invoke multiple tools, handle exceptions, and adjust strategies.
- Core requirements: Task decomposition, long-term planning, error recovery, efficiency optimization
- Representative benchmarks: GAIA, WebArena, AgentBench, SWE-bench (Agent mode)
- Far from solved: On GAIA, human success rate is 92%, while the initial GPT-4 plugin mode achieved only 15%; on WebArena, the best GPT-4 Agent scored only 14.41% (human 78.24%)
Evaluation Dimensions
Regardless of the level, Agent evaluation revolves around four core dimensions:
| Dimension | What It Measures | Typical Metrics |
|---|---|---|
| Call Accuracy | Selecting the correct function, filling in correct parameters | AST match rate, execution pass rate |
| Task Completion Rate | End-to-end fulfillment of user intent | Success rate, partial completion rate |
| Efficiency | Completing tasks with minimal steps/tokens | Average steps, token consumption |
| Robustness | Ability to recover from anomalies | Error recovery rate, hallucinated call rate |
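To make these four dimensions concrete, here is a minimal sketch of how they might be computed from logged agent runs. The `Trajectory` fields are assumptions for illustration; real harnesses log far richer traces.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One logged agent run; all fields are illustrative assumptions."""
    calls_made: int          # total tool calls emitted
    calls_correct: int       # calls matching the reference (name + arguments)
    calls_hallucinated: int  # calls to tools that do not exist
    errors_seen: int         # tool errors encountered mid-run
    errors_recovered: int    # errors the agent worked around
    steps: int               # total steps taken
    succeeded: bool          # end-to-end task success

def summarize(runs: list[Trajectory]) -> dict[str, float]:
    n = len(runs)
    calls = sum(r.calls_made for r in runs) or 1   # avoid division by zero
    errors = sum(r.errors_seen for r in runs) or 1
    return {
        "call_accuracy":     sum(r.calls_correct for r in runs) / calls,
        "task_success_rate": sum(r.succeeded for r in runs) / n,
        "avg_steps":         sum(r.steps for r in runs) / n,
        "error_recovery":    sum(r.errors_recovered for r in runs) / errors,
        "hallucinated_rate": sum(r.calls_hallucinated for r in runs) / calls,
    }
```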
Major Benchmarks at a Glance
The Agent evaluation ecosystem is already quite rich. Organized by evaluation focus, there are three major categories:
Function Calling
- BFCL (Berkeley Function Calling Leaderboard): The most authoritative function calling evaluation, covering single calls, multi-function selection, parallel calls, relevance detection, and multi-turn interaction. Version 3 includes 4,441 test cases. Detailed in the deep dive below.
- Gorilla: BFCL’s predecessor, focused on testing API call generation capabilities.
Web Agent
- WebArena: Tests autonomous Agents in 4 types of real website environments: e-commerce, forums, code collaboration, and content management. GPT-4’s best performance was only 14.41% (human 78.24%), revealing the enormous gap between current models and humans in real web operations.
- VisualWebArena: A multimodal extension of WebArena, adding visual understanding requirements.
General Agent
- GAIA (General AI Assistants): 466 questions, designed with the philosophy of “easy for humans, hard for AI.” Human success rate is 92%, while GPT-4 plugin mode initially scored only 15%. Detailed in the deep dive below.
- tau-bench: Tests Agent multi-turn tool use in simulated customer service scenarios, emphasizing state tracking and error recovery.
- AgentBench: Comprehensive evaluation across 8 types of environments, including operating systems, databases, knowledge graphs, the web, and more.
- SWE-bench (Agent mode): While primarily a code benchmark, its Agent framework mode (e.g., SWE-Agent) essentially evaluates an Agent’s ability to navigate and fix code repositories — see the Code Benchmarks article.
[Radar chart: Agent capability profiles of representative models across the evaluation dimensions above, with scores normalized and intended for qualitative comparison only. Notable patterns: closed-source models lead in function calling and planning, while open-source models are competitive on efficiency.]
Trend: From “Can It Call One API Correctly?” to “Can It Autonomously Complete Complex Tasks?”
The evolution of Agent evaluation follows a pattern similar to code evaluation — evaluation granularity keeps increasing:
- 2023: Gorilla established the API call generation evaluation paradigm. The problems were “clean” — given function definitions, just fill in the parameters correctly. GAIA and WebArena were released the same year, pushing evaluation toward real-world task scenarios.
- 2024: BFCL v1 (February 2024) officially launched and rapidly iterated to v3, adding multi-turn interaction, missing parameters, and hallucination detection. tau-bench was also released that year, focusing on long-horizon state management and error recovery.
- 2025: Agent evaluation continues moving toward real-world application scenarios — this is the weakest link for current models.
The core trend is: Single calls are approaching saturation; the gap lies in multi-step, long-horizon scenarios that require error correction. This also explains why the absolute scores on GAIA and WebArena are far lower than on BFCL — they test higher-level capabilities.
Deep Dive 1: BFCL — Systematic Function Calling Evaluation
Why Deep Dive into BFCL?
BFCL (Berkeley Function Calling Leaderboard) is currently the most systematic and frequently updated function calling evaluation. Its strengths include:
- Comprehensive coverage: From single calls to multi-turn interaction, from simple parameters to complex nesting
- Rigorous evaluation: Dual verification through AST matching + executable validation
- Continuous updates: Evolving from v1 to v4, reflecting changes in real-world requirements
- Community standard: Virtually every new model release reports BFCL scores
Evaluation Framework
BFCL’s core design decomposes function calling into four progressive categories:
- Simple (Single Function Call): Given one function, fill in the parameters correctly. The most basic test.
- Multiple (Multi-Function Selection): Multiple candidate functions are provided, and the model must select the correct one. Tests the ability to “understand intent -> match function.”
- Parallel (Parallel Calls): The user’s request requires calling multiple functions simultaneously. Tests the ability to “identify parallel requirements.”
- Relevance (Relevance Detection): Function definitions are provided but the user’s request is unrelated. The model should refuse to call any function and reply directly. Tests the ability to “not call when you shouldn’t.”
For example, in the Simple category the model receives a function definition and must produce a single correct call:

```json
{
  "name": "get_weather",
  "parameters": {
    "location": { "type": "string" },
    "unit": { "type": "string", "enum": ["celsius", "fahrenheit"] }
  }
}
```

Expected output:

```
get_weather(location="Beijing", unit="celsius")
```
The Relevance category deserves special attention: it is a weak point for many models, which tend toward “over-calling” (hallucinated function calls) when none of the provided functions actually fits the request.
Scoring Methods
BFCL uses two complementary evaluation methods:
- AST Matching: Parses the model’s output into an abstract syntax tree and compares it against the standard answer. Function name, parameter names, parameter values, and parameter types must all match exactly. This is the strictest method; a minimal sketch follows the list.
- Executable Validation: Actually executes the model-generated function call and checks whether the return result is correct. This tolerates different parameter formats (e.g., `"Beijing"` vs `"beijing"`), as long as the final execution result is consistent.
Version 3 also introduced dedicated metrics for multi-turn evaluation:
- State-based evaluation: Compares whether the system state after function calls matches expectations (especially important for write/delete operations); see the sketch after this list
- Subset matching: Allows the model to reach the correct result through different paths, as long as the final result contains all necessary information
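State-based evaluation can be sketched as a comparison of environment snapshots rather than call strings. The file-system state below is a hypothetical example, not BFCL’s actual API.

```python
def state_match(final_state: dict, expected_state: dict) -> bool:
    """State-based check: ignore the path the agent took; only the
    resulting environment matters (crucial for write/delete operations)."""
    return all(final_state.get(k) == v for k, v in expected_state.items())

# Hypothetical task: "delete draft.txt, then rename report_v2.txt to report.txt"
expected  = {"files": ["notes.md", "report.txt"]}
after_run = {"files": ["notes.md", "report.txt"], "cwd": "/home/user"}
print(state_match(after_run, expected))  # True: how it got there is irrelevant
```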
Data Scale and Structure
BFCL V3 contains 4,441 test cases:
- Non-live single-turn: 1,390 cases
- Live single-turn: 2,251 cases
- Multi-turn: 800 cases (4 categories x 200, including the base category)
- Hallucination detection: 240 cases (relevance tests, counted within the single-turn sets above)
The augmented multi-turn scenarios are particularly noteworthy — they simulate difficult situations encountered in real usage: missing parameters (the user did not provide complete information), missing functions (the needed tool is not available), long context (large amounts of irrelevant information as distraction), and compound scenarios (all three of the above simultaneously).
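As an illustration of the augmented scenarios, a “missing parameters” case might look like the following. The structure and the `book_meeting_room` tool are hypothetical, not BFCL’s actual data format.

```python
# Hypothetical "missing parameters" multi-turn case: the first user turn
# omits required arguments, so a correct agent must ask a follow-up
# question instead of guessing or hallucinating values.
scenario = {
    "tools": ["book_meeting_room(room_id, start_time, duration_minutes)"],
    "turns": [
        {"user": "Book me a meeting room for tomorrow at 10am."},
        # Expected: the agent asks which room and for how long,
        # rather than inventing a room_id.
        {"user": "Room 4B, one hour."},
        # Expected: book_meeting_room(room_id="4B", start_time="10:00",
        #                             duration_minutes=60)
    ],
}
```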
Known Limitations
Despite its strengths, BFCL has clear limitations:
- Pre-defined function definitions: In real scenarios, Agents may need to discover available tools on their own, rather than selecting from a predefined list
- Evaluation granularity: AST matching is overly strict — sometimes parameter values are semantically equivalent but differ in format (e.g., the date `2025-01-16` vs `Jan 16, 2025`)
- Does not test tool result handling: Evaluation stops at “was the call correct” and does not cover “what to do with the result”
- Static evaluation: Lacks adaptability testing in dynamic environments — real Agents need to adjust strategies based on tool returns
Deep Dive 2: GAIA — The “Easy for Humans, Hard for AI” Philosophy
Why Deep Dive into GAIA?
GAIA represents the other extreme of Agent evaluation — rather than testing whether a model can call a single function correctly, it tests whether a model can integrate multiple capabilities like a human to complete real-world tasks. Its design philosophy directly challenges the inertia of “score-chasing equals progress.”
Dataset Design
GAIA contains 466 questions, each with a clear, verifiable answer (usually a number, name, or fact). Questions are divided into three difficulty levels:
Level 1 (Basic, approximately 31% of questions) — typically requires 1-2 steps:
“What is the population of the birth city of the 2023 Nobel Prize in Physics laureate?”
Requires: Search -> find the person’s name -> search their birthplace -> look up population data
Level 2 (Intermediate, approximately 53%) — requires 3-5 steps, involving multiple tools:
“Download this Excel file, calculate the average of the third column, then find the prime number closest to that value”
Requires: File download -> data processing -> mathematical calculation -> primality check
Level 3 (Hard, approximately 16%) — requires complex reasoning + multiple tools + long-horizon planning:
Requires cross-referencing information across multiple websites, processing multimodal content, and performing multi-step logical reasoning
The “Easy for Humans, Hard for AI” Design Philosophy
This is GAIA’s most insightful design decision. Its paper explicitly states:
“GAIA’s philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans.”
Traditional benchmarks (like MMLU, MATH) pursue “harder is better” — the questions are difficult for humans too, requiring specialized knowledge. GAIA takes the opposite approach: it designs problems that an ordinary person could solve in minutes with a search engine.
Human success rate: 92%. But the initial GPT-4 plugin mode scored only 15%.
This enormous gap reveals an important truth: The current bottleneck for AI is not knowledge reserves but the execution ability to “string simple steps together.” Operations that humans take for granted — opening a webpage, glancing at a table, remembering intermediate results, adjusting search keywords — remain enormous challenges for AI Agents.
Evaluation Method
GAIA’s evaluation is extremely straightforward: answers are typically short strings, and exact matching suffices. There are no partial scores — either completely correct or wrong. This “all-or-nothing” evaluation ensures that results cannot be gamed: the model either truly completed the task or it did not.
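A minimal sketch of this all-or-nothing scoring, assuming light normalization (case, whitespace, thousands separators) before comparison; GAIA’s official scorer differs in details but follows the same principle.

```python
import re

def normalize(answer: str) -> str:
    """Light normalization before exact match (an assumed policy)."""
    a = answer.strip().lower()
    a = re.sub(r"\s+", " ", a)  # collapse whitespace
    a = a.replace(",", "")      # "2,012,431" -> "2012431"
    return a.rstrip(".")

def score(prediction: str, gold: str) -> int:
    """All-or-nothing: 1 for an exact (normalized) match, else 0."""
    return int(normalize(prediction) == normalize(gold))

print(score("  Paris. ", "paris"))            # 1
print(score("about 2 million", "2,012,431"))  # 0: close earns nothing
```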
Of the 466 questions, 166 are used for the public development set (answers available) and 300 for the hidden test set (to prevent overfitting).
Implications for the Evaluation Ecosystem
GAIA’s approach has far-reaching implications for the entire LLM evaluation landscape:
- Difficulty does not equal value: Testing “things a normal person can do” may have more diagnostic value than testing “things only experts can do”
- End-to-end matters more than step-by-step: GAIA does not check intermediate steps, only the final answer — this is closer to what users actually care about
- Compositional ability matters more than individual ability: Each sub-step is simple, but combining them creates a blind spot for AI
From Agent Evaluation to Model Selection
This article covered the core dimensions of Agent and Tool Use evaluation. Let us recap the key takeaways:
- Progressive hierarchy: From single function calling to autonomous planning and execution, evaluation difficulty increases exponentially
- BFCL’s systematization: Four test categories x two scoring methods, providing the finest-grained measurement of function calling capability
- GAIA’s philosophy: Using “simple questions” to expose AI’s “execution ability” shortcomings, with the 92% human vs 15% AI gap being highly illuminating
- The real gap: Single calls are approaching saturation, but multi-step Agent scenarios remain far from solved
In the next article, we will move on to Standard Benchmark Set for Model Releases — when a model publishes a technical report, which benchmarks does it typically report? What do these “standard” metrics reflect?
Further Reading
- Berkeley Function Calling Leaderboard — BFCL official leaderboard, continuously updated
- GAIA Paper (arXiv:2311.12983) — Detailed explanation of the “easy for humans, hard for AI” design philosophy
- WebArena Paper (arXiv:2307.13854) — Agent evaluation in real web environments
- tau-bench — Multi-turn Agent evaluation in simulated customer service scenarios
- AgentBench (arXiv:2308.03688) — Comprehensive Agent evaluation across 8 environments