Agent & Tool Use Benchmarks
Updated 2026-04-14
Introduction: How Do We Systematically Evaluate API Calling, Browser Operation, and Multi-Step Task Completion?
When OpenAI announced that GPT-4 supports function calling, LLMs leaped from “conversational assistants” to “tool users.” When Anthropic enabled Claude to operate computers and Google let Gemini invoke search and code executors, models further evolved into “autonomous agents.”
But this raises the question: when a model claims it can use tools, how do we verify that objectively? How often does it call the right API? On a multi-step task, does it go off track by step three? When a tool returns an error, can it recover autonomously?
In the previous article, we dove deep into code evaluation — from HumanEval to the evolution of SWE-bench. Now we enter Agent and Tool Use evaluation — the dimension of LLM evaluation closest to “real-world application scenarios.” This article will answer:
- What levels can Agent capabilities be broken into? What evaluation approach corresponds to each level?
- How does BFCL (Berkeley Function Calling Leaderboard) systematically evaluate function calling?
- Why is GAIA designed to be “easy for humans, hard for AI”?
- From WebArena to tau-bench, how does Agent evaluation in interactive environments work?
The Hierarchy of Agent Capabilities
Agent capability is not a single dimension but rather a progressively layered system. Understanding this hierarchy is critical for choosing the right benchmark:
Level 1: Single Function Calling
The most basic capability — the model receives a set of function definitions (JSON Schema) and, based on the user’s request, selects the correct function and fills in the parameters. A minimal example follows the list below.
- Core requirements: Correct parameter types, reasonable values, ability to handle enum constraints
- Representative benchmark: BFCL (Simple, Multiple categories)
- Largely solved: Mainstream closed-source models achieve accuracy above 85% at this level
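To make this concrete, here is a minimal sketch of what a Level 1 test item looks like. The schema follows the common JSON Schema tools format; the `get_flight_price` function and the toy validity check are hypothetical illustrations, not any benchmark’s actual harness.

```python
# A Level 1 item: one function definition plus one user request.
# The model must emit a call with the right name and well-typed arguments.
tool = {
    "name": "get_flight_price",  # hypothetical function for illustration
    "parameters": {
        "type": "object",
        "properties": {
            "origin":      {"type": "string"},
            "destination": {"type": "string"},
            "cabin":       {"type": "string", "enum": ["economy", "business"]},
        },
        "required": ["origin", "destination"],
    },
}

user_request = "How much is a business-class ticket from Tokyo to Paris?"

# The structured call the model is expected to produce:
expected_call = {
    "name": "get_flight_price",
    "arguments": {"origin": "Tokyo", "destination": "Paris", "cabin": "business"},
}

def is_well_formed(call: dict, tool: dict) -> bool:
    """Toy check: correct name, required params present, enums respected."""
    props = tool["parameters"]["properties"]
    args = call["arguments"]
    return (
        call["name"] == tool["name"]
        and all(k in args for k in tool["parameters"]["required"])
        and all(k in props for k in args)
        and all(v in props[k]["enum"] for k, v in args.items() if "enum" in props[k])
    )

assert is_well_formed(expected_call, tool)
```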
Level 2: Multi-Turn Tool Use
The model needs to use tools across multiple turns, process tool return values, and decide the next action based on results.
- Core requirements: State tracking, result parsing, conditional branching decisions
- Representative benchmarks: BFCL Multi-Turn, tau-bench
- Significant gap: Even GPT-4o shows notably lower success rates on multi-turn tasks compared to single calls
Level 3: Autonomous Planning + Execution + Error Recovery
The highest level — the model must autonomously decompose complex tasks, formulate execution plans, invoke multiple tools, handle exceptions, and adjust strategies.
- Core requirements: Task decomposition, long-term planning, error recovery, efficiency optimization
- Representative benchmarks: GAIA, WebArena, AgentBench, SWE-bench (Agent mode)
- Far from solved: On GAIA, human success rate is 92%, while the initial GPT-4 plugin mode achieved only 15%; on WebArena, the best GPT-4 Agent scored only 14.41% (human 78.24%)
Evaluation Dimensions
Regardless of the level, Agent evaluation revolves around four core dimensions:
| Dimension | What It Measures | Typical Metrics |
|---|---|---|
| Call Accuracy | Selecting the correct function, filling in correct parameters | AST match rate, execution pass rate |
| Task Completion Rate | End-to-end fulfillment of user intent | Success rate, partial completion rate |
| Efficiency | Completing tasks with minimal steps/tokens | Average steps, token consumption |
| Robustness | Ability to recover from anomalies | Error recovery rate, hallucinated call rate |
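To make these four dimensions concrete, here is a minimal sketch of how they might be computed from logged agent runs. The `Trajectory` fields are assumptions for illustration; real harnesses log far richer traces.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One logged agent run; all fields are illustrative assumptions."""
    calls_made: int          # total tool calls emitted
    calls_correct: int       # calls matching the reference (name + arguments)
    calls_hallucinated: int  # calls to tools that do not exist
    errors_seen: int         # tool errors encountered mid-run
    errors_recovered: int    # errors the agent worked around
    steps: int               # total steps taken
    succeeded: bool          # end-to-end task success

def summarize(runs: list[Trajectory]) -> dict[str, float]:
    n = len(runs)
    calls = sum(r.calls_made for r in runs) or 1   # avoid division by zero
    errors = sum(r.errors_seen for r in runs) or 1
    return {
        "call_accuracy":     sum(r.calls_correct for r in runs) / calls,
        "task_success_rate": sum(r.succeeded for r in runs) / n,
        "avg_steps":         sum(r.steps for r in runs) / n,
        "error_recovery":    sum(r.errors_recovered for r in runs) / errors,
        "hallucinated_rate": sum(r.calls_hallucinated for r in runs) / calls,
    }
```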
Major Benchmarks at a Glance
The Agent evaluation ecosystem is already quite rich. Organized by evaluation focus, there are three major categories:
Function Calling
- BFCL (Berkeley Function Calling Leaderboard): The most authoritative function calling evaluation, covering single calls, multi-function selection, parallel calls, relevance detection, and multi-turn interaction. Version 3 includes 4,441 test cases. Detailed in the deep dive below.
- Gorilla: BFCL’s predecessor, focused on testing API call generation capabilities.
Web Agent
- WebArena: Tests autonomous Agents in 4 types of real website environments: e-commerce, forums, code collaboration, and content management. GPT-4’s best performance was only 14.41% (human 78.24%), revealing the enormous gap between current models and humans in real web operations.
- VisualWebArena: A multimodal extension of WebArena, adding visual understanding requirements.
General Agent
- GAIA (General AI Assistants): 466 questions, designed with the philosophy of “easy for humans, hard for AI.” Human success rate is 92%, while GPT-4 plugin mode initially scored only 15%. Detailed in the deep dive below.
- tau-bench: Tests Agent multi-turn tool use in simulated customer service scenarios, emphasizing state tracking and error recovery.
- AgentBench: Comprehensive evaluation across 8 types of environments, including operating systems, databases, knowledge graphs, the web, and more.
- SWE-bench (Agent mode): While primarily a code benchmark, its Agent framework mode (e.g., SWE-Agent) essentially evaluates an Agent’s ability to navigate and fix code repositories — see the Code Benchmarks article.
[Radar chart: Agent capability profiles of representative models across the evaluation dimensions above, with scores normalized and intended for qualitative comparison only. Notable patterns: closed-source models lead in function calling and planning, while open-source models are competitive on efficiency.]
Trend: From “Can It Call One API Correctly?” to “Can It Autonomously Complete Complex Tasks?”
The evolution of Agent evaluation follows a pattern similar to code evaluation — evaluation granularity keeps increasing:
- 2023: Gorilla established the API call generation evaluation paradigm. The problems were “clean” — given function definitions, just fill in the parameters correctly. GAIA and WebArena were released the same year, pushing evaluation toward real-world task scenarios.
- 2024: BFCL v1 (February 2024) officially launched and rapidly iterated to v3, adding multi-turn interaction, missing parameters, and hallucination detection. tau-bench was also released that year, focusing on long-horizon state management and error recovery.
- 2025: Agent evaluation continues moving toward real-world application scenarios — this is the weakest link for current models.
The core trend is: Single calls are approaching saturation; the gap lies in multi-step, long-horizon scenarios that require error correction. This also explains why the absolute scores on GAIA and WebArena are far lower than on BFCL — they test higher-level capabilities.
Deep Dive 1: BFCL — Systematic Function Calling Evaluation
Why Deep Dive into BFCL?
BFCL (Berkeley Function Calling Leaderboard) is currently the most systematic and frequently updated function calling evaluation. Its strengths include:
- Comprehensive coverage: From single calls to multi-turn interaction, from simple parameters to complex nesting
- Rigorous evaluation: Dual verification through AST matching + executable validation
- Continuous updates: Evolving from v1 to v4, reflecting changes in real-world requirements
- Community standard: Virtually every new model release reports BFCL scores
Evaluation Framework
BFCL’s core design decomposes function calling into four progressive categories:
- Simple (Single Function Call): Given one function, fill in the parameters correctly. The most basic test.
- Multiple (Multi-Function Selection): Multiple candidate functions are provided, and the model must select the correct one. Tests the ability to “understand intent -> match function.”
- Parallel (Parallel Calls): The user’s request requires calling multiple functions simultaneously. Tests the ability to “identify parallel requirements.”
- Relevance (Relevance Detection): Function definitions are provided but the user’s request is unrelated. The model should refuse to call any function and reply directly. Tests the ability to “not call when you shouldn’t.”
For example, in the Simple category the model receives a function definition and must produce a single correct call:

```json
{
  "name": "get_weather",
  "parameters": {
    "location": { "type": "string" },
    "unit": { "type": "string", "enum": ["celsius", "fahrenheit"] }
  }
}
```

Expected output:

```
get_weather(location="Beijing", unit="celsius")
```
The Relevance category deserves special attention: it is a weak point for many models, which tend toward “over-calling” (hallucinated function calls) when none of the provided functions actually fits the request.
Scoring Methods
BFCL uses two complementary evaluation methods:
- AST Matching: Parses the model’s output into an abstract syntax tree and compares it against the standard answer. Function name, parameter names, parameter values, and parameter types must all match exactly. This is the strictest method; a minimal sketch follows the list.
- Executable Validation: Actually executes the model-generated function call and checks whether the return result is correct. This tolerates different parameter formats (e.g., `"Beijing"` vs `"beijing"`), as long as the final execution result is consistent.
Version 3 also introduced dedicated metrics for multi-turn evaluation:
- State-based evaluation: Compares whether the system state after function calls matches expectations (especially important for write/delete operations); see the sketch after this list
- Subset matching: Allows the model to reach the correct result through different paths, as long as the final result contains all necessary information
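State-based evaluation can be sketched as a comparison of environment snapshots rather than call strings. The file-system state below is a hypothetical example, not BFCL’s actual API.

```python
def state_match(final_state: dict, expected_state: dict) -> bool:
    """State-based check: ignore the path the agent took; only the
    resulting environment matters (crucial for write/delete operations)."""
    return all(final_state.get(k) == v for k, v in expected_state.items())

# Hypothetical task: "delete draft.txt, then rename report_v2.txt to report.txt"
expected  = {"files": ["notes.md", "report.txt"]}
after_run = {"files": ["notes.md", "report.txt"], "cwd": "/home/user"}
print(state_match(after_run, expected))  # True: how it got there is irrelevant
```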
Data Scale and Structure
BFCL V3 contains 4,441 test cases:
- Non-live single-turn: 1,390 cases
- Live single-turn: 2,251 cases
- Multi-turn: 800 cases (4 categories x 200, including the base category)
- Hallucination detection: 240 cases (relevance tests, counted within the single-turn sets above)
The augmented multi-turn scenarios are particularly noteworthy — they simulate difficult situations encountered in real usage: missing parameters (the user did not provide complete information), missing functions (the needed tool is not available), long context (large amounts of irrelevant information as distraction), and compound scenarios (all three of the above simultaneously).
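As an illustration of the augmented scenarios, a “missing parameters” case might look like the following. The structure and the `book_meeting_room` tool are hypothetical, not BFCL’s actual data format.

```python
# Hypothetical "missing parameters" multi-turn case: the first user turn
# omits required arguments, so a correct agent must ask a follow-up
# question instead of guessing or hallucinating values.
scenario = {
    "tools": ["book_meeting_room(room_id, start_time, duration_minutes)"],
    "turns": [
        {"user": "Book me a meeting room for tomorrow at 10am."},
        # Expected: the agent asks which room and for how long,
        # rather than inventing a room_id.
        {"user": "Room 4B, one hour."},
        # Expected: book_meeting_room(room_id="4B", start_time="10:00",
        #                             duration_minutes=60)
    ],
}
```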
Known Limitations
Despite its strengths, BFCL has clear limitations:
- Pre-defined function definitions: In real scenarios, Agents may need to discover available tools on their own, rather than selecting from a predefined list
- Evaluation granularity: AST matching is overly strict — sometimes parameter values are semantically equivalent but differ in format (e.g., the date `2025-01-16` vs `Jan 16, 2025`)
- Does not test tool result handling: Evaluation stops at “was the call correct” and does not cover “what to do with the result”
- Static evaluation: Lacks adaptability testing in dynamic environments — real Agents need to adjust strategies based on tool returns
Deep Dive 2: GAIA — The “Easy for Humans, Hard for AI” Philosophy
Why Deep Dive into GAIA?
GAIA represents the other extreme of Agent evaluation — rather than testing whether a model can call a single function correctly, it tests whether a model can integrate multiple capabilities like a human to complete real-world tasks. Its design philosophy directly challenges the inertia of “score-chasing equals progress.”
Dataset Design
GAIA contains 466 questions, each with a clear, verifiable answer (usually a number, name, or fact). Questions are divided into three difficulty levels:
Level 1 (Basic, approximately 31% of questions) — typically requires 1-2 steps:
“What is the population of the birth city of the 2023 Nobel Prize in Physics laureate?”
Requires: Search -> find the person’s name -> search their birthplace -> look up population data
Level 2 (Intermediate, approximately 53%) — requires 3-5 steps, involving multiple tools:
“Download this Excel file, calculate the average of the third column, then find the prime number closest to that value”
Requires: File download -> data processing -> mathematical calculation -> primality check
Level 3 (Hard, approximately 16%) — requires complex reasoning + multiple tools + long-horizon planning:
Requires cross-referencing information across multiple websites, processing multimodal content, and performing multi-step logical reasoning
The “Easy for Humans, Hard for AI” Design Philosophy
This is GAIA’s most insightful design decision. Its paper explicitly states:
“GAIA’s philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans.”
Traditional benchmarks (like MMLU, MATH) pursue “harder is better” — the questions are difficult for humans too, requiring specialized knowledge. GAIA takes the opposite approach: it designs problems that an ordinary person could solve in minutes with a search engine.
Human success rate: 92%. But the initial GPT-4 plugin mode scored only 15%.
This enormous gap reveals an important truth: The current bottleneck for AI is not knowledge reserves but the execution ability to “string simple steps together.” Operations that humans take for granted — opening a webpage, glancing at a table, remembering intermediate results, adjusting search keywords — remain enormous challenges for AI Agents.
Evaluation Method
GAIA’s evaluation is extremely straightforward: answers are typically short strings, and exact matching suffices. There are no partial scores — either completely correct or wrong. This “all-or-nothing” evaluation ensures that results cannot be gamed: the model either truly completed the task or it did not.
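A minimal sketch of this all-or-nothing scoring, assuming light normalization (case, whitespace, thousands separators) before comparison; GAIA’s official scorer differs in details but follows the same principle.

```python
import re

def normalize(answer: str) -> str:
    """Light normalization before exact match (an assumed policy)."""
    a = answer.strip().lower()
    a = re.sub(r"\s+", " ", a)  # collapse whitespace
    a = a.replace(",", "")      # "2,012,431" -> "2012431"
    return a.rstrip(".")

def score(prediction: str, gold: str) -> int:
    """All-or-nothing: 1 for an exact (normalized) match, else 0."""
    return int(normalize(prediction) == normalize(gold))

print(score("  Paris. ", "paris"))            # 1
print(score("about 2 million", "2,012,431"))  # 0: close earns nothing
```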
Of the 466 questions, 166 are used for the public development set (answers available) and 300 for the hidden test set (to prevent overfitting).
Implications for the Evaluation Ecosystem
GAIA’s approach has far-reaching implications for the entire LLM evaluation landscape:
- Difficulty does not equal value: Testing “things a normal person can do” may have more diagnostic value than testing “things only experts can do”
- End-to-end matters more than step-by-step: GAIA does not check intermediate steps, only the final answer — this is closer to what users actually care about
- Compositional ability matters more than individual ability: Each sub-step is simple, but combining them creates a blind spot for AI
From Agent Evaluation to Model Selection
This article covered the core dimensions of Agent and Tool Use evaluation. Let us recap the key takeaways:
- Progressive hierarchy: From single function calling to autonomous planning and execution, evaluation difficulty increases exponentially
- BFCL’s systematization: Four test categories x two scoring methods, providing the finest-grained measurement of function calling capability
- GAIA’s philosophy: Using “simple questions” to expose AI’s “execution ability” shortcomings, with the 92% human vs 15% AI gap being highly illuminating
- The real gap: Single calls are approaching saturation, but multi-step Agent scenarios remain far from solved
In the next article, we will move on to Standard Benchmark Set for Model Releases — when a model publishes a technical report, which benchmarks does it typically report? What do these “standard” metrics reflect?
Further Reading
- Berkeley Function Calling Leaderboard — BFCL official leaderboard, continuously updated
- GAIA Paper (arXiv:2311.12983) — Detailed explanation of the “easy for humans, hard for AI” design philosophy
- WebArena Paper (arXiv:2307.13854) — Agent evaluation in real web environments
- tau-bench — Multi-turn Agent evaluation in simulated customer service scenarios
- AgentBench (arXiv:2308.03688) — Comprehensive Agent evaluation across 8 environments