BFCL Practical Guide
Updated 2026-04-16
Agent Benchmarks §Level 1 “Single-shot Function Calling” placed BFCL in the agent capability hierarchy — from single-shot function calling (Level 1) through multi-turn tool use to fully agentic systems. This article focuses on the practical side: the V1–V4 category taxonomy, the AST vs Exec scoring mechanisms, and the Handler architecture for integrating custom models. By the end, you should be able to run BFCL, understand what each category tests, and know how to extend it to new models.
1 The Category Taxonomy
BFCL has evolved through four major versions, each adding a group of categories that extends coverage from “single-shot function calling” toward “agentic tasks.”
1.1 V1–V4 Evolution
| Generation | New Categories | Core Capability |
|---|---|---|
| V1 | simple_python / simple_java / simple_javascript, parallel, multiple, parallel_multiple, irrelevance | Basic function calling: single call (3 languages) / parallel / multi-candidate + refusal detection |
| V2 | live_simple, live_parallel, live_multiple, live_parallel_multiple, live_irrelevance, live_relevance | Community-contributed real user prompts, generalization |
| V3 | multi_turn_base, multi_turn_miss_func, multi_turn_miss_param, multi_turn_long_context | Multi-turn state tracking: base / missing function / missing parameter / long context |
| V4 | web_search_base, web_search_no_snippet, memory_kv, memory_vector, memory_rec_sum, format_sensitivity | Agentic evaluation: search tool use, memory management, format robustness |
1.2 V4 Total-Score Weight Distribution
The V4 total score aggregates categories with weights (from source code [10, 10, 10, 30, 40]):
- 10% — non_live (V1 base categories)
- 10% — live (V2 categories)
- 10% — irrelevance (V1/V2 refusal categories)
- 30% — multi_turn (V3 categories)
- 40% — agentic (V4 search + memory categories)
- `format_sensitivity` is unscored and reported separately as a standalone metric.
V4’s big shift is raising multi_turn and agentic weights — reflecting community consensus that “basic function calling is largely solved; the hard problems are multi-turn state tracking and agent tasks.”
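The aggregation above can be sketched in a few lines. This is a hypothetical re-implementation for illustration — the group names and normalization mirror the weights listed above, not BFCL's actual source:

```python
# Illustrative sketch of the V4 weighted total (not BFCL's actual code).
def v4_total(scores: dict[str, float]) -> float:
    """scores: per-group accuracy in [0, 1]; returns the weighted total."""
    weights = {"non_live": 10, "live": 10, "irrelevance": 10,
               "multi_turn": 30, "agentic": 40}
    return sum(scores[g] * w for g, w in weights.items()) / sum(weights.values())

# A model strong on basics but weak on agentic tasks is pulled down hard:
print(v4_total({"non_live": 0.9, "live": 0.8, "irrelevance": 0.7,
                "multi_turn": 0.5, "agentic": 0.4}))  # 0.55
```

Note how 70% of the total rides on the multi-turn and agentic groups, which is exactly the reweighting V4 introduced.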
1.3 Category Deep Dive
- `simple_python` / `simple_java` / `simple_javascript` — one function, fill parameters correctly, across three languages. JavaScript/Java AST rules differ from Python (parameter naming, default-value handling).
- `parallel` — multiple independent function calls; order-agnostic.
- `multiple` — pick one correct function from candidates; both selection and parameter filling must be correct.
- `parallel_multiple` — combination of the above: select multiple correct functions from candidates and call them in parallel.
- `irrelevance` — the available functions cannot answer the query; the model should refuse (return natural language rather than fabricate a call). Note the name is `irrelevance`, not `relevance`.
- `multi_turn_miss_func` — a function is removed mid-conversation; the model should detect this and inform the user.
- `multi_turn_miss_param` — the user hasn’t provided a required parameter; the model should ask rather than invent default values.
- `web_search_base` / `web_search_no_snippet` — search tool usage with / without snippets. The latter is harder — the model must decide whether to search again.
- `memory_kv` / `memory_vector` / `memory_rec_sum` — three memory strategies: key-value, vector retrieval, recursive summarization. Tests model performance with a memory system as an aid.
2 Data Format and Scoring
2.1 JSONL Structure
One JSON object per line:
```json
{
  "id": "simple_0",
  "question": [{ "role": "user", "content": "Find me the nearest hospital within 5km" }],
  "function": [
    {
      "name": "find_hospital",
      "description": "Find nearby hospitals",
      "parameters": {
        "type": "dict",
        "properties": {
          "location": { "type": "string" },
          "radius": { "type": "integer" }
        },
        "required": ["location"]
      }
    }
  ]
}
```
Two key details:
- `function[].parameters` is a native JSON object, not a string — access fields directly; no `json.loads()` needed.
- The `type` field uses `"dict"` rather than OpenAI’s standard `"object"`. Custom handlers must convert this before calling OpenAI’s API (BFCL includes the conversion internally, but roll-your-own handlers must handle it).
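To make the second point concrete, a minimal conversion helper might look like this. The helper name is hypothetical, and the sketch rewrites only the top-level `type` — nested schemas would need a recursive walk:

```python
# Hypothetical helper: adapt a BFCL function entry to OpenAI's tools format.
def bfcl_to_openai_tool(func: dict) -> dict:
    params = dict(func["parameters"])
    if params.get("type") == "dict":
        params["type"] = "object"  # OpenAI expects "object"; BFCL stores "dict"
    return {"type": "function",
            "function": {"name": func["name"],
                         "description": func.get("description", ""),
                         "parameters": params}}

tool = bfcl_to_openai_tool({
    "name": "find_hospital",
    "description": "Find nearby hospitals",
    "parameters": {"type": "dict",
                   "properties": {"location": {"type": "string"},
                                  "radius": {"type": "integer"}},
                   "required": ["location"]},
})
print(tool["function"]["parameters"]["type"])  # object
```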
2.2 AST Scoring
AST (Abstract Syntax Tree) scoring is BFCL’s default mechanism. It doesn’t actually execute the function; it compares structure:
1. Parse the model output into an AST (function name + argument dict).
2. Function-name match — tolerates dot-to-underscore replacement (OpenAI/Mistral/Google auto-rename `api.search` to `api_search`).
3. Argument completeness — required arguments must be present; extra arguments fail (keys not in the schema are invalid).
4. Type correctness — with tolerant rules: `int → float` ✓, `tuple → list` ✓.
5. Value match — strings ignore whitespace and common punctuation differences.
AST’s strengths: fast, no real execution environment needed, unaffected by external APIs. Its weakness: it can’t detect “correct structure but wrong value” (e.g. a city name in the wrong country).
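A toy version of the structural comparison might look like this. It is a sketch under simplified assumptions — exact value equality, no per-category tolerance lists, only the name/completeness rules named above:

```python
# Toy AST-style matcher (illustrative only, not BFCL's scorer).
def ast_match(pred: dict, gold: dict, schema: dict) -> bool:
    # Name match, tolerating dot-to-underscore renames.
    if pred["name"].replace(".", "_") != gold["name"].replace(".", "_"):
        return False
    args, props = pred["arguments"], schema["properties"]
    if any(k not in props for k in args):                       # extra args fail
        return False
    if any(k not in args for k in schema.get("required", [])):  # missing required args fail
        return False
    return all(args[k] == gold["arguments"].get(k) for k in args)

schema = {"properties": {"location": {}, "radius": {}}, "required": ["location"]}
gold = {"name": "find_hospital", "arguments": {"location": "NYC", "radius": 5}}
print(ast_match({"name": "find.hospital",
                 "arguments": {"location": "NYC", "radius": 5}}, gold, schema))  # True
print(ast_match({"name": "find_hospital",
                 "arguments": {"city": "NYC"}}, gold, schema))  # False
```

The first call passes despite the dotted name; the second fails on both an out-of-schema key and a missing required argument.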
2.3 Exec Scoring
Some categories (like `rest` and executable variants) use Exec scoring:

1. Convert the model output into an executable string, e.g. `find_hospital(location="NYC", radius=5)`.
2. Actually execute the function call.
3. Compare the return value against the expected output.
Exec catches cases AST misses — “format correct but runtime failure” (e.g. API doesn’t exist, parameter value causes runtime error).
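A bare-bones version of that three-step loop is sketched below. Real Exec categories run sandboxed functions and compare richer outputs; the `exec_match` name and the stand-in `find_hospital` implementation are illustration-only:

```python
# Illustrative Exec-style check: build the call string, run it, compare output.
def exec_match(call_str: str, expected, namespace: dict) -> bool:
    try:
        result = eval(call_str, {"__builtins__": {}}, namespace)  # step 2: execute
    except Exception:
        return False  # runtime failure: Exec fails even if AST passed
    return result == expected  # step 3: compare return values

def find_hospital(location, radius=10):  # stand-in implementation
    return f"hospitals near {location} within {radius}km"

ns = {"find_hospital": find_hospital}
print(exec_match('find_hospital(location="NYC", radius=5)',
                 "hospitals near NYC within 5km", ns))  # True
print(exec_match('find_hospital(radius=5)', "anything", ns))  # False (missing arg)
```

The second call is exactly the "format correct but runtime failure" case: the string parses fine, but execution raises a `TypeError`.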
2.4 AST vs Exec Scenarios
- AST ✓ + Exec ✓ — arguments correct and execution correct (ideal)
- AST ✓ + Exec ✗ — format correct but the value causes a runtime error (e.g. missing API endpoint, expired token)
- AST ✗ + Exec ✓ — arguments don’t match schema but execution happens to succeed (rare; likely a scorer bug)
3 Handler Architecture and Model Integration
3.1 Two Evaluation Modes
BFCL splits models into two evaluation modes:
- FC (Function Calling) mode — models with native tool calling (OpenAI, Claude, Gemini, etc.). Functions are passed through the API’s `tools` parameter; the model returns structured results.
- Prompt mode — models without native tool calling. Function definitions are serialized into the prompt text; function calls are parsed from the text output (via regex or custom parsers).
The same model may score 10–20% differently across the two modes — always label the mode when reporting BFCL scores.
3.2 Handler Inheritance
```
BaseHandler              # base class for API models
├── OpenAIHandler
├── ClaudeHandler
├── MistralHandler
├── GoogleHandler
└── CohereHandler

OSSHandler               # base for open-source / local models, extends BaseHandler
└── various OSS handlers
```
3.3 Core Methods
Each handler implements two parsing methods:
- `decode_ast(result)` — raw API response → structured dict (`{"name": ..., "arguments": {...}}`). Used by AST scoring.
- `decode_execute(result)` — raw API response → executable string (`func(a=1, b="x")`). Used by Exec scoring.
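For a Prompt-mode model that emits Python-style call text, the two methods could be sketched as follows. The class name and the list-shaped return values are assumptions for illustration, not BFCL's exact base-class interface:

```python
import ast

# Hypothetical handler sketch; only the two decode methods are shown.
class MyModelHandler:
    def decode_ast(self, result: str) -> list[dict]:
        """'f(a=1, b="x")' text -> [{"name": ..., "arguments": {...}}] for AST scoring."""
        call = ast.parse(result, mode="eval").body
        return [{"name": ast.unparse(call.func),
                 "arguments": {kw.arg: ast.literal_eval(kw.value)
                               for kw in call.keywords}}]

    def decode_execute(self, result: str) -> list[str]:
        """Raw text -> executable call strings for Exec scoring."""
        return [result.strip()]

h = MyModelHandler()
print(h.decode_ast('find_hospital(location="NYC", radius=5)'))
# [{'name': 'find_hospital', 'arguments': {'location': 'NYC', 'radius': 5}}]
```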
3.4 Custom Handler in Four Steps
1. Create a handler file under `model_handler/api_inference/` or `local_inference/`.
2. Inherit from `BaseHandler` (API) or `OSSHandler` (local / open-source); implement `decode_ast` and `decode_execute`.
3. Register a `ModelConfig` in `constants/model_config.py` (name, org, handler class, `is_fc_model` flag, pricing, etc.).
4. Update `SUPPORTED_MODELS.md` and `supported_models.py`.
After these four steps, your new model works with `bfcl generate --model <your_model>`.
4 Running Evaluation and Common Pitfalls
4.1 CLI Workflow
```shell
# Install
pip install bfcl-eval
export BFCL_PROJECT_ROOT=/path/to/project

# Generate model outputs (API models)
bfcl generate --model gpt-4o --test-category simple

# Generate model outputs (open-source, via vllm backend)
bfcl generate --model meta-llama/Llama-3-8B \
    --backend vllm --num-gpus 2

# Score
bfcl evaluate --model gpt-4o --test-category simple
```
`--test-category` accepts a single category (`simple`), multiple categories (`simple,parallel`), or a whole generation (`all`).

API key configuration — put a `.env` file under `$BFCL_PROJECT_ROOT` with variables like `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.
4.2 Common Pitfalls (5)
- Dot-to-underscore replacement — OpenAI / Mistral / Google auto-rename `.` to `_` in function names (tokenizer / API schema constraints). The scorer handles this, but custom handlers should not manually re-replace in `decode_ast` — double replacement breaks scoring.
- Tuple serialization — JSON lacks a tuple type; a Python `tuple` round-trips as a `list`. The scorer normalizes this, but custom scoring code must treat them as equivalent.
- `type: "dict"` vs `"object"` — BFCL uses `"type": "dict"` in `parameters`, while OpenAI’s standard is `"type": "object"`. Convert before calling the API.
- FC vs Prompt score divergence — the same model can differ by 10–20% across modes. Always label the mode in reports, and before comparing with someone else’s scores, check which mode they used.
- V4 weight changes — V4 substantially reweighted categories (agentic → 40%). V4 totals are not directly comparable to V3 totals — they measure different things.
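The tuple pitfall is easy to guard against with a recursive normalizer. This is a sketch, not BFCL's internal routine:

```python
# Sketch: normalize tuples to lists (recursively) before comparing values,
# so that a Python tuple and its JSON round-trip list compare equal.
def normalize(v):
    if isinstance(v, (tuple, list)):
        return [normalize(x) for x in v]
    if isinstance(v, dict):
        return {k: normalize(x) for k, x in v.items()}
    return v

print(normalize((1, 2, (3, 4))) == [1, 2, [3, 4]])  # True
print(normalize({"a": (1,)}) == {"a": [1]})          # True
```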
5 Summary
- Choosing FC vs Prompt — prefer FC mode for models with native tool calling (higher scores, more stable); use Prompt mode otherwise.
- V4 agentic evaluation — BFCL grew from “can it call a function?” to “can it complete an agent task?” Web search, memory management, and format robustness are all core agent capabilities.
- Complement to other agent benchmarks — BFCL measures function-calling precision, GAIA measures end-to-end task completion, WebArena measures web interaction. Three different dimensions, none substitutes for the others.
For BFCL’s place in the agent capability hierarchy, see Agent Benchmarks §Level 1. For the benchmark ecosystem as a whole, see Benchmark Landscape.