BFCL Practical Guide

Updated 2026-04-16

Agent Benchmarks §Level 1 “Single-shot Function Calling”, together with its BFCL section and the BFCLEvalFlow component, placed BFCL in the agent capability hierarchy, which runs from single-shot function calling (Level 1) through multi-turn tool use to fully agentic systems. This article focuses on the practical side: the V1–V4 category taxonomy, the AST vs Exec scoring mechanisms, and the Handler architecture for integrating custom models. By the end, you should be able to run BFCL, understand what each category tests, and know how to extend it to new models.

1 The Category Taxonomy

BFCL has evolved through four major versions, each adding a group of categories that extends coverage from “single-shot function calling” toward “agentic tasks.”

1.1 V1–V4 Evolution

Generation | New Categories | Core Capability
V1 | simple_python / simple_java / simple_javascript, parallel, multiple, parallel_multiple, irrelevance | Basic function calling: single call (3 languages) / parallel / multi-candidate + refusal detection
V2 | live_simple, live_parallel, live_multiple, live_parallel_multiple, live_irrelevance, live_relevance | Community-contributed real user prompts; generalization
V3 | multi_turn_base, multi_turn_miss_func, multi_turn_miss_param, multi_turn_long_context | Multi-turn state tracking: base / missing function / missing parameter / long context
V4 | web_search_base, web_search_no_snippet, memory_kv, memory_vector, memory_rec_sum, format_sensitivity | Agentic evaluation: search tool use, memory management, format robustness

1.2 V4 Total-Score Weight Distribution

The V4 total score aggregates five category groups, with weights taken from the source code ([10, 10, 10, 30, 40]):

  • 10% — non_live (V1 base categories)
  • 10% — live (V2 categories)
  • 10% — irrelevance (V1/V2 refusal categories)
  • 30% — multi_turn (V3 categories)
  • 40% — agentic (V4 search + memory categories)
  • format_sensitivity is unscored, reported separately as a standalone metric

V4’s big shift is raising multi_turn and agentic weights — reflecting community consensus that “basic function calling is largely solved; the hard problems are multi-turn state tracking and agent tasks.”
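
In code, the V4 aggregation is just a weighted sum over per-group accuracies. A minimal sketch; the group scores here are hypothetical, for illustration only:

weights = {"non_live": 0.10, "live": 0.10, "irrelevance": 0.10,
           "multi_turn": 0.30, "agentic": 0.40}
scores  = {"non_live": 0.92, "live": 0.85, "irrelevance": 0.80,
           "multi_turn": 0.55, "agentic": 0.40}  # hypothetical per-group accuracies
total = sum(weights[g] * scores[g] for g in weights)  # 0.582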

1.3 Category Deep Dive

  • simple_python / simple_java / simple_javascript — one function, fill parameters correctly, across three languages. JavaScript/Java AST rules differ from Python (parameter naming, default-value handling).
  • parallel — multiple independent function calls; order-agnostic.
  • multiple — pick one correct function from candidates; both selection and parameter filling must be correct.
  • parallel_multiple — combination of the above: select multiple correct functions from candidates and call in parallel.
  • irrelevance — the available functions cannot answer the query; the model should refuse (return natural language rather than fabricate a call). Note the name is irrelevance, not relevance.
  • multi_turn_miss_func — a function is removed mid-conversation; the model should detect this and inform the user.
  • multi_turn_miss_param — the user hasn’t provided a required parameter; the model should ask rather than invent default values.
  • web_search_base / web_search_no_snippet — search tool usage with / without snippets. The latter is harder — the model decides whether to search again.
  • memory_kv / memory_vector / memory_rec_sum — three memory strategies: key-value, vector retrieval, recursive summarization. Tests model performance with a memory system as an aid.
[Interactive component: BFCL V1–V4 category matrix with a pie chart of the V4 total-score weights (non_live 10% / live 10% / irrelevance 10% / multi_turn 30% / agentic 40%; format_sensitivity reported separately, unscored)]

Click any category tile for details and example prompts. The pie chart on the right shows the V4 total-score weight distribution.

2 Data Format and Scoring

2.1 JSONL Structure

One JSON object per line:

{
  "id": "simple_0",
  "question": [{ "role": "user", "content": "Find me the nearest hospital within 5km" }],
  "function": [
    {
      "name": "find_hospital",
      "description": "Find nearby hospitals",
      "parameters": {
        "type": "dict",
        "properties": {
          "location": { "type": "string" },
          "radius": { "type": "integer" }
        },
        "required": ["location"]
      }
    }
  ]
}

Two key details:

  • function[].parameters is a native JSON object, not a string — access fields directly, no json.loads() needed.
  • The type field uses "dict" rather than OpenAI’s standard "object". Custom handlers must convert this before calling OpenAI’s API (BFCL includes the conversion internally, but roll-your-own handlers must handle it).
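
If you are rolling your own integration, the conversion is a small recursive rewrite of the schema. A minimal sketch; the function name to_openai_tool is illustrative, not a BFCL API:

import copy

def to_openai_tool(bfcl_function: dict) -> dict:
    """Convert one BFCL function entry into an OpenAI tools-parameter entry."""
    params = copy.deepcopy(bfcl_function["parameters"])

    def fix_types(node):
        # BFCL marks composite parameters as "dict"; OpenAI expects "object"
        if isinstance(node, dict):
            if node.get("type") == "dict":
                node["type"] = "object"
            for value in node.values():
                fix_types(value)
        elif isinstance(node, list):
            for item in node:
                fix_types(item)

    fix_types(params)
    return {"type": "function",
            "function": {"name": bfcl_function["name"],
                         "description": bfcl_function["description"],
                         "parameters": params}}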

2.2 AST Scoring

AST (Abstract Syntax Tree) scoring is BFCL’s default mechanism. It doesn’t actually execute the function; it compares structure:

  1. Parse the model output into an AST (function name + argument dict)
  2. Function name match (tolerates dot-to-underscore replacement — OpenAI/Mistral/Google auto-rename api.search to api_search)
  3. Argument completeness — required arguments must be present; extra arguments fail (keys not in the schema are invalid)
  4. Type correctness (with tolerant rules: int → float ✓, tuple → list ✓)
  5. Value match — strings ignore whitespace and common punctuation differences

AST’s strengths: fast, no real execution environment needed, unaffected by external APIs. Its weakness: it can’t detect “correct structure but wrong value” (e.g. a city name in the wrong country).
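
To make step 1 concrete, here is a minimal sketch of parsing a Python-syntax call with the standard ast module (it mirrors the idea, not BFCL's actual parser):

import ast

def parse_call(output: str) -> dict:
    """Parse a Python-syntax call string into {"name": ..., "arguments": {...}}."""
    call = ast.parse(output, mode="eval").body   # an ast.Call node
    name = ast.unparse(call.func)                # keeps dotted names like api.search intact
    args = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return {"name": name, "arguments": args}

parse_call('find_hospital(location="New York", radius=5)')
# {'name': 'find_hospital', 'arguments': {'location': 'New York', 'radius': 5}}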

2.3 Exec Scoring

Some categories (like rest and executable variants) use Exec scoring:

  1. Convert the model output into an executable string, e.g. find_hospital(location="NYC", radius=5)
  2. Actually execute the function call
  3. Compare the return value against expected output

Exec catches cases AST misses — “format correct but runtime failure” (e.g. API doesn’t exist, parameter value causes runtime error).
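
A minimal sketch of that path, assuming the call string from step 1 and a namespace that exposes only the function under test (illustrative, not BFCL's actual runner):

def exec_score(call_string: str, func, expected) -> bool:
    """Execute the decoded call string and compare against the expected output."""
    namespace = {func.__name__: func, "__builtins__": {}}  # expose only the target function
    try:
        result = eval(call_string, namespace)
    except Exception:
        return False  # format was parseable, but the call failed at runtime
    return result == expected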

2.4 AST vs Exec Scenarios

  • AST ✓ + Exec ✓ — arguments correct and execution correct (ideal)
  • AST ✓ + Exec ✗ — format correct but the value causes a runtime error (e.g. missing API endpoint, expired token)
  • AST ✗ + Exec ✓ — arguments don’t match schema but execution happens to succeed (rare; likely a scorer bug)
[Interactive component: AST vs Exec scoring comparison, tracing the model output find_hospital(location="New York", radius=5) through both the AST (structure) and Exec (execution) scoring paths]

The component shows three cases side-by-side. Switch cases at the bottom to trace the same model output through both AST and Exec scoring paths.

3 Handler Architecture and Model Integration

3.1 Two Evaluation Modes

BFCL splits models into two evaluation modes:

  • FC (Function Calling) mode — models with native tool calling (OpenAI, Claude, Gemini, etc.). Functions are passed through the API’s tools parameter; the model returns structured results.
  • Prompt mode — models without native tool calling. Function definitions are serialized into the prompt text; function calls are parsed from the text output (via regex or custom parsers).

The same model may score 10–20% differently across the two modes — always label the mode when reporting BFCL scores.
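
A sketch of how the same function definition reaches the model in each mode; the prompt template and parsing regex below are illustrative, not BFCL's own:

import json
import re

function = {
    "name": "find_hospital",
    "description": "Find nearby hospitals",
    "parameters": {"type": "object", "properties": {"location": {"type": "string"}}},
}

# FC mode: the definition travels through the API's structured tools parameter,
# and the model returns structured tool calls -- no text parsing needed.
tools = [{"type": "function", "function": function}]

# Prompt mode: the definition is serialized into the prompt text...
prompt = ("You can call these functions:\n" + json.dumps([function], indent=2)
          + "\nRespond with exactly one call, e.g. name(arg=value).")

# ...and the call must be parsed back out of the free-text completion.
model_output = 'find_hospital(location="New York")'
match = re.search(r"(\w[\w.]*)\((.*)\)", model_output)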

3.2 Handler Inheritance

BaseHandler           # base class for API models
├── OpenAIHandler
├── ClaudeHandler
├── MistralHandler
├── GoogleHandler
└── CohereHandler

OSSHandler            # base for open-source / local models, extends BaseHandler
└── various OSS handlers

3.3 Core Methods

Each handler implements two parsing methods:

  • decode_ast(result) — raw API response → structured dict ({"name": ..., "arguments": {...}}). Used by AST scoring.
  • decode_execute(result) — raw API response → executable string (func(a=1, b="x")). Used by Exec scoring.
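
A minimal sketch of the pair, assuming the raw response has already been unwrapped to a Python-syntax call string (real handlers first dig the text or tool-call object out of the provider's response):

import ast

class MyModelHandler:  # would inherit from BaseHandler in BFCL
    def decode_ast(self, result: str) -> dict:
        """Raw response -> structured dict for the AST scorer."""
        call = ast.parse(result, mode="eval").body
        return {"name": ast.unparse(call.func),
                "arguments": {kw.arg: ast.literal_eval(kw.value)
                              for kw in call.keywords}}

    def decode_execute(self, result: str) -> str:
        """Raw response -> executable string for the Exec scorer."""
        return result.strip()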

3.4 Custom Handler in Four Steps

  1. Create a handler file under model_handler/api_inference/ or local_inference/
  2. Inherit from BaseHandler (API) or OSSHandler (local / open-source); implement decode_ast and decode_execute
  3. Register a ModelConfig in constants/model_config.py (name, org, handler class, is_fc_model flag, pricing, etc.)
  4. Update SUPPORTED_MODELS.md and supported_models.py

After these four steps, your new model works with bfcl generate --model <your_model>.
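
For step 3, the registration looks roughly like the entry below. The field names are inferred from the list above, not verified; check the actual ModelConfig definition in constants/model_config.py for your BFCL version:

# constants/model_config.py -- illustrative entry, field names not verified
"my-org/my-model": ModelConfig(
    model_name="my-org/my-model",
    org="my-org",
    model_handler=MyModelHandler,  # the class from step 2
    is_fc_model=True,              # False for Prompt-mode models
),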

4 Running Evaluation and Common Pitfalls

4.1 CLI Workflow

# Install
pip install bfcl-eval
export BFCL_PROJECT_ROOT=/path/to/project

# Generate model outputs (API models)
bfcl generate --model gpt-4o --test-category simple

# Generate model outputs (open-source, via vllm backend)
bfcl generate --model meta-llama/Llama-3-8B \
  --backend vllm --num-gpus 2

# Score
bfcl evaluate --model gpt-4o --test-category simple

--test-category accepts a single category (simple), multiple (simple,parallel), or a whole generation (all).

API key configuration — put a .env file under $BFCL_PROJECT_ROOT with variables like OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.
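
For example (placeholder values; include only the providers you actually run):

# $BFCL_PROJECT_ROOT/.env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...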

4.2 Common Pitfalls (5)

  1. Dot-to-underscore replacement — OpenAI / Mistral / Google auto-rename . to _ in function names (tokenizer / API schema constraints). The scorer handles this, but custom handlers should not manually re-replace in decode_ast — double replacement breaks scoring.
  2. Tuple serialization — JSON lacks a tuple type; a Python tuple round-trips as a list. The scorer normalizes this, but custom scoring code must treat them as equivalent (both this and pitfall 1's name normalization are sketched after this list).
  3. type: "dict" vs "object" — BFCL uses "type": "dict" in parameters, while OpenAI’s standard is "type": "object". Convert before calling the API.
  4. FC vs Prompt score divergence — same model can differ by 10–20% across modes. Always label the mode in reports. Before comparing with someone else’s scores, check which mode they used.
  5. V4 weight changes — V4 substantially reweighted categories (agentic → 40%). V4 totals are not directly comparable to V3 totals — they measure different things.
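
Pitfalls 1 and 2 reduce to two equivalence rules in custom scoring code. A minimal sketch:

def names_equal(expected: str, actual: str) -> bool:
    # Pitfall 1: api.search and api_search count as the same function name
    return expected.replace(".", "_") == actual.replace(".", "_")

def values_equal(expected, actual) -> bool:
    # Pitfall 2: JSON round-trips tuples as lists, so compare them as equal
    if isinstance(expected, tuple):
        expected = list(expected)
    if isinstance(actual, tuple):
        actual = list(actual)
    return expected == actual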

5 Summary

  • Choosing FC vs Prompt — prefer FC mode for models with native tool calling (higher scores, more stable); use Prompt mode otherwise.
  • V4 agentic evaluation — BFCL grew from “can it call a function?” to “can it complete an agent task?” Web search, memory management, and format robustness are all core agent capabilities.
  • Complement to other agent benchmarks — BFCL measures function-calling precision, GAIA measures end-to-end task completion, WebArena measures web interaction. Three different dimensions, none substitutes for the others.

For BFCL’s place in the agent capability hierarchy, see Agent Benchmarks §Level 1. For the benchmark ecosystem as a whole, see Benchmark Landscape.