#benchmark
10 articles
Intermediate
Agent & Tool Use Benchmarks
#benchmark
#agent
#function-calling
#tool-use
#bfcl
#gaia
Intermediate
Anatomy of Model Release Benchmark Standard Sets
#benchmark
#model-release
#standard-set
#small-models
#gemma
#phi
#qwen
Intermediate
Benchmark Landscape and Evaluation Methodology
#benchmark
#evaluation
#methodology
#llm-as-judge
#contamination
Intermediate
Code Benchmarks
#benchmark
#code
#humaneval
#swe-bench
#pass-at-k
Intermediate
Interpreting Leaderboards and Model Selection
#benchmark
#leaderboard
#model-selection
#chatbot-arena
#deployment
Intermediate
Impact of Optimization on Accuracy
#benchmark
#quantization
#accuracy
#perplexity
#openvino
#lm-eval-harness
#llama-cpp
Intermediate
Knowledge & Reasoning Benchmarks
#benchmark
#reasoning
#mmlu
#gpqa
#math
Advanced
BFCL Practical Guide
#benchmark
#bfcl
#function-calling
#tool-use
#evaluation
Advanced
lm-eval-harness Practical Guide
#benchmark
#lm-eval
#evaluation
#harness
#task-yaml
Advanced
SWE-bench Practical Guide
#benchmark
#swe-bench
#code-evaluation
#agent
#docker