LLM Evaluation and Benchmarks Deep Dive
A systematic treatment of LLM evaluation: from benchmark design principles to deep dives into specific benchmarks, and from assessing the accuracy impact of optimizations such as quantization to making model selection decisions. Covers knowledge, reasoning, code, and agent evaluation, with a focus on the OpenVINO toolchain and small-model assessment.
1. Benchmark Landscape and Evaluation Methodology
   Intermediate · #benchmark #evaluation #methodology #llm-as-judge #contamination
2. Knowledge & Reasoning Benchmarks
   Intermediate · #benchmark #reasoning #mmlu #gpqa #math
3. Code Benchmarks
   Intermediate · #benchmark #code #humaneval #swe-bench #pass-at-k
4. Agent & Tool Use Benchmarks
   Intermediate · #benchmark #agent #function-calling #tool-use #bfcl #gaia
5. Anatomy of Model Release Benchmark Standard Sets
   Intermediate · #benchmark #model-release #standard-set #small-models #gemma #phi #qwen
6. Impact of Optimization on Accuracy
   Intermediate · #benchmark #quantization #accuracy #perplexity #openvino #lm-eval-harness #llama-cpp
7. Interpreting Leaderboards and Model Selection
   Intermediate · #benchmark #leaderboard #model-selection #chatbot-arena #deployment
8. lm-eval-harness Practical Guide
   Advanced · #benchmark #lm-eval #evaluation #harness #task-yaml
9. SWE-bench Practical Guide
   Advanced · #benchmark #swe-bench #code-evaluation #agent #docker
10. BFCL Practical Guide
    Advanced · #benchmark #bfcl #function-calling #tool-use #evaluation