LLM Evaluation and Benchmarks Deep Dive

A systematic guide to LLM evaluation: from benchmark design principles to deep dives into specific benchmarks, and from assessing how optimization affects accuracy to making model selection decisions. Covers knowledge, reasoning, code, and agent evaluation, with a focus on the OpenVINO toolchain and small-model assessment.

  1. Benchmark Landscape and Evaluation Methodology
     Intermediate
     #benchmark #evaluation #methodology #llm-as-judge #contamination
  2. Knowledge & Reasoning Benchmarks
     Intermediate
     #benchmark #reasoning #mmlu #gpqa #math
  3. Code Benchmarks
     Intermediate
     #benchmark #code #humaneval #swe-bench #pass-at-k
  4. Agent & Tool Use Benchmarks
     Intermediate
     #benchmark #agent #function-calling #tool-use #bfcl #gaia
  5. Anatomy of Model Release Benchmark Standard Sets
     Intermediate
     #benchmark #model-release #standard-set #small-models #gemma #phi #qwen
  6. Impact of Optimization on Accuracy
     Intermediate
     #benchmark #quantization #accuracy #perplexity #openvino #lm-eval-harness #llama-cpp
  7. Interpreting Leaderboards and Model Selection
     Intermediate
     #benchmark #leaderboard #model-selection #chatbot-arena #deployment
  8. lm-eval-harness Practical Guide
     Advanced
     #benchmark #lm-eval #evaluation #harness #task-yaml
  9. SWE-bench Practical Guide
     Advanced
     #benchmark #swe-bench #code-evaluation #agent #docker
  10. BFCL Practical Guide
      Advanced
      #benchmark #bfcl #function-calling #tool-use #evaluation