Interpreting and running standardized language model benchmarks and evaluation datasets for both general and task-specific performance assessment!
Resources:
lm-evaluation-harness: github.com/EleutherAI/lm-evaluation-harness
lm-evaluation-harness setup script: drive.google.com/file/d/1oWoWSBUdCiB82R-8m52nv_-5p…
Open LLM Leaderboard: huggingface.co/spaces/open-llm-leaderboard/open_ll…
YALL Leaderboard: huggingface.co/spaces/mlabonne/Yet_Another_LLM_Lea…
MMLU Paper: arxiv.org/pdf/2009.03300
ARC Paper: arxiv.org/pdf/1803.05457
Orpo-Llama-3.2-1B-40k Model: huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-40k
Chapters:
00:00 - Introduction
01:21 - What Are LLM Benchmarks? MMLU Example
05:09 - Additional Benchmark Examples
09:03 - How to Interpret Benchmark Evaluations
14:40 - Running Evaluations: ARC-Challenge Setup
16:49 - Running Evaluations: lm-evaluation-harness Repo
19:02 - Running Evaluations: CLI Environment Setup
21:42 - Running Evaluations: Defining lm-eval Arguments (example command below)
24:27 - Running Evaluations: Starting Eval Run
26:49 - Running Evaluations: Interpreting Results
28:26 - Individual Implementation Differences
30:00 - Final Thoughts
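
Example eval run (a minimal sketch, not the exact settings from the video): the video walks through the lm-eval CLI, while the snippet below uses the harness's equivalent Python API. The model is the Orpo-Llama-3.2-1B-40k checkpoint linked above; the few-shot count, batch size, and device are illustrative assumptions.

# Minimal sketch of an ARC-Challenge eval with lm-evaluation-harness.
# Assumes `pip install lm_eval` and a CUDA GPU; the video uses the CLI
# (`lm_eval --model hf --tasks arc_challenge ...`) rather than this API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                                # Hugging Face backend
    model_args="pretrained=AdamLucek/Orpo-Llama-3.2-1B-40k",   # model linked in Resources
    tasks=["arc_challenge"],
    num_fewshot=25,    # illustrative; the Open LLM Leaderboard reports 25-shot ARC
    batch_size=8,      # illustrative
    device="cuda:0",
)

# Per-task metrics (acc / acc_norm for arc_challenge)
print(results["results"]["arc_challenge"])

results["results"] holds the per-task accuracy metrics (acc and acc_norm for ARC-Challenge), which is the output discussed in the Interpreting Results chapter.
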
#ai #datascience #programming