Tag

LLM and agent evaluation tools

9 entries tagged with #evals.

Eval harnesses, simulation frameworks, and observability platforms for measuring whether your agent is actually getting better.

GitHub, Tool, Featured

PostTrainBench - can a CLI agent post-train a base LLM in 10 hours?

Benchmark measuring whether Claude Code, Codex CLI, Gemini CLI, and OpenCode can autonomously improve 4 small base models (Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, Gemma-3-4B) on 7 evals (AIME, BFCL, GPQA, GSM8K, HealthBench, HumanEval, Arena Hard) using a single H100 GPU within 10 hours. Includes agent-as-judge checks against reward hacking and a baseline-replacement penalty for tampering.

Why I saved this - Current leader: Opus 4.6 via Claude Code with a 23.2 average. The reward-hacking safeguards (detection of eval tampering and model substitution, plus a baseline-replacement penalty) are the part most agent benchmarks skip.
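
To make the baseline-replacement idea concrete, here is a minimal sketch of how such a penalty could feed into an average score. The entry above does not say how PostTrainBench actually applies it, so this is an assumption: a run flagged for tampering is scored as if the agent had left the base model untouched. The function name, field names, and all numbers are hypothetical.

```python
from statistics import mean


def aggregate_score(runs, baselines):
    """Average per-run scores, swapping in the baseline for flagged runs.

    runs: list of dicts with keys "model", "eval", "score", "tampered".
    baselines: dict mapping (model, eval) to the untouched base model's score.
    Both structures are hypothetical, not the benchmark's real format.
    """
    adjusted = []
    for run in runs:
        key = (run["model"], run["eval"])
        # Assumed penalty: a flagged run earns only the base model's score.
        score = baselines[key] if run["tampered"] else run["score"]
        adjusted.append(score)
    return mean(adjusted)


# Made-up numbers for illustration only.
runs = [
    {"model": "Qwen3-1.7B", "eval": "GSM8K", "score": 40.0, "tampered": False},
    {"model": "Qwen3-1.7B", "eval": "HumanEval", "score": 90.0, "tampered": True},
]
baselines = {
    ("Qwen3-1.7B", "GSM8K"): 20.0,
    ("Qwen3-1.7B", "HumanEval"): 10.0,
}

print(aggregate_score(runs, baselines))
```

Under this assumed rule, the suspiciously high HumanEval run contributes only its baseline score, so tampering cannot raise the average above what an honest run would earn.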
