Tag

LLM and agent evaluation tools

9 entries tagged with #evals.

Eval harnesses, simulation frameworks, and observability platforms for measuring whether your agent is actually getting better.

GitHub Tool (Featured)

PostTrainBench - can a CLI agent post-train a base LLM in 10 hours?

Benchmark measuring whether Claude Code, Codex CLI, Gemini CLI, and OpenCode can autonomously improve 4 small base models (Qwen3-1.7B/4B, SmolLM3-3B, Gemma-3-4B) on 7 evals (AIME, BFCL, GPQA, GSM8K, HealthBench, HumanEval, Arena Hard) using a single H100 GPU within 10 hours. Includes an agent-as-judge check against reward hacking and baseline-replacement penalties for tampering.

Why I saved this - Current leader: Opus 4.6 via Claude Code at 23.2 average. The reward-hacking safeguards (eval tampering and model-substitution detection, baseline-replacement penalty) are the part most agent benchmarks skip.

#evals #claude-code #codex #gemini-cli #opencode

GitHub Library

OQP - verification protocol for AI agents

MCP-compatible spec defining four endpoints (capabilities, workflows, execute, assess-risk) so agents can prove a shipped change satisfies business requirements before it goes live.

#mcp #agent-security #evals #verification #ai-agent

GitHub Library

LABE - legal action boundary eval

Public benchmark that tests an agent at the moment it's about to take a high-impact legal action. Runs the same harness in baseline and verified modes and measures the drop in unjustified actions and the gain in goal completion.

#evals #agent-security #ai-agent #benchmark
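
The baseline-versus-verified comparison in the LABE entry above comes down to two differences in rates. A minimal sketch, assuming each run is summarized as a list of per-episode records with two boolean fields; the `Episode` class, its field names, and the `compare` helper are hypothetical illustrations and are not taken from the LABE repository:

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical record; field names are assumptions, not LABE's schema.
    unjustified_action: bool  # agent took a high-impact action it could not justify
    goal_completed: bool      # agent still finished the assigned task


def rate(flags: list[bool]) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def compare(baseline: list[Episode], verified: list[Episode]) -> dict[str, float]:
    """Return the drop in unjustified actions and the gain in goal completion."""
    drop = rate([e.unjustified_action for e in baseline]) - rate(
        [e.unjustified_action for e in verified]
    )
    gain = rate([e.goal_completed for e in verified]) - rate(
        [e.goal_completed for e in baseline]
    )
    return {"unjustified_action_drop": drop, "goal_completion_gain": gain}
```

A positive value for both keys would mean the verified configuration acts unjustifiably less often while completing more goals; how LABE itself defines and weights these numbers may differ.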

GitHub Tool

dirac - open-source coding agent for TerminalBench

Open-source coding agent that scored 65.2% on TerminalBench with Gemini 3 Flash, beating Junie CLI and Google's official harness. The run is leaderboard-compliant, with full transcripts and no AGENTS.md tricks.

#evals #ai-agent #cli

GitHub Library

passmark - Playwright AI regression testing

Open-source Playwright library for AI-driven browser regression testing with intelligent caching, auto-healing locators, and multi-model verification. Designed to keep flaky AI tests stable across model versions.

#evals #ai-agent #typescript #developer-tools

GitHub Hack

awesome-autoresearch - autonomous research agent loops

Curated list of self-improvement loops, research agents, and autoresearch systems following Karpathy's framing. Useful index when designing multi-step agent harnesses.

#claude-code #ai-agent #multi-agent #evals

GitHub Tool

litmus - unit tests for AI prompts

TypeScript CLI that runs unit tests against prompts: compare models, check outputs, track cost. Treats prompts as code that needs CI.

#evals #typescript #cli #prompt-engineering #developer-tools
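
The idea of treating prompts as code can be sketched without any particular tool. The example below is not litmus's API (litmus is a TypeScript CLI); it is a Python illustration in which `fake_model` is a hypothetical stub standing in for a real model call, and `run_prompt_test` is an assumed helper that checks the output and records a rough cost:

```python
from typing import Callable


def fake_model(prompt: str) -> tuple[str, float]:
    """Hypothetical stand-in for a model call; returns (text, cost in USD)."""
    answer = "4" if "2 + 2" in prompt else "unknown"
    # Made-up pricing: a flat rate per whitespace-separated word in the prompt.
    return answer, 0.0001 * len(prompt.split())


def run_prompt_test(
    model: Callable[[str], tuple[str, float]],
    prompt: str,
    check: Callable[[str], bool],
) -> dict:
    """Call the model once and report whether the output passes the check."""
    output, cost = model(prompt)
    return {"passed": check(output), "output": output, "cost": cost}


result = run_prompt_test(
    fake_model,
    "What is 2 + 2? Reply with the number only.",
    lambda out: out.strip() == "4",
)
```

Swapping `fake_model` for a different callable is how a comparison across models would look in this sketch; a CI job could fail the build whenever `result["passed"]` is false.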

GitHub Article

awesome-harness-engineering - agent harness toolkit

Curated awesome list for AI agent harness engineering: tools, patterns, evals, memory, MCP, permissions, observability, and orchestration.

#awesome-list #agent-harness #mcp #evals #agent-memory

GitHub Library

ADK-JS - Google's agent dev kit for TypeScript

Code-first toolkit for building, evaluating, and deploying agents on Google's stack. Tool wiring, traces, and eval harness in one package.

#agent-framework #google #typescript #evals #agentic-ai
