Tag

#agent-eval

2 entries tagged with #agent-eval.

MCPMark - stress-testing MCP benchmark

Benchmark harness that evaluates models and agents on real-world MCP (Model Context Protocol) usage. Produces comparable scores across servers and frontier models.

#mcp #benchmark #evaluation #agent-eval #tool-use

GitHub Tool

Future AGI - open-source LLM evals & observability

End-to-end platform for evaluating, observing, and improving LLM and agent apps. Tracing, evals, simulations, datasets, and prompt management in one project.

#llm-eval #observability #tracing #agent-eval #self-hosted

Browse other tags

#claude-code (21) #cli (20) #mcp (16) #go (14) #rust (14) #ai-agent (10) #codex (10) #self-hosted (10) #devops (9) #observability (8) #python (8) #developer-tools (7)