MCPMark - stress-testing MCP benchmark
Benchmark harness that evaluates models and agents on real-world MCP usage. Comparable scores across servers and frontier models.
2 entries tagged with #agent-eval.
End-to-end platform for evaluating, observing, and improving LLM and agent apps. Tracing, evals, simulations, datasets, and prompt management in one project.