awesome-harness-engineering - agent harness toolkit
Curated awesome list for AI agent harness engineering: tools, patterns, evals, memory, MCP, permissions, observability, and orchestration.
There are many "awesome AI agent" lists. Most of them collect frameworks. This one collects something more specific and more useful: the harness - the scaffolding around the model that determines whether an agent actually works on real tasks.
The framing the maintainers picked is the right one: harness engineering is the discipline of designing context delivery, tool interfaces, planning artifacts, verification loops, memory systems, and sandboxes. Models can't do these things alone, and the best harnesses are designed knowing those components will become unnecessary as models improve. That insight is what keeps the list from becoming a frameworks dump.
How it's organized
The list is structured around the problem each component solves, not the vendor that built it. The top-level sections:
- Foundations - canonical essays from OpenAI, Anthropic, Google, Microsoft, Meta, Red Hat, LangChain, Martin Fowler that define what harness engineering actually is.
- Design Primitives - the components a harness is composed of:
- Agent Loop - ReAct, LangGraph, the Codex agent loop, middleware
- Planning & Task Decomposition - Plan.md / Implement.md patterns, plan-and-execute, multi-agent topologies
- Context Delivery & Compaction - what the agent sees, when, and how it shrinks
- Tool Design - schemas, naming, error surfaces (the "tool design is agent UX" school)
- Skills & MCP - protocol-level integration
- Permissions & Authorization - structured permission systems vs natural-language prompts
- Memory & State - episodic, long-term, cross-session
- Task Runners & Orchestration - the pieces that drive multiple agents
- Verification & CI Integration - getting the agent to check its own work
- Observability & Tracing - knowing what happened
- Debugging & Developer Experience - inspecting the trace
- Human-in-the-Loop - approval flows and intervention triggers
- Reference Implementations - tutorials, generators/meta-harnesses, demo harnesses, adjacent collections.
- Security, Sandbox & Permissions - the layer most teams under-invest in until it bites.
- Evals & Verification - measuring what you've built.
- Templates - drop-in artifacts.
Each entry is annotated with what makes it worth reading, not just a one-line description. This is the part that's hard to maintain and the reason the list is more useful than a search-engine query for the same terms.
Why "harness engineering" as a separate discipline
Three pieces of writing in the Foundations section explain it best:
- OpenAI's "Harness Engineering" - the framing piece. Defines harness engineering as the design of the scaffolding that lets agents operate reliably.
- Martin Fowler's synthesis - reframes the discipline as three interlocking systems: context engineering (curating what the agent knows), architectural constraints (deterministic linters and structural tests), and entropy management (periodic agents that repair documentation drift). The "humans on the loop" framing is the clearest conceptual map of what the discipline actually is.
- LangChain's "Anatomy of an Agent Harness" - structural breakdown into five primitives: filesystem, code execution, sandbox, memory, context management. Includes the co-evolution warning: models trained against specific harnesses can become overfitted to those designs - a reason architecture choices have lasting consequences.
If you've felt the difference between "the model is smart but the agent is unreliable" and "the model is the same and now the agent works," you've been doing harness engineering whether you called it that or not.
The papers worth your time
The list pulls together a surprising amount of recent peer-reviewed and industry research:
- "Building AI Coding Agents for the Terminal" - the first systematic practitioner paper on terminal-native coding agent harness design. Eager-construction scaffolding, compound multi-model architectures, schema-filtered planning subagents.
- "A Scheduler-Theoretic Framework for LLM Agent Execution" (April 2026) - 70 open-source agent projects analysed; 60% adopt the Agent Loop pattern. Maps execution patterns onto a unified control model so the trade-offs become explicit.
- "The Design Space of Today's and Future AI Agent Systems" - reverse-engineering of Claude Code: five-stage progressive compaction, subagent isolation with rebuilt permission contexts, 27-event-type hook pipeline.
- "Improving Deep Agents with Harness Engineering" (LangChain case study) - harness-only changes moved their coding agent from rank 30 to top 5 on Terminal Bench 2.0 with no model swap. The strongest published demonstration that harness design is the primary performance lever.
- Microsoft's Azure SRE Agent - 35,000+ production incidents handled autonomously, time-to-mitigation cut from 40.5 hours to 3 minutes. Most data-backed production case study published in 2026.
That's not an exhaustive sample. It's the kind of mix the list is good at.
When to use it
- You're building a serious agent system and want to know what's been tried before you spend a quarter rediscovering it.
- You're catching up after a few months away - the list moves fast and the recent additions are usually the most interesting.
- You're hiring or onboarding for harness work and want a reading list that isn't "skim our docs."
When it's not the right resource
- You want a quick API tutorial. This is a depth resource, not a how-to.
- You're looking for marketing-style recommendations between specific frameworks. The list deliberately classifies by problem solved, not vendor.
Practical notes
CC0 licensed. The maintainers actively curate - check the commit log to confirm freshness on whatever section you're reading. Translations exist in nine languages on zdoc.app. The list is hosted on GitHub with the standard awesome-list contribution path: open a PR with the entry and a real annotation, not a one-liner.
If you only have time for one resource on this page, start with Anthropic's "Harness Design for Long-Running Application Development" or OpenAI's "Unrolling the Codex Agent Loop." Either gives you a vocabulary you'll keep using.
Recent discussion
From the wider webAgent-Analytics/awesome-multi-agent-orchestrators
github.com · Apr 29, 2026
njulj/Awesome-Agent-Based-Low-Level-Vision
github.com · Apr 29, 2026
peter123023/awesome-claude-api
github.com · Apr 29, 2026
rbbydotdev/awesome-just-bash
github.com · Apr 29, 2026
YouMind-OpenLab/awesome-gemini-3-prompts
github.com · Apr 29, 2026
Featured in
MCP servers and Model Context Protocol tools
Production MCP servers, gateways, frameworks, and clients - everything in this directory that speaks the Model Context Protocol.
Memory and knowledge graphs for AI agents
Memory layers, knowledge graphs, and persistent context stores for agents - the substrate underneath useful long-running systems.
Related entries
mcptube - Karpathy-style LLM wiki for YouTube
MCP server that turns YouTube videos into a persistent, merging wiki rather than ephemeral vector chunks. Scene-change frame extraction + vision analysis captures slides, code, and diagrams that transcripts miss. 25+ MCP tools, FTS5+LLM hybrid retrieval, version history with source attribution per claim.
stash - persistent memory layer for AI agents
Self-hosted single-binary Go service that stores episodes, facts, and working context for AI agents in Postgres. Ships an MCP server and runs without cloud dependencies.
Wax - on-device RAG memory layer in Swift
Single-file Swift memory layer with sub-millisecond RAG on Apple Silicon via CoreML and Metal. No server, no API, ships an MCP server out of the box.
claude-historian-mcp - search past Claude Code chats
MCP server that indexes Claude Code conversation history and exposes it as a searchable tool. Lets agents recall what was decided in past sessions instead of re-deriving it.