awesome-harness-engineering - agent harness toolkit

Curated awesome list for AI agent harness engineering: tools, patterns, evals, memory, MCP, permissions, observability, and orchestration.

Saved Apr 25, 2026 · 4 min read

#awesome-list #agent-harness #mcp #evals #agent-memory

There are many "awesome AI agent" lists. Most of them collect frameworks. This one collects something more specific and more useful: the harness - the scaffolding around the model that determines whether an agent actually works on real tasks.

The framing the maintainers picked is the right one: harness engineering is the discipline of designing context delivery, tool interfaces, planning artifacts, verification loops, memory systems, and sandboxes. Models can't supply these on their own, and the best harnesses are designed on the expectation that these components will become unnecessary as models improve. That insight is what keeps the list from becoming a frameworks dump.

How it's organized

The list is structured around the problem each component solves, not the vendor that built it. The top-level sections:

Foundations - canonical essays from OpenAI, Anthropic, Google, Microsoft, Meta, Red Hat, LangChain, and Martin Fowler that define what harness engineering actually is.
Design Primitives - the components a harness is composed of:
- Agent Loop - ReAct, LangGraph, the Codex agent loop, middleware
- Planning & Task Decomposition - Plan.md / Implement.md patterns, plan-and-execute, multi-agent topologies
- Context Delivery & Compaction - what the agent sees, when, and how it shrinks
- Tool Design - schemas, naming, error surfaces (the "tool design is agent UX" school)
- Skills & MCP - protocol-level integration
- Permissions & Authorization - structured permission systems vs natural-language prompts
- Memory & State - episodic, long-term, cross-session
- Task Runners & Orchestration - the pieces that drive multiple agents
- Verification & CI Integration - getting the agent to check its own work
- Observability & Tracing - knowing what happened
- Debugging & Developer Experience - inspecting the trace
- Human-in-the-Loop - approval flows and intervention triggers
Reference Implementations - tutorials, generators/meta-harnesses, demo harnesses, adjacent collections.
Security, Sandbox & Permissions - the layer most teams under-invest in until it bites.
Evals & Verification - measuring what you've built.
Templates - drop-in artifacts.
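
To make a few of the design primitives above concrete (agent loop, structured permissions, human-in-the-loop approval), here is a minimal sketch. It is not taken from any entry on the list: the names `Tool`, `Permission`, `Action`, and `run_agent` are hypothetical, and `fake_model` is a scripted stand-in for an LLM so the example runs without an API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Permission(Enum):
    ALLOW = "allow"  # run without asking
    ASK = "ask"      # require human approval first
    DENY = "deny"    # never run


@dataclass
class Tool:
    name: str
    run: Callable[[str], str]
    permission: Permission


@dataclass
class Action:
    tool: Optional[str]  # None means the model is giving its final answer
    arg: str


def run_agent(model, tools, approve, max_steps=5):
    """Drive a model through an act/observe loop until it answers."""
    transcript = []
    for _ in range(max_steps):
        action = model(transcript)
        if action.tool is None:
            return action.arg, transcript
        tool = tools.get(action.tool)
        # Permission is checked as data, not as a sentence in the prompt.
        if tool is None:
            observation = f"error: unknown tool {action.tool!r}"
        elif tool.permission is Permission.DENY:
            observation = f"error: {tool.name} is not permitted"
        elif tool.permission is Permission.ASK and not approve(tool.name, action.arg):
            observation = f"error: {tool.name} was rejected by the reviewer"
        else:
            observation = tool.run(action.arg)
        # Errors are fed back as observations so the model can react to them.
        transcript.append((action.tool, action.arg, observation))
    return "stopped: step budget exhausted", transcript


def fake_model(transcript):
    """Scripted stand-in for an LLM: read a file, try a delete, then answer."""
    script = [Action("read_file", "notes.txt"), Action("delete_file", "notes.txt")]
    if len(transcript) < len(script):
        return script[len(transcript)]
    return Action(None, f"done after {len(transcript)} tool calls")


tools = {
    "read_file": Tool("read_file", lambda path: f"contents of {path}", Permission.ALLOW),
    "delete_file": Tool("delete_file", lambda path: f"deleted {path}", Permission.ASK),
}

# The reviewer callback rejects everything, so the delete never executes.
answer, transcript = run_agent(fake_model, tools, approve=lambda name, arg: False)
print(answer)
for step in transcript:
    print(step)
```

The point of the sketch is the shape, not the details: the permission decision lives in a typed field the harness enforces, and a rejected call comes back to the model as an ordinary observation rather than crashing the loop.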

Each entry is annotated with what makes it worth reading, not just a one-line description. This is the part that's hard to maintain and the reason the list is more useful than a search-engine query for the same terms.

Why "harness engineering" as a separate discipline

Three pieces of writing in the Foundations section explain it best:

OpenAI's "Harness Engineering" - the framing piece. Defines harness engineering as the design of the scaffolding that lets agents operate reliably.
Martin Fowler's synthesis - reframes the discipline as three interlocking systems: context engineering (curating what the agent knows), architectural constraints (deterministic linters and structural tests), and entropy management (periodic agents that repair documentation drift). The "humans on the loop" framing is the clearest conceptual map of what the discipline actually is.
LangChain's "Anatomy of an Agent Harness" - structural breakdown into five primitives: filesystem, code execution, sandbox, memory, context management. Includes the co-evolution warning: models trained against specific harnesses can become overfitted to those designs - a reason architecture choices have lasting consequences.
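
Of the three systems in Fowler's synthesis, architectural constraints are the easiest to picture in code. As an illustrative sketch only (the layering rule, the module names `ui` and `db`, and the helper `forbidden_imports` are hypothetical and not from his article), a deterministic structural test can parse source with Python's standard `ast` module and report imports that cross a boundary:

```python
import ast


def forbidden_imports(source: str, banned: set[str]) -> list[str]:
    """Return the banned top-level packages imported by the given source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in banned:
                found.add(root)
    return sorted(found)


# Hypothetical rule: code in the "ui" layer must not import the "db" layer.
ui_module_source = """
import json
from db.models import User
import services.accounts
"""

violations = forbidden_imports(ui_module_source, banned={"db"})
print(violations)
```

Because the check is deterministic, it can run in CI and give an agent the same pass/fail signal on every attempt, which is what separates it from a natural-language instruction to "respect the layering."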

If you've felt the difference between "the model is smart but the agent is unreliable" and "the model is the same and now the agent works," you've been doing harness engineering whether you called it that or not.

The papers worth your time

The list pulls together a surprising amount of recent peer-reviewed and industry research:

"Building AI Coding Agents for the Terminal" - the first systematic practitioner paper on terminal-native coding agent harness design. Eager-construction scaffolding, compound multi-model architectures, schema-filtered planning subagents.
"A Scheduler-Theoretic Framework for LLM Agent Execution" (April 2026) - 70 open-source agent projects analysed; 60% adopt the Agent Loop pattern. Maps execution patterns onto a unified control model so the trade-offs become explicit.
"The Design Space of Today's and Future AI Agent Systems" - reverse-engineering of Claude Code: five-stage progressive compaction, subagent isolation with rebuilt permission contexts, 27-event-type hook pipeline.
"Improving Deep Agents with Harness Engineering" (LangChain case study) - harness-only changes moved their coding agent from rank 30 to top 5 on Terminal Bench 2.0 with no model swap. The strongest published demonstration that harness design is the primary performance lever.
Microsoft's Azure SRE Agent - 35,000+ production incidents handled autonomously, time-to-mitigation cut from 40.5 hours to 3 minutes. The most data-backed production case study published in 2026.

That sample is not exhaustive, but it shows the kind of mix the list is good at.

When to use it

You're building a serious agent system and want to know what's been tried before you spend a quarter rediscovering it.
You're catching up after a few months away - the list moves fast and the recent additions are usually the most interesting.
You're hiring or onboarding for harness work and want a reading list that isn't "skim our docs."

When it's not the right resource

You want a quick API tutorial. This is a depth resource, not a how-to.
You're looking for marketing-style recommendations between specific frameworks. The list deliberately classifies by problem solved, not vendor.

Practical notes

CC0 licensed. The maintainers actively curate - check the commit log to confirm freshness on whatever section you're reading. Translations exist in nine languages on zdoc.app. The list is hosted on GitHub with the standard awesome-list contribution path: open a PR with the entry and a real annotation, not a one-liner.

If you only have time for one resource on this page, start with Anthropic's "Harness Design for Long-Running Application Development" or OpenAI's "Unrolling the Codex Agent Loop." Either gives you a vocabulary you'll keep using.

Related entries

GitHub Tool · Featured

mcptube - Karpathy-style LLM wiki for YouTube

MCP server that turns YouTube videos into a persistent, merging wiki rather than ephemeral vector chunks. Scene-change frame extraction + vision analysis captures slides, code, and diagrams that transcripts miss. 25+ MCP tools, FTS5+LLM hybrid retrieval, version history with source attribution per claim.

Why I saved this - The wiki-merge design is the differentiator vs RAG-over-YouTube clones - one MCP article with citations, not ten near-duplicate chunks. Scene-change extraction is what makes visual-heavy talks usable.

#mcp #claude-code #knowledge-graph #agent-memory #python

GitHub Tool

stash - persistent memory layer for AI agents

Self-hosted single-binary Go service that stores episodes, facts, and working context for AI agents in Postgres. Ships an MCP server and runs without cloud dependencies.

#mcp #agent-memory #go #self-hosted

GitHub Library

Wax - on-device RAG memory layer in Swift

Single-file Swift memory layer with sub-millisecond RAG on Apple Silicon via CoreML and Metal. No server, no API, ships an MCP server out of the box.

#mcp #agent-memory #macos

GitHub Tool

claude-historian-mcp - search past Claude Code chats

MCP server that indexes Claude Code conversation history and exposes it as a searchable tool. Lets agents recall what was decided in past sessions instead of re-deriving it.

#mcp #claude-code #agent-memory #typescript