M-flow - relevance-first agent memory
Memory layer for agents that surfaces relevant context instead of just similar embeddings. Episodic plus long-term memory backed by a knowledge graph.
The one-line pitch from the README is unusually precise: "RAG matches chunks. GraphRAG structures context. M-flow scores evidence paths." That's the right framing because M-flow doesn't sit in the same lane as either approach - it changes what the graph is allowed to do at retrieval time.
Most GraphRAG systems use the graph to organize, summarize, or expand context, but the actual scoring is still vector similarity. M-flow inverts that: vectors are the entry points, and the graph is the scoring engine.
The core idea: relevance is a path, not a score
Similarity is proximity in representation space. Relevance is whether the system can connect the query to the answer through a coherent structure of evidence. The two overlap, but they're not the same.
The README's example: the query "Why was Maria upset at Monday's standup?"
A traditional retriever sees "standup" + "upset" + "team" and surfaces a generic article on running effective standups. The keywords overlap. The actual cause - that Maria was blindsided by a weekend deadline change - is in a different document with no shared keywords.
M-flow stores knowledge in a four-layer cone graph: Episode → Facet → FacetPoint → Entity. The query lands on the layer matching its granularity (a precise cue like "I wasn't told about the deadline" anchors on a FacetPoint), then graph propagation routes up through the Facet to the Episode it belongs to. What's returned is the Episode bundle - one coherent unit of context, scored by its strongest supporting path of evidence.
One strong path is enough, the way a single association can trigger an entire memory.
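A minimal sketch of that walk-up, to make the shape concrete. This is not M-flow's API: the `Node` class, the `hop_decay` factor, and the anchor strengths are all assumptions made up for illustration, and the Entity layer is left out to keep it short. Each entry hit climbs to its Episode, and an Episode keeps only the score of its strongest path.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    id: str
    layer: str                    # "Episode", "Facet", or "FacetPoint"
    text: str
    parent: Optional[str] = None  # id of the node one layer up, None for an Episode

# Hypothetical toy graph built from the standup example.
NODES = {n.id: n for n in [
    Node("ep1", "Episode", "Weekend deadline change and the Monday standup fallout"),
    Node("f1", "Facet", "How the deadline change was communicated", parent="ep1"),
    Node("fp1", "FacetPoint", "Maria wasn't told about the deadline change", parent="f1"),
    Node("ep2", "Episode", "Guide to running effective standups"),
    Node("f2", "Facet", "Standup format tips", parent="ep2"),
]}

def walk_up(node_id):
    """Return the ids on the path from an entry node up to its Episode."""
    path = [node_id]
    while NODES[path[-1]].parent is not None:
        path.append(NODES[path[-1]].parent)
    return path

def rank_episodes(entry_hits, hop_decay=0.9):
    """Score each Episode by its single strongest supporting path.

    entry_hits maps a node id to an anchor strength in [0, 1]
    (assumed to come from a vector search; the numbers below are invented).
    """
    best = {}
    for node_id, strength in entry_hits.items():
        path = walk_up(node_id)
        score = strength * hop_decay ** (len(path) - 1)  # each hop costs a little
        episode = path[-1]
        if episode not in best or score > best[episode][0]:
            best[episode] = (score, path)
    return sorted(best.items(), key=lambda item: item[1][0], reverse=True)

# A precise cue anchors on the FacetPoint; a keyword-level match anchors on a Facet.
ranked = rank_episodes({"fp1": 0.92, "f2": 0.55})
top_episode, (top_score, top_path) = ranked[0]
```

Here the FacetPoint hit carries `ep1` to the top even though it sits two hops away, because one strong anchor outweighs the weaker, shorter path into the generic standup guide.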
Four layers, one graph
| Level | What it captures | Query example |
|---|---|---|
| Episode | Bounded semantic focus - an incident, decision process, workflow | "What happened with the tech stack decision?" |
| Facet | One dimension of an Episode - a topical cross-section | "What were the performance targets?" |
| FacetPoint | An atomic assertion or fact derived from a Facet | "Was the P99 target under 500ms?" |
| Entity | A named thing - person, tool, metric - linked across all Episodes | "Tell me about GPT-4o" - surfaces all related contexts |
The point of the cone graph isn't multi-level storage - several memory systems already do that. It's multi-level retrieval. A precise query can enter through a FacetPoint and walk up to its Episode. A thematic query enters through an Episode summary. An entity-centred query can bridge multiple Episodes through the same Entity node. The user doesn't have to know which layer to query.
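The Entity bridge is the easiest of these to picture in code. The sketch below assumes ingestion has already produced (episode, entity) mention pairs; the episode ids and the helper names are hypothetical, not taken from M-flow.

```python
from collections import defaultdict

# Hypothetical (episode_id, entity) mention pairs produced at ingestion.
MENTIONS = [
    ("ep_tech_stack", "GPT-4o"),
    ("ep_tech_stack", "P99 latency"),
    ("ep_cost_review", "GPT-4o"),
    ("ep_standup", "Maria"),
]

def build_entity_index(mentions):
    """Map each Entity to the set of Episodes it is linked to."""
    index = defaultdict(set)
    for episode_id, entity in mentions:
        index[entity].add(episode_id)
    return index

def episodes_for_entity(index, entity):
    """An entity-centred query bridges every Episode sharing that Entity node."""
    return sorted(index.get(entity, set()))

index = build_entity_index(MENTIONS)
bridged = episodes_for_entity(index, "GPT-4o")
```

In this toy data, `bridged` holds both `ep_cost_review` and `ep_tech_stack`: one Entity node connects two otherwise unrelated Episodes.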
Six properties of the retriever
Worth listing because they're each distinct from how a chunked-RAG or vanilla GraphRAG system would behave:
- Graph-led, not similarity-led - vectors only open entry points; the graph decides relevance.
- Evidence-path scoring - results are ranked by the strongest supporting path, not flat similarity. Path-cost optimization over the graph.
- Unified multi-granularity - Episodes, Facets, FacetPoints, and Entities are all entry points, connected in one graph.
- Semantic edges as first-class signals - edges carry natural-language edge_text; relationships are searchable and scored, not just structural.
- Controlled propagation - each hop expands context but adds cost. Only coherent, low-cost paths survive. Not a naive walk.
- Adaptive and noise-resistant - broad matches get penalized so "looks relevant" doesn't beat "is relevant"; node and edge importance adapts per query.
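One way to picture how several of these properties interact is a shortest-path search with a cost budget. The sketch below uses plain multi-source Dijkstra as a stand-in; the source does not say which algorithm M-flow uses, and the edge costs, the `breadth_weight` penalty, and the `max_cost` budget are all invented for illustration.

```python
import heapq
import math

# Hypothetical toy graph: (node_a, node_b, edge_cost). Lower cost = more coherent link.
EDGES = [
    ("fp:not_told", "facet:communication", 0.1),
    ("facet:communication", "ep:deadline_change", 0.1),
    ("ent:standup", "ep:deadline_change", 0.3),
    ("ent:standup", "ep:standup_guide", 0.3),
    ("ent:standup", "ep:retro_notes", 0.3),
    ("ent:standup", "ep:onboarding", 0.3),
]

def build_adjacency(edges):
    adj = {}
    for a, b, cost in edges:
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))
    return adj

def cheapest_episode_costs(adj, entry_costs, max_cost=0.7, breadth_weight=0.3):
    """Multi-source Dijkstra from the vector entry points.

    entry_costs maps an entry node to a starting cost (think 1 - similarity).
    Leaving a node with many neighbours adds a penalty, so broad hubs are
    expensive to pass through, and paths over max_cost are dropped.
    """
    best = {}
    heap = [(cost, node) for node, cost in entry_costs.items()]
    heapq.heapify(heap)
    while heap:
        cost, node = heapq.heappop(heap)
        if node in best or cost > max_cost:
            continue  # already settled more cheaply, or over budget
        best[node] = cost
        neighbours = adj.get(node, [])
        penalty = breadth_weight * math.log(max(1, len(neighbours)))
        for neighbour, edge_cost in neighbours:
            if neighbour not in best:
                heapq.heappush(heap, (cost + edge_cost + penalty, neighbour))
    return {n: c for n, c in best.items() if n.startswith("ep:")}

adj = build_adjacency(EDGES)
# The broad "standup" match has the better entry cost, but fans out to four Episodes.
entries = {"ent:standup": 0.05, "fp:not_told": 0.10}
surviving = cheapest_episode_costs(adj, entries)
everything = cheapest_episode_costs(adj, entries, max_cost=1.0)
```

With the default budget only `ep:deadline_change` survives, reached through the precise FacetPoint. Raising the budget lets the generic Episodes back in, but they still rank behind it, because every path through the broad `ent:standup` hub pays the fan-out penalty.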
Coreference resolution at ingestion (the underrated detail)
This is the part most memory systems quietly skip and then quietly fail because of.
When a stream of turns says "Maria raised the deadline issue at Monday's standup" and then "She said she wasn't told about the change," a chunked retriever will index Turn 2 as a separate document with no Maria token. A later query "What did Maria say about the deadline?" finds Turn 1 and silently misses Turn 2 - the evidence is invisible because the anchor is missing.
M-flow resolves pronouns at ingestion time. Turn 2 gets rewritten to "Maria said Maria wasn't told about the change" before indexing, so the same Entity bridge picks up both turns. This is the kind of fix that doesn't show up in headline benchmarks but changes the failure mode in production.
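To show what the rewrite step does, here is a deliberately naive sketch. It is a rule-based toy, not M-flow's resolver: it assumes a known list of people, handles only three subject pronouns, and always binds a pronoun to the most recently mentioned person. A real resolver needs a proper coreference model and has to deal with possessives, ambiguity, and non-person referents.

```python
import re

PRONOUNS = {"she", "he", "they"}  # deliberately tiny, for illustration only

def resolve_pronouns(turns, known_people):
    """Replace subject pronouns with the most recently mentioned known person."""
    last_person = None
    resolved = []
    for turn in turns:
        # Pronouns in this turn refer back to a person seen in an earlier turn.
        antecedent = last_person
        rewritten = re.sub(
            r"[A-Za-z]+",
            lambda m: antecedent
            if antecedent and m.group(0).lower() in PRONOUNS
            else m.group(0),
            turn,
        )
        resolved.append(rewritten)
        # Remember whichever known person appears last in the original turn.
        mentions = [
            (m.start(), name)
            for name in known_people
            for m in re.finditer(rf"\b{re.escape(name)}\b", turn)
        ]
        if mentions:
            last_person = max(mentions)[1]
    return resolved

turns = [
    "Maria raised the deadline issue at Monday's standup.",
    "She said she wasn't told about the change.",
]
resolved = resolve_pronouns(turns, known_people=["Maria"])
```

After this pass the second turn reads "Maria said Maria wasn't told about the change." and carries the Maria anchor that the Entity bridge needs.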
Why this is interesting beyond chatbots
The README's framing - "operates like a cognitive memory system" - is more concrete than it sounds. The path-cost analogy is recall: thinking of classmate A surfaces "A grew up in California," which opens the wider neighbourhood of California-related memories, which makes "Lakers fan" a low-cost next association. Classical RAG can't model that chain because it has no structural notion of cost - only similarity. M-flow does.
For long-running agents, this matters. The agent needs to be able to ask "why did we decide X?" three weeks after the decision and get back the Episode that contains the trade-offs, not the closest-matching paragraph from the meeting notes. That's what evidence-path retrieval actually buys you.
Quick start
Apache 2.0 licensed, Python 3.10–3.13. Install via pip:
pip install m-flow
Examples and the full retrieval architecture write-up live in the repo; docs/RETRIEVAL_ARCHITECTURE.md is where the path-cost mechanism is spelled out in detail. There's also an OpenClaw skill listed (mflow-memory) for plugging M-flow into the Claw ecosystem directly.
When to reach for it
- Long-running agents where memory matters across sessions and conversations.
- Retrieval workloads where similarity-only matching is producing "looks relevant" misfires.
- Knowledge bases with internal structure (incidents, decision logs, project histories) that traditional RAG flattens away.
When not to
- Pure factual lookup over short documents. Plain vector RAG is simpler and good enough.
- Workloads with no entity overlap or repeated structure - the graph adds value when there's something to traverse.
- Sub-millisecond latency budgets. Path-cost retrieval is more work than a single-vector ANN lookup; you are trading speed for precision.
Limits worth knowing
The retrieval quality depends on ingestion quality - if the cone graph isn't well-formed (Episodes vague, Facets bloated, FacetPoints duplicated), the path-cost mechanism has nothing solid to walk. Plan for an iterate-on-ingestion phase, not just an iterate-on-prompts phase.
The 963-test suite the README highlights is a useful signal that the implementation is being maintained seriously, but read the architecture doc end-to-end before betting a production system on it.
Related entries
mcptube - Karpathy-style LLM wiki for YouTube
MCP server that turns YouTube videos into a persistent, merging wiki rather than ephemeral vector chunks. Scene-change frame extraction + vision analysis captures slides, code, and diagrams that transcripts miss. 25+ MCP tools, FTS5+LLM hybrid retrieval, version history with source attribution per claim.
wuphf - Karpathy-style LLM wiki for agents
Local wiki layer where agents read and write Markdown under git, indexed with bleve (BM25) and SQLite. Designed so context compounds across sessions instead of being re-pasted.
atlas-mcp-server - Neo4j task system for agents
MCP server backed by Neo4j that gives LLM agents project, task, and knowledge tiers for managing complex multi-step workflows, with deep-research mode.
iwe - Markdown memory graph for AI agents
Rust CLI that turns a Markdown notes folder into a queryable knowledge graph for both you and your agent, with Helix integration and GTD primitives.