mcptube - Karpathy-style LLM wiki for YouTube
MCP server that turns YouTube videos into a persistent, merging wiki rather than ephemeral vector chunks. Scene-change frame extraction + vision analysis captures slides, code, and diagrams that transcripts miss. 25+ MCP tools, FTS5+LLM hybrid retrieval, version history with source attribution per claim.
mcptube takes Karpathy's "your agent maintains an LLM wiki" idea and points it at YouTube. Each video you ingest doesn't go into a vector store as ephemeral chunks - it gets turned into entities, topics, and concepts that merge with everything already in the wiki. Watch ten talks about MCP and you end up with a coherent MCP article, not ten unconnected transcripts.
The other half of the design is treating video as video, not just transcript. Scene-change detection picks frames where something actually changed (a new slide, a code panel, a diagram), and a vision pass extracts the visual content most transcript-only tools miss.
The architecture worth understanding
Three components stacked:
- Ingest layer - download + transcription + scene-change frame detection
- WikiEngine - merges new content into existing entities; keeps version history; preserves source attribution per claim
- MCP server - exposes 25+ tools for query, edit, and discovery
The merge step is what differentiates this from "another RAG over YouTube." A vector chunk store with ten MCP talks gives you ten piles of near-duplicate chunks. mcptube's wiki gives you one MCP article with citations to ten sources - and when you ask "how do MCP servers handle auth?" it answers from the merged article, not from whichever chunk happens to score highest on cosine similarity.
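The internals aren't spelled out here, but the shape of per-claim attribution is easy to picture. A minimal sketch, assuming hypothetical Claim and Article types - these are not mcptube's actual classes, and the real engine presumably matches claims with an LLM rather than string equality:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    sources: list[str]          # video IDs backing this claim

@dataclass
class Article:
    title: str
    claims: list[Claim] = field(default_factory=list)
    revisions: list[list[Claim]] = field(default_factory=list)  # version history

    def merge(self, incoming: list[Claim]) -> None:
        # Snapshot current state so a bad merge can be audited and rolled back.
        self.revisions.append([Claim(c.text, list(c.sources)) for c in self.claims])
        by_text = {c.text: c for c in self.claims}
        for claim in incoming:
            if claim.text in by_text:
                # Duplicate claim from a new talk: union the sources.
                existing = by_text[claim.text]
                existing.sources = sorted(set(existing.sources) | set(claim.sources))
            else:
                self.claims.append(claim)
                by_text[claim.text] = claim
```

The point of the sketch is the invariant, not the matching logic: every claim carries its own source list, and every merge leaves an auditable revision behind.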
Tools the MCP exposes
The headline subset:
- add_video, list_videos, discover_videos - ingest and inventory
- wiki_list, wiki_show, wiki_search, wiki_ask - read-side
- get_frame, classify_video, generate_report - the vision-aware bits
- Plus a bunch more for batch ops, attribution lookup, and history
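From a client's perspective these are ordinary MCP tool calls. A sketch of what asking the wiki looks like on the wire - the tools/call envelope is standard MCP JSON-RPC, but the "question" argument name is an assumption, not confirmed from mcptube's tool schema:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "wiki_ask",
    "arguments": { "question": "how do MCP servers handle auth?" }
  }
}
```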
Hybrid retrieval: FTS5 keyword search narrows the candidate set, then the LLM reasons over the narrowed view. This is the right shape for a wiki - keyword search for "where does this concept live" plus LLM reasoning for "what does it actually say" - and avoids the embedding-only failure mode of confidently retrieving the wrong chunk.
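A minimal sketch of the keyword half using Python's stdlib sqlite3 with an FTS5 virtual table - illustrative only; the table name and schema are assumptions, not mcptube's actual database:

```python
import sqlite3

conn = sqlite3.connect("wiki.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS articles USING fts5(title, body)")

def candidates(query: str, k: int = 5) -> list[tuple[str, str]]:
    # Keyword stage: FTS5 narrows the corpus; ORDER BY rank is FTS5's
    # built-in BM25 ordering, snippet() returns a highlighted excerpt.
    return conn.execute(
        "SELECT title, snippet(articles, 1, '[', ']', '...', 12) "
        "FROM articles WHERE articles MATCH ? ORDER BY rank LIMIT ?",
        (query, k),
    ).fetchall()

# Reasoning stage: hand only the narrowed set to the LLM as context.
context = "\n\n".join(f"{title}: {snip}" for title, snip in candidates("auth"))
```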
Install
pipx install mcptube --python python3.12
mcptube --help
Requires Python 3.12+ and ffmpeg on the path. MCP client integrations for Claude Desktop, VS Code Copilot, Cursor, Windsurf, and Gemini CLI are wired up; standard MCP config blocks for the rest.
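For clients that just take a config block, the standard shape is the usual mcpServers entry. The bare mcptube command below assumes the package puts a stdio server on the PATH; check mcptube --help for the actual invocation and any required args:

```json
{
  "mcpServers": {
    "mcptube": {
      "command": "mcptube",
      "args": []
    }
  }
}
```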
Why scene-change beats fixed-interval frame extraction
The lazy approach - sample one frame every 30 seconds - misses content. A talk where the speaker pulls up a code panel for 8 seconds, scrolls through three diffs, and switches back to slides will lose all three diffs to the sampling cadence.
Scene-change detection catches them. Visual content (code, slides, architecture diagrams, terminal output) survives into the wiki, which is the difference between "a transcript with timestamps" and "a knowledge base that knows what was on screen."
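mcptube's exact detector isn't documented here, but ffmpeg's built-in scene filter illustrates the principle: keep a frame only when the inter-frame difference score crosses a threshold. A sketch via subprocess - the 0.3 threshold and output layout are arbitrary choices for illustration, not mcptube's settings:

```python
import subprocess

# Emit a PNG only on scene changes (score > 0.3); -vsync vfr stops ffmpeg
# from padding the variable-rate selection back to a constant frame rate.
subprocess.run([
    "ffmpeg", "-i", "talk.mp4",
    "-vf", "select='gt(scene,0.3)'",
    "-vsync", "vfr",
    "frames/%04d.png",
], check=True)
```

Against this, the fixed-interval sampler's failure is obvious: the 8-second code panel produces scene-change hits regardless of when it appears, while a 30-second cadence only catches it by luck.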
When to reach for it
- You watch a lot of technical talks and want them to compound into something searchable. The merge-into-wiki design is the right shape for that.
- You ingest video where the visual matters - conference talks with slides, code walkthroughs, architecture reviews. Scene-change extraction is what makes those usable.
- You want an MCP that does something model-native rather than wrapping an existing API. The WikiEngine is the differentiated piece.
When not to
- You want to summarise a single video. mcptube is overkill for one-off summarisation; standard transcript tools or yt-dlp + an LLM call do that fine.
- Your videos are talking-head only (podcasts, interviews). Without visual content, you're paying for vision that surfaces nothing.
- You need real-time results. Ingestion is batch - download, transcribe, scene-detect, merge - and the wiki compounds in value over time, not on the first video.
Trade-offs
The wiki merge is the value and also the failure mode. If the merge step gets a fact wrong, that wrong fact propagates - subsequent queries see the merged article, not the source. Version history is on by default, which lets you audit, but you do need to actually look. Trust-and-verify, especially on the first few videos.
ffmpeg + scene-change detection is heavier than transcript-only ingestion. A 90-minute talk takes real wall-clock time to process. Don't expect "drop a URL, get answers in 10 seconds" - the ingest pipeline is where the latency lives.
The wiki is local. Persistent, but not multi-user out of the box; if you want a team-shared wiki you're wiring sync up yourself. For a single operator that's fine; for "the team's video knowledge base" it's a project.
Featured in
Claude Code tools, plugins, and integrations
The best tools, MCP servers, and harnesses for getting more out of Claude Code - orchestration, observability, telemetry, and remote control.
MCP servers and Model Context Protocol tools
Production MCP servers, gateways, frameworks, and clients - everything in this directory that speaks the Model Context Protocol.
Memory and knowledge graphs for AI agents
Memory layers, knowledge graphs, and persistent context stores for agents - the substrate underneath useful long-running systems.
Related entries
MemoMind - local GPU-accelerated memory for Claude Code
Local-first memory system for Claude Code with GPU acceleration and zero cloud dependency. Provides persistent agent memory via MCP.
claude-historian-mcp - search past Claude Code chats
MCP server that indexes Claude Code conversation history and exposes it as a searchable tool. Lets agents recall what was decided in past sessions instead of re-deriving it.
vestige - cognitive memory MCP server
Single-binary Rust MCP server that gives agents long-term memory via FSRS-6 spaced repetition, 29 cognitive modules, and a 3D dashboard. Works with Claude, Cursor, JetBrains.
idea-reality-mcp - prior-art MCP for coding agents
MCP server that scans GitHub, Hacker News, npm, PyPI, and Product Hunt before an agent starts building, surfacing whether the idea already exists and at what scale.