Source: ai-research/synthadoc-axoviq-readme.md Repo: github.com/axoviq-ai/synthadoc Stars: 231 / License: AGPL-3.0 / Released: v0.3.0 on 2026-05-04 Languages: Python 85.7% / TypeScript 14.3%
Synthadoc is a formalized engine implementation of the Karpathy LLM-wiki pattern: an open-source Python service that compiles raw documents into structured local Markdown wikis at ingest time, with explicit contradiction detection, orphan flagging, and a job-queue/audit-DB architecture. It was released as v0.3.0 on 2026-05-04 by axoviq-ai. The README opens by quoting Karpathy’s gist directly: “The LLM should be able to maintain a wiki for you.”
Compared to the Stride starter vault (a minimal Obsidian template with a 4-operation CLAUDE.md), synthadoc sits at the architecturally complete end of the spectrum: a Python service, HTTP API, background job worker, multi-LLM provider abstraction, SQLite audit DB, and OpenTelemetry hooks.
Key Takeaways
- Ingest-time synthesis, not query-time RAG. The README frames this as the project’s core differentiation: “compiles knowledge at ingest time. Every new source enriches and cross-links the entire corpus, not just appends a new chunk.” Same thesis as Karpathy’s gist and as this vault.
- 5-pass IngestAgent pipeline. Vision → Analysis → Candidate search (BM25) → Decision → Write. Pass 3 reads the analysis summary plus retrieved candidates plus AGENTS.md scope and can output `flag_contradiction`, which transitions the page’s frontmatter `status` to `contradicted` and preserves both old and new claims with ⚠ markers.
- Frontmatter `status` field. `active | contradicted | archived`: a single field that lets Dataview, lint, and audit queries find conflicted pages without scanning bodies. This vault adopted the field on 2026-05-05 (see Karpathy wiki additions from synthadoc).
- Query decomposition. `search_decompose_agent.py` splits compound questions into 1-N sub-questions (cap=4) and runs them in parallel. Identical pattern for web search via Tavily. Avoids the relevance-collision failure where multi-entity questions only return articles about the most-frequent entity. This vault adopted the pattern as a Query operation behavior on the same date.
- AGENTS.md per wiki. A file containing “LLM instructions for this domain” prepended to ingest decision prompts. Scope-based filtering without per-agent branching. This vault adopted it as an optional per-topic file, not a single global one; see the AGENTS.md section in vault `CLAUDE.md`.
- Hooks system. Shell commands triggered on `on_ingest_complete` and `on_lint_complete` events with a JSON payload on stdin (event, wiki, source, pages_created, pages_updated, tokens, cost_usd). Blocking or non-blocking.
- Three-layer cache. Embedding + LLM response + provider prompt cache. Each layer addresses a different repeat-cost class. Less load-bearing for our setup since we run on Claude Code subscription, but the layering insight transfers.
- Audit trail. Three artifacts: human-readable `log.md`, JSON-lines `synthadoc.log` (rotates by size, jq-filterable), and append-only `audit.db` SQLite. Tables: `ingest_log`, `audit_events`, `queries`. Optional OpenTelemetry OTLP backend for traces/metrics. This vault now generates a similar `.audit.db` via `bin/build-audit-db` from the existing `.manifest.json` + `log.md` + `questions.md`.
- Obsidian plugin. TypeScript plugin built into the repo for native Obsidian integration alongside the CLI. Auto-generated Dataview dashboard for any new wiki.
- Multi-source ingest. PDF, PPTX, XLSX, OCR images, Markdown, URLs, YouTube transcripts, Tavily web searches, plus a manifest-file batch format. Same content surface as our `raw/` + `ai-research/` ingest.
- Multi-LLM support. 7 providers: Gemini Flash (free 1M tokens/day), Groq (free, rate-limited), Ollama (local), MiniMax, DeepSeek, Anthropic, OpenAI. Plus Claude Code / Opencode CLI subscriptions as zero-API-key providers (the model their docs nudge for the free tier).
- Auto-resolution at ≥85% confidence. LintAgent auto-resolves contradictions above the threshold; below it the conflict stays flagged with `status: contradicted` for human review. This is the part of synthadoc’s discipline we did NOT adopt: every contradiction stays for human resolution in our flow.
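
The hooks bullet above lists the fields a hook receives as JSON on stdin. A minimal sketch of a hook script consuming that payload could look like the following. The field names are taken from the list above; the helper name `summarize_hook_payload`, the assumed field types (counts and numbers), and the summary format are illustrative assumptions, not synthadoc's actual API.

```python
import io
import json
from typing import TextIO

# Field names as listed in the hooks description above; their types are assumed.
EXPECTED_FIELDS = (
    "event", "wiki", "source",
    "pages_created", "pages_updated", "tokens", "cost_usd",
)


def summarize_hook_payload(stream: TextIO) -> str:
    """Read one JSON payload from `stream` and return a one-line summary.

    Hypothetical helper: a real hook script would pass sys.stdin here.
    """
    payload = json.load(stream)
    missing = [name for name in EXPECTED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"payload is missing fields: {', '.join(missing)}")
    return (
        f"[{payload['event']}] wiki={payload['wiki']} source={payload['source']} "
        f"created={payload['pages_created']} updated={payload['pages_updated']} "
        f"tokens={payload['tokens']} cost=${payload['cost_usd']:.4f}"
    )


# Example payload with made-up values, fed through an in-memory stream.
example = io.StringIO(json.dumps({
    "event": "on_ingest_complete",
    "wiki": "history-of-computing",
    "source": "raw/example.pdf",
    "pages_created": 2,
    "pages_updated": 5,
    "tokens": 18250,
    "cost_usd": 0.0312,
}))
print(summarize_hook_payload(example))
```

In a real hook the script would call `summarize_hook_payload(sys.stdin)` and, for instance, append the resulting line to a log file.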
Why this matters for our wiki
Synthadoc’s release on 2026-05-04 is the most architecturally complete public reference for the Karpathy pattern to date. It validates several of our existing choices (ingest-time synthesis, append-only operation log, contradiction callouts, Obsidian as the IDE) and surfaces specific patterns we’d otherwise have had to invent independently. Five concrete improvements landed in this vault on 2026-05-05 directly informed by reading their docs/design.md:
- `status: active | contradicted | archived` frontmatter field
- Query decomposition behavior in the Query operation
- `bin/lint-stale-sources` Python script (modeled on their LintAgent stale check)
- `bin/build-audit-db` SQLite audit DB build script (modeled on their `audit.db` schema)
- Optional per-topic `AGENTS.md` pattern
See Karpathy wiki additions from synthadoc for the full delta and rationale.
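
To illustrate why a single `status` field is convenient for lint-style checks, here is a minimal sketch that lists pages whose frontmatter marks them as contradicted. It assumes pages are Markdown files that open with a `---`-delimited frontmatter block containing a plain `status: value` line. The function names and the line-based parsing (used here instead of a YAML library) are illustrative choices, not this vault's or synthadoc's actual implementation.

```python
import tempfile
from pathlib import Path
from typing import List, Optional

VALID_STATUSES = {"active", "contradicted", "archived"}


def read_status(text: str) -> Optional[str]:
    """Return the frontmatter `status` value, or None if it is absent.

    Only the leading '---' ... '---' block is inspected, so the page
    body is never interpreted.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":  # end of the frontmatter block
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "status":
            return value.strip().strip("\"'")
    return None


def find_pages_with_status(root: Path, status: str) -> List[Path]:
    """List Markdown files under `root` whose status equals `status`."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    return sorted(
        path for path in root.rglob("*.md")
        if read_status(path.read_text(encoding="utf-8")) == status
    )


# Demo on a throwaway directory with two hypothetical pages.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.md").write_text(
        "---\ntitle: A\nstatus: contradicted\n---\nBody\n", encoding="utf-8")
    (root / "b.md").write_text(
        "---\nstatus: active\n---\nstatus: contradicted\n", encoding="utf-8")
    for page in find_pages_with_status(root, "contradicted"):
        print(page.name)
```

The second demo page mentions `status: contradicted` only in its body, so it is not reported; the check reads the frontmatter field alone.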
Compared to Stride starter vault
| | Stride starter | Synthadoc | This vault |
|---|---|---|---|
| Form factor | Obsidian template | Python engine + plugin | Obsidian vault + Quartz site |
| Operations defined | 4 (Ingest/Research/Query/Lint) | ~10 (CLI commands) | 11 (vault CLAUDE.md) |
| Contradiction handling | Mentioned in CLAUDE.md | `status: contradicted` + auto-resolve | `[!contradiction]` callout + `status: contradicted` |
| Audit trail | log.md only | log.md + JSON-lines + SQLite | log.md + JSON manifest + SQLite (new) |
| Job queue | None | Background worker, retryable | Claude Code session-bound |
| Multi-LLM | Implicit (whatever runs CLAUDE.md) | 7 providers + 2 CLI subscriptions | Claude Code only |
| Live publishing | None | None | Quartz → Cloudflare Worker |
| Stars | 9 | 231 | private |
Try It
- Read their `docs/design.md`: the most substantive Karpathy-pattern engineering writeup public to date. Especially the IngestAgent pass-by-pass description and the audit-DB schema.
- Compare AGENTS.md vs our topic `_index.md`: our index files are descriptive (what’s in the topic); theirs are directive (how to ingest into the topic). Decide per topic whether a directive file adds enough value to maintain.
- Steal their `--demo` install pattern: `synthadoc install history-of-computing --demo` ships 13 prebuilt pages + bootstrap scaffold. We don’t currently have a way to publish a “starter” version of this vault for someone wanting to clone the pattern. Their pattern shows how.
- Ignore their three-layer cache for now: embedding cache and LLM-response cache require running the model directly. We use Claude Code as the runtime, which already does provider prompt cache. Layers 1 and 2 only become relevant if we ever swap to a programmatic provider.
- Watch for v0.4: v0.3.0 was released on 2026-05-04, so the project is still moving fast. The `axoviq-ai/synthadoc` repo is worth a quarterly check.
Open Questions
- How does their auto-resolution at ≥85% confidence actually work in practice? The README mentions the threshold but `docs/design.md` § 4 doesn’t expose the rubric. Worth a deeper read.
- What does `audit_events` track that’s not already in `ingest_log` or `queries`? The schema is referenced but not enumerated.
- Is the Obsidian plugin shipped as a bundled `.zip` for community-plugin install, or only buildable from source?
Related
- joshpocock-vault — minimal Obsidian-template implementation of the same pattern (the other end of the spectrum)
- from-vibe-coding-to-agentic-engineering — Karpathy’s Sequoia talk that explicitly endorses LLM knowledge bases as understanding tools
- wiki-community-enhancements — broader ecosystem survey of Karpathy-pattern variants
- karpathy-techniques-for-claude-code — applying Karpathy’s patterns to Claude Code specifically
- karpathy-vault-additions-from-synthadoc — the specific 5 improvements this vault adopted on 2026-05-05 from reading synthadoc’s design doc