QMD is a local-first hybrid-search engine for markdown knowledge bases by Tobias Lütke (Shopify's CEO) — BM25 + vector + LLM reranking with query expansion, all running on-device via node-llama-cpp + GGUF models. Three CLI commands (qmd search = BM25, qmd vsearch = vector, qmd query = hybrid + LLM rerank) plus an MCP server expose query / get / multi_get / status to Claude Code, Claude Desktop, Hermes Agent, and any other MCP client. ~2GB of models auto-download on first run; queries return in ~2-3 seconds with models warm. MIT, TypeScript, 24,467 stars / 1,539 forks at fetch (2026-05-09); created 2025-12-08; last push 2026-05-03. This wiki runs on QMD as its primary retrieval layer — installed 2026-05-04, indexing the wiki/ collection (currently 298 docs / 8,497 vectors) — and the wiki's CLAUDE.md instructs every Query / Lint / Cross-link / file-back-chain operation to call mcp__qmd__qmd_query before falling back to grep. The article exists to canonicalize that dependency.
Key Takeaways
Author identity is not incidental. Tobias Lütke is Shopify's CEO. He posted QMD to Hacker News personally as xal in February 2026: "hi orange page, author here. QMD is an implementation of the best practices that I picked up in meetings with teams that work in search and retrieval. I tried to make it not overkill and keep things local." He is the sole maintainer; the project is explicitly a hobby side-project. Adoption follows the author — Addy Osmani recommended it on LinkedIn, Hermes Agent (Nous Research / Teknium) ships an official skill wrapper, Raycast has an extension, and the pi coding agent has an extension. The Shopify-CEO factor is what took this from a 25-point HN post to 24k stars in 5 months.
Hybrid pipeline is the substantive engineering claim. Five stages: (1) LLM query expansion (a 1.7B fine-tuned model) generates 2 query variants alongside the original. (2) Parallel BM25 + vector retrieval for each of the 3 query variants → 6 ranked lists. (3) Reciprocal Rank Fusion with k=60, original-query weight ×2, and a top-rank bonus (+0.05 for #1, +0.02 for ranks 2-3) → top 30 candidates. (4) LLM re-ranking via qwen3-reranker-0.6b (yes/no classification with logprobs). (5) Position-aware blending — ranks 1-3 weight 75% RRF / 25% reranker; ranks 4-10, 60/40; ranks 11+, 40/60. The position-aware blend is the "exact keyword matches don't get buried by the reranker" insurance policy — a real failure mode in pure-rerank pipelines.
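A minimal TypeScript sketch of the fusion arithmetic in stages 3 and 5, as described above. Illustrative only, not QMD's actual code; whether the top-rank bonus applies per list (as here) or after fusion is an assumption, and both score scales are assumed normalized to [0, 1] before blending.

```typescript
// One ranked list per (query variant x retrieval mode): 3 variants x 2 modes = 6 lists.
type RankedList = { docs: string[]; isOriginal: boolean };

function rrfFuse(lists: RankedList[], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const { docs, isOriginal } of lists) {
    const weight = isOriginal ? 2 : 1; // original-query weight x2
    docs.forEach((doc, rank) => {
      let s = weight / (k + rank + 1); // classic RRF term with k=60
      if (rank === 0) s += 0.05;       // top-rank bonus for #1
      else if (rank <= 2) s += 0.02;   // bonus for ranks 2-3
      scores.set(doc, (scores.get(doc) ?? 0) + s);
    });
  }
  return scores; // take the top 30 by score as rerank candidates
}

// Stage 5: position-aware blend. Lexical (RRF) signal dominates at the
// top of the list; semantic (reranker) signal dominates further down.
function blend(rrfRank: number, rrfScore: number, rerankScore: number): number {
  const w = rrfRank <= 3 ? 0.75 : rrfRank <= 10 ? 0.6 : 0.4;
  return w * rrfScore + (1 - w) * rerankScore;
}
```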
AST-aware chunking via tree-sitter for code files. Not just markdown boundary detection (headings, code fences, paragraphs at ~900 tokens with 15% overlap) — for code files in TypeScript / TSX / JavaScript / Python / Go / Rust, QMD parses the AST and chunks at function and class boundaries. Same engineering choice GitNexus and Graphify make for code analysis, but applied to retrieval rather than topology.
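For intuition, a sketch of AST-boundary chunking with the node tree-sitter bindings (tree-sitter + tree-sitter-typescript). This is not QMD's chunker: the boundary node types below are assumptions, and QMD may use different bindings or grammars.

```typescript
import Parser from "tree-sitter";
import TypeScriptGrammars from "tree-sitter-typescript";

// Assumed set of top-level node types worth treating as chunk boundaries.
const BOUNDARIES = new Set([
  "function_declaration",
  "class_declaration",
  "lexical_declaration", // const foo = () => { ... }
]);

function chunkByAst(source: string): string[] {
  const parser = new Parser();
  parser.setLanguage(TypeScriptGrammars.typescript);
  const tree = parser.parse(source);
  const chunks: string[] = [];
  for (const node of tree.rootNode.children) {
    if (BOUNDARIES.has(node.type)) {
      // One chunk per top-level function/class, never cut mid-body.
      chunks.push(source.slice(node.startIndex, node.endIndex));
    }
  }
  return chunks;
}
```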
All-local. No API calls. ~2GB of models auto-download once.
Embeddings: embeddinggemma-300M (default; 300M params). The QMD_EMBED_MODEL env var swaps in alternatives — Qwen3-Embedding covers 119 languages including CJK for multilingual vaults.
Query expansion: the fine-tuned 1.7B model from pipeline stage 1.
Reranker: qwen3-reranker-0.6b from pipeline stage 4 (swappable via QMD_RERANKER_MODEL).
All three run via node-llama-cpp + GGUF format. No OpenAI key, no Anthropic key, no telemetry.
Index storage = SQLite + FTS5. BM25 is implemented via SQLite's FTS5 virtual-table extension — which explains the "Homebrew SQLite required on macOS" prerequisite (Apple's bundled SQLite ships without FTS5 / loadable extensions). Vectors are stored alongside in the same SQLite DB. A single-file index is portable, backup-friendly, and trivially git-trackable if you want.
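The storage mechanism in miniature: an FTS5 virtual table ranked with SQLite's built-in bm25() function, via better-sqlite3. Table and column names here are hypothetical, not QMD's actual schema.

```typescript
import Database from "better-sqlite3";

const db = new Database(":memory:");
db.exec(`CREATE VIRTUAL TABLE chunks USING fts5(path, body)`);

const insert = db.prepare(`INSERT INTO chunks (path, body) VALUES (?, ?)`);
insert.run("wiki/qmd.md", "hybrid search: BM25 plus vectors plus LLM rerank");
insert.run("wiki/graphify.md", "entity-relationship graph over code and docs");

// bm25() returns a rank where lower is better, so ORDER BY ascending.
const rows = db
  .prepare(
    `SELECT path, bm25(chunks) AS rank FROM chunks WHERE chunks MATCH ? ORDER BY rank`
  )
  .all("bm25");
console.log(rows); // [{ path: 'wiki/qmd.md', rank: ... }]
```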
Three operating modes serve three different use cases.
qmd search — pure BM25, no models loaded, near-instant. Use for exact-string lookups and code identifiers.
qmd vsearch — vector-only, ~3 seconds with models warm. Use for “I know I wrote about this but can’t remember the exact words.”
qmd query — full hybrid pipeline (expansion → BM25+vec → RRF → rerank → blend), ~2-3 seconds warm. The recommended default. The MCP query tool accepts a structured searches array with per-sub-query type (lex / vec / hyde) — the agent picks the retrieval mode per sub-query. This enables the synthadoc-style decomposition pattern the wiki’s own CLAUDE.md Query operation now follows.
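A plausible shape for a decomposed qmd_query call, based on the structured-searches description above. Only the searches array, the type / query fields, and the collections field come from this article; everything else about the real schema should be verified against the tool's own docs.

```json
{
  "searches": [
    { "type": "lex", "query": "mcp__qmd__qmd_query" },
    { "type": "vec", "query": "how does the wiki decompose multi-entity questions" },
    { "type": "hyde", "query": "QMD falls back to grep only for exact-string matches" }
  ],
  "collections": ["karpathy-wiki"]
}
```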
MCP tool surface is deliberately minimal. Four tools — query, get, multi_get, status. Compare to GitNexus’s 16 MCP tools and Graphify’s 4. QMD’s surface is small because the retrieval pipeline does the heavy lifting; the agent’s job is to ask questions, not to specify retrieval strategy. The query tool’s structured-JSON interface (searches: [{type, query}]) is where the expressiveness lives.
MCP server: qmd mcp starts the stdio transport (for Claude Code / Hermes Agent / any local MCP client). An HTTP transport is also supported for shared, long-lived model loading across requests — recommended for frequent users because models stay warm in memory, holding query latency at a consistent ~2-3 seconds.
Multi-collection support is first-class, not bolted on. qmd collection add lets you index multiple folders separately under different names; queries can target one collection (collections: ["wiki"]) or run across all. This user runs two collections — karpathy-wiki (298 files) and weomarketly-wiki (113 files) — and the MCP server's session-start instructions distinguish them explicitly. That means one QMD install spans personal vault + agency vault + project docs without separate indexes per workspace.
qmd context add attaches human-written summaries to collections / paths. A descriptive metadata layer that helps the LLM reranker understand what kind of content this collection is, improving relevance scoring. Worth populating once per collection — otherwise the reranker is judging in the abstract.
Bundled SKILL.md ships in the npm package. qmd skill show displays it; qmd skill install writes it to ~/.claude/skills/. The skill is MIT, by @tobi, version 2.0.0, with allowed-tools: Bash(qmd:*), mcp__qmd__*. A self-installing skill is a clean pattern — most third-party tools require manual SKILL.md authoring; QMD ships its own.
Output formats --json and --files are designed for agentic workflows. --json returns structured results (path, score, snippet, metadata); --files returns just the path list, suitable for piping into other commands. Same pattern last30days-skill uses for agent consumption — "the CLI's output format is the agent's API."
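A hedged sketch of consuming qmd search --json from a script. The result schema below is assumed from the fields named above (path, score, snippet, metadata); check the real output shape before relying on it.

```typescript
import { execFileSync } from "node:child_process";

// Assumed result shape -- not a documented schema.
interface QmdResult {
  path: string;
  score: number;
  snippet: string;
  metadata?: Record<string, unknown>;
}

const raw = execFileSync(
  "qmd",
  ["search", "position-aware blending", "--json"],
  { encoding: "utf8" }
);
const results: QmdResult[] = JSON.parse(raw);
for (const r of results) {
  console.log(r.score.toFixed(3), r.path); // ranked paths for the agent
}
```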
Position-aware blending is a non-obvious calibration that prevents a real failure mode. Pure-LLM-rerank pipelines tend to bury exact keyword matches under semantically-similar-but-wrong results, because the reranker doesn’t know the user actually meant that string. QMD’s tiered RRF/reranker weighting (75% RRF for top-3, 60% for 4-10, 40% for 11+) means: at the top of the results, lexical signal dominates; further down, semantic signal dominates. Engineering taste call — most open-source RAG systems don’t do this and ship worse default ranking as a result.
Performance discipline — --max-docs-per-batch and --max-batch-mb for embedding generation. Memory-bounded indexing at scale. Important when first-indexing a 1000-document vault on a laptop — without these caps the embedding step can OOM the Node process.
qmd bench <fixture.json> is a search-quality testbench. Run benchmarks against a fixture file to evaluate retrieval quality after model swaps or index tweaks. Eval-first discipline shipped in the CLI — most retrieval tools don’t include their own quality eval; QMD does.
Why this article exists — wiki retrieval substrate
This wiki has used QMD as its primary retrieval layer since 2026-05-04. The wiki’s CLAUDE.md § Wiki Retrieval enforces:
qmd query "<question>" first. QMD is installed as an MCP server (mcp__qmd__* tools) and indexes the wiki/ directory only — collection name karpathy-wiki, 297 docs / 1597 chunks as of 2026-05-04. Hybrid BM25+vector+LLM-rerank, fully local, free.
Decompose multi-entity questions BEFORE retrieval. Split into 1-N focused sub-queries (cap=4) and run them in parallel via mcp__qmd__qmd_query.
Fall back to Grep only if QMD returns nothing relevant or for exact-string matches.
Never grep raw/ or ai-research/ for general questions. QMD does not index those layers.
The article exists to canonicalize that dependency — until 2026-05-09 the wiki’s primary retrieval layer was referenced in CLAUDE.md and memory but had no article of its own, which is exactly the kind of “missing concept” lint check #4 is supposed to catch.
The wiki’s Karpathy Techniques for Claude Code article frames the retrieval-tradeoff thesis: links over similarity, tokens-only cost, scale ceiling at hundreds of articles. QMD raises the ceiling — at 297 docs / 1,597 chunks today (and projected to push past 500), grep over wiki/ returns too many false-positive matches per query, burning context and degrading synthesis. QMD trades that grep-noise problem for a model-startup cost (~2GB on disk, ~2-3 seconds per query warm).
Karpathy wiki retrieval (markdown): QMD is the canonical answer. Already deployed.
~/Auto1111/hermes-agent/ codebase analysis: Graphify (MIT, multi-language, includes docs alongside code).
Targeted code-impact / refactoring agent: GitNexus's 16 MCP tools win for this specific use case (license-permitting).
Cross-tool stack: QMD for the wiki + Graphify for the codebase + grep for exact-string lookups. They compose; they don’t compete.
Where this fits in the wiki
Substrate layer — QMD is underneath every Query operation in CLAUDE.md. The wiki’s synthadoc-borrowed query-decomposition pattern (split multi-entity questions into 1-4 sub-queries, run in parallel) presupposes QMD is doing the per-sub-query retrieval.
Adjacent to Graphify and GitNexus — three tools, three retrieval geometries, three domains. QMD = text retrieval over markdown. Graphify = entity-relationship graph over code+docs. GitNexus = code-graph + Graph RAG agent. Same article-cluster, different jobs.
Cross-listed pattern with synthadoc — both projects encode the “indexed knowledge structure + agent query layer” pattern. QMD is generic and tool-agnostic (any markdown vault); synthadoc is a Python engine + Obsidian plugin (5-pass IngestAgent, status-frontmatter, query decomposition, SQLite audit DB, 7 LLM providers) tightly bound to its own ingest discipline.
Hermes ecosystem cross-reference: Hermes Agent (Nous Research / Teknium) ships an official optional/research/qmd skill (v1.0.0, MIT). The Hermes wrapper exposes 5 MCP tool names (mcp_qmd_search, mcp_qmd_vsearch, mcp_qmd_deep_search, mcp_qmd_get, mcp_qmd_status) — slightly different naming than the bundled mcp__qmd__* Claude Code wrapper. Composes with Hermes deployments.
Composes with Claude Managed Agents — a Managed Agent could call QMD’s MCP server for retrieval inside long-running workflows. The HTTP-transport mode is specifically designed for shared, long-lived model loading across requests.
Pairs with The Expanding Toolkit (Lucas) — Anthropic’s “scaffolding moves into the model” thesis. QMD is local-scaffolding-not-in-the-model; the model gets the retrieved context as input and decides what to do with it.
Sibling pattern to last30days-skill — both are measurement-first tools (last30days quantifies ranked engagement; QMD ships its own bench subcommand for retrieval-quality eval). Both ship CLI-first with agent integration as a wrapper, validating the Printing Press thesis (CLI tier 1 / API tier 2 / MCP tier 3).
Calibrates Karpathy’s wiki-vs-semantic-RAG tradeoff — Karpathy’s pattern starts with grep-and-links; the wiki adopted QMD as the scale-ceiling fix without abandoning the wikilink layer. Both layers are load-bearing in production: wikilinks for navigation + curated cross-references; QMD for question-answering and gap-detection.
Implementation
Tool/Service: QMD (tobi/qmd v2.1.0) — local hybrid-search MCP for markdown.
Setup:
Install (npm): npm install -g @tobilu/qmd (note: the package name is @tobilu/qmd, not qmd).
Tool: free, MIT, no commercial licensing friction.
Disk: ~2GB for models + ~60MB per ~300-doc collection (this user’s karpathy-wiki index is 62.1MB across 411 docs, 8497 vectors).
Compute: none ongoing — all local. CPU/GPU only during query and embedding.
No API tokens. Zero LLM cost regardless of query volume.
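Claude Code registration: a plausible .mcp.json entry, assuming qmd mcp is the stdio entry point as described above; verify against the README before copying.

```json
{
  "mcpServers": {
    "qmd": {
      "command": "qmd",
      "args": ["mcp"]
    }
  }
}
```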
Integration notes:
Single-file SQLite index at ~/.cache/qmd/index.sqlite (per-user). Trivially backup / sync.
AST chunking is active for ts/tsx/js/python/go/rust files — code identifiers are indexed at function/class granularity, not paragraph-cut.
Daemon mode (HTTP) keeps models warm in memory across queries — drops latency from cold-start ~10s to warm ~2-3s. Recommended for any session that issues 5+ queries.
Multi-collection queries target by name (collections: ["wiki", "weomarketly-wiki"]) — agents can scope retrieval per question.
Structured query JSON lets agents specify retrieval mode per sub-query: {type: "lex", query: "..."} for exact strings, {type: "vec", query: "..."} for semantic, {type: "hyde", query: "..."} for hypothetical-document expansion.
qmd context add — attach a human-written summary to a collection or path. Improves reranker calibration. Cheap to populate; high marginal value.
qmd bench — search-quality fixture testing. Use it after model swaps (e.g., when changing QMD_EMBED_MODEL for multilingual).
qmd cleanup — clear caches, vacuum DB. Run quarterly or after deleting many docs.
Models customizable via env vars — QMD_EMBED_MODEL for embeddings, QMD_RERANKER_MODEL for the reranker. Multilingual = swap to Qwen3-Embedding (119 languages including CJK).
Open Questions
Reranker quality on technical / code-heavy content. qwen3-reranker-0.6b is a small reranker. For dense technical content (API references, code-with-prose mix), does it outperform the no-rerank baseline by enough to justify the latency? qmd bench is the answer mechanism, but no public benchmark numbers ship with the README.
Index re-build cost at scale. The user’s wiki is 411 docs / 62MB indexed. At 5,000 docs how long does qmd update take, and is update truly incremental (delta only) or does it re-tokenize the whole collection?
HTTP daemon stability. Long-lived process holding 2GB of models in memory — what’s the failure mode at 24-hour uptime? Is there a built-in restart heuristic or does it leak / OOM?
Multilingual rerank quality. Swapping QMD_EMBED_MODEL to Qwen3-Embedding extends embeddings to 119 languages, but the reranker stays on Qwen3-Reranker-0.6B. Does the reranker handle non-English content well, or is that a known weak point?
Comparison vs cloud retrieval (e.g., Voyage-3, Cohere Rerank, OpenAI text-embedding-3). Honest local-vs-cloud quality numbers — most of the local-RAG space waves at this. QMD’s local-only architecture means zero ongoing cost but the quality ceiling is what it is. A qmd bench fixture run against the same dataset with OpenAI text-embedding-3-large would settle the question.
Tobi’s roadmap intent. Solo-maintained side-project from a CEO with a day job — what’s the maintenance trajectory? Per the HN thread the author is “working on finetuning better models for query extension and reranking (finetune branch)”, but the long-term post-launch cadence is harder to predict.
Multi-collection cross-query relevance. When a query spans collections, does the reranker preserve per-collection context or flatten into one ranked list? The Hermes wrapper’s structured JSON has a collections: ["..."] field per-search; whether the reranker treats them differently isn’t documented.
Try It
One-command install + smoke test. brew install qmd && qmd collection add notes ~/notes --pattern "**/*.md" && qmd embed && qmd query "<question about your notes>". Total time including the ~2GB model download: ~10 minutes on a decent connection.
Wire it into a Claude Code project. Drop the .mcp.json snippet from the Implementation section into your project root, restart Claude Code, then ask the assistant a question that requires retrieval. The MCP qmd server will appear in tool discovery; the assistant uses mcp__qmd__qmd_query automatically when relevant.
Compare grep vs qmd query on the same question. Pick a vague question against a 100+-doc markdown vault. Run grep -ri "..." and qmd query "..." side-by-side. Note where each wins. (Spoiler: grep is unbeatable for exact strings; QMD wins for “I know I wrote about this but can’t remember the words.”)
Add a context to a collection. qmd context add <collection> --description "Personal AI engineering notes; mixes Claude Code tooling, prompt engineering, and applied case studies". Re-run a query and note whether reranker scores shift.
Run the bench harness. Build a 20-question fixture file ({question, expected_paths} — a sketch follows this list), then run qmd bench fixture.json. Compare scores before and after swapping QMD_EMBED_MODEL to gauge model-swap impact.
Daemon mode for active sessions. Start the HTTP daemon (qmd mcp --transport http --port 7799), point your client at it. Models stay warm; query latency drops to ~2-3s consistently. Worth it once your session issues more than ~5 queries.
For Hermes Agent users: hermes skills install official/research/qmd and use the wrapper's mcp_qmd_deep_search for hybrid queries. Same engine, slightly different tool naming.
For the karpathy wiki specifically: the index already exists. Refresh after every ingest with qmd update && qmd embed (or use bin/post-ingest which wraps it). Live status via qmd status. Per-query examples live in this wiki’s CLAUDE.md § Wiki Retrieval and § Query operation decomposition pattern.
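Bench fixture sketch, referenced in the bench item above. Field names come from the {question, expected_paths} shape quoted there; the exact schema qmd bench expects is unverified.

```json
[
  {
    "question": "which env var swaps the embedding model?",
    "expected_paths": ["wiki/qmd.md"]
  },
  {
    "question": "how does position-aware blending avoid burying exact matches?",
    "expected_paths": ["wiki/qmd.md"]
  }
]
```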
Related
Graphify — Cross-Harness Knowledge-Graph Skill — sister tool in the “indexed knowledge layer” niche; entity-relationship graph instead of ranked retrieval, code+docs instead of markdown-only
synthadoc — axoviq-ai vault engine — encodes the indexed-knowledge-structure pattern with its own ingest discipline; QMD is the substrate, synthadoc is one possible authoring layer above it