Source: qmd README v2.1.0 (2026-05-09) (github.com/tobi/qmd); Hermes Agent QMD skill docs (2026-05-09)

QMD is Tobias Lütke’s (Shopify CEO) local-first hybrid-search engine for markdown knowledge bases — BM25 + vector + LLM reranking with query expansion, all running on-device via node-llama-cpp + GGUF models. Three CLI commands (qmd search for BM25, qmd vsearch for vector, qmd query for the hybrid pipeline with LLM rerank) plus an MCP server expose query / get / multi_get / status to Claude Code, Claude Desktop, Hermes Agent, and any other MCP client. ~2GB of models auto-download on first run; queries return in ~2-3 seconds with models warm. MIT, TypeScript, 24,467 stars / 1,539 forks at fetch (2026-05-09), created 2025-12-08, last push 2026-05-03. This wiki runs on QMD as its primary retrieval layer — installed 2026-05-04, it indexes the wiki/ collection (currently 298 docs / 8,497 vectors), and the wiki’s CLAUDE.md instructs every Query / Lint / Cross-link / file-back-chain operation to call mcp__qmd__qmd_query before falling back to grep. This article exists to canonicalize that dependency.

Key Takeaways

  • Author identity is not incidental. Tobias Lütke is Shopify’s CEO. He posted QMD to Hacker News personally as xal in February 2026: “hi orange page, author here. QMD is an implementation of the best practices that I picked up in meetings with teams that work in search and retrieval. I tried to make it not overkill and keep things local.” He is the sole maintainer, and the project is explicitly a side project. Adoption follows the author — Addy Osmani recommended it on LinkedIn, Hermes Agent (Nous Research / Teknium) ships an official skill wrapper, Raycast has an extension, and the pi coding agent has one too. The Shopify-CEO factor is what took this from a 25-point HN post to 24k stars in 5 months.
  • Hybrid pipeline is the substantive engineering claim. Five stages: (1) LLM query expansion (1.7B fine-tuned model) generates 2 query variants alongside the original. (2) Parallel BM25 + vector retrieval for each of the 3 query variants → 6 ranked lists. (3) Reciprocal Rank Fusion with k=60, original-query weight ×2, top-rank bonus (+0.05 for #1, +0.02 for 2-3) → top 30 candidates. (4) LLM re-ranking via qwen3-reranker-0.6b (yes/no classification with logprobs). (5) Position-aware blending — Top 1-3 trust 75% RRF / 25% reranker; Top 4-10 trust 60/40; Top 11+ trust 40/60. The position-aware blend is the “exact keyword matches don’t get buried by the reranker” insurance policy — a real failure mode in pure-rerank pipelines. (The first sketch after this list illustrates the fusion + blend math.)
  • AST-aware chunking via tree-sitter for code files. Not just markdown boundary detection (headings, code fences, paragraphs at ~900 tokens with 15% overlap) — for code files in TypeScript / TSX / JavaScript / Python / Go / Rust, QMD parses the AST and chunks at function and class boundaries. Same engineering choice GitNexus and Graphify make for code analysis, but applied to retrieval rather than topology. (The second sketch after this list illustrates the technique.)
  • All-local. No API calls. ~2GB of models auto-download once.
    • Embeddings: embeddinggemma-300M (default; 300M params). QMD_EMBED_MODEL env var swaps in alternatives — Qwen3-Embedding covers 119 languages including CJK for multilingual vaults.
    • Reranker: qwen3-reranker-0.6b (yes/no logprob classification).
    • Query expansion: fine-tuned 1.7B model.
    • All three run via node-llama-cpp + GGUF format. No OpenAI key, no Anthropic key, no telemetry.
  • Index storage = SQLite + FTS5. BM25 is implemented via SQLite’s FTS5 virtual-table extension, which explains the macOS prerequisite of Homebrew SQLite (Apple’s bundled SQLite ships without FTS5 / loadable-extension support). Vectors are stored alongside in the same SQLite DB. A single-file index is portable, backup-friendly, and trivially git-trackable if you want.
  • Three operating modes serve three different use cases.
    • qmd search — pure BM25, no models loaded, near-instant. Use for exact-string lookups and code identifiers.
    • qmd vsearch — vector-only, ~3 seconds with models warm. Use for “I know I wrote about this but can’t remember the exact words.”
    • qmd query — full hybrid pipeline (expansion → BM25+vec → RRF → rerank → blend), ~2-3 seconds warm. The recommended default. The MCP query tool accepts a structured searches array with per-sub-query type (lex / vec / hyde) — the agent picks the retrieval mode per sub-query. This enables the synthadoc-style decomposition pattern the wiki’s own CLAUDE.md Query operation now follows.
  • MCP tool surface is deliberately minimal. Four tools — query, get, multi_get, status. Compare to GitNexus’s 16 MCP tools and Graphify’s 4. QMD’s surface is small because the retrieval pipeline does the heavy lifting; the agent’s job is to ask questions, not to specify retrieval strategy. The query tool’s structured-JSON interface (searches: [{type, query}]) is where the expressiveness lives.
  • Multiple interfaces — CLI, Node.js/Bun SDK, MCP, daemon.
    • CLI: all primary commands (query / search / vsearch / get / multi-get / status / update / embed / cleanup / bench / collection / context).
    • SDK: createStore() accepts inline config, YAML files, or database-only reopen; methods search(), searchLex(), searchVector(), get(), multiGet(), addCollection(). (The third sketch after this list shows hedged usage.)
    • MCP server: qmd mcp starts stdio transport (for Claude Code / Hermes Agent / any local MCP client). HTTP transport also supported for shared, long-lived model loading across requests — recommended for “frequent users” because models stay warm in memory for consistent ~2-3 second query latency.
  • Multi-collection support is first-class, not bolted-on. qmd collection add lets you index multiple folders separately under different names; queries can target one collection (collections: ["wiki"]) or run across all. This user runs two collections, karpathy-wiki (298 files) and weomarketly-wiki (113 files), and the MCP server’s session-start instructions distinguish them explicitly. This means one QMD install spans personal vault + agency vault + project docs without separate indexes per workspace.
  • qmd context add attaches human-written summaries to collections / paths. A descriptive metadata layer that helps the LLM reranker understand what kind of content this collection is, improving relevance scoring. Worth populating once per collection — otherwise the reranker is judging in the abstract.
  • Bundled SKILL.md ships in the npm package. qmd skill show displays it; qmd skill install writes to ~/.claude/skills/. The skill is MIT, by @tobi, version 2.0.0, with allowed-tools: Bash(qmd:*), mcp__qmd__*. Self-installing skill is a clean pattern — most third-party tools require manual SKILL.md authoring; QMD ships its own.
  • Output formats --json and --files are designed for agentic workflows. --json returns structured results (path, score, snippet, metadata); --files returns just the path list, suitable for piping into other commands. Same pattern last30days-skill uses for agent consumption — “the CLI’s output format is the agent’s API.”
  • Position-aware blending is a non-obvious calibration that prevents a real failure mode. Pure-LLM-rerank pipelines tend to bury exact keyword matches under semantically-similar-but-wrong results, because the reranker doesn’t know the user actually meant that string. QMD’s tiered RRF/reranker weighting (75% RRF for top-3, 60% for 4-10, 40% for 11+) means: at the top of the results, lexical signal dominates; further down, semantic signal dominates. Engineering taste call — most open-source RAG systems don’t do this and ship worse default ranking as a result.
  • Performance discipline — --max-docs-per-batch and --max-batch-mb for embedding generation. Memory-bounded indexing at scale. Important when first-indexing a 1000-document vault on a laptop — without these caps the embedding step OOMs the Node process.
  • qmd bench <fixture.json> is a search-quality testbench. Run benchmarks against a fixture file to evaluate retrieval quality after model swaps or index tweaks. Eval-first discipline shipped in the CLI — most retrieval tools don’t include their own quality eval; QMD does.
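
The fusion and blending stages are small enough to sketch. This is a minimal illustration of pipeline stages 3 and 5 under the parameters the README publishes (k=60, ×2 original-query weight, +0.05/+0.02 top-rank bonuses, 75/60/40 tiered blend) — the function names and data shapes are assumptions, not QMD’s internals, and it assumes both score inputs are normalized to a comparable range.

```typescript
// Sketch of RRF fusion (stage 3) + position-aware blending (stage 5).
// Illustrative only — shapes and names are assumptions, not QMD source.

type RankedList = { docIds: string[]; isOriginalQuery: boolean };

function rrfFuse(lists: RankedList[], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    const weight = list.isOriginalQuery ? 2 : 1; // original query counts double
    list.docIds.forEach((id, rank) => {
      let s = weight / (k + rank + 1); // classic RRF term: w / (k + rank)
      if (rank === 0) s += 0.05;       // top-rank bonus for #1
      else if (rank <= 2) s += 0.02;   // smaller bonus for ranks 2-3
      scores.set(id, (scores.get(id) ?? 0) + s);
    });
  }
  return scores; // top 30 of these become rerank candidates
}

// Tiered blend: lexical/RRF signal dominates at the top of the list, the
// LLM reranker dominates further down. Assumes both scores are in [0, 1].
function blend(rrfRank: number, rrfScore: number, rerankScore: number): number {
  const w = rrfRank <= 3 ? 0.75 : rrfRank <= 10 ? 0.6 : 0.4;
  return w * rrfScore + (1 - w) * rerankScore;
}
```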
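
The AST-chunking bullet names a general technique; here is a hedged illustration of function/class-boundary chunking with the tree-sitter npm bindings. It is not QMD’s code — a real chunker would handle more node types (arrow functions, top-level statements) and enforce the ~900-token budget.

```typescript
// Function/class-boundary chunking with tree-sitter — an illustration of
// the technique, not QMD's implementation.
import Parser from "tree-sitter";
import { typescript } from "tree-sitter-typescript";

const parser = new Parser();
parser.setLanguage(typescript);

function astChunks(source: string): string[] {
  const tree = parser.parse(source);
  // Node types that become standalone chunks (a subset, for illustration).
  const boundaries = new Set([
    "function_declaration",
    "class_declaration",
    "method_definition",
  ]);
  const chunks: string[] = [];
  const walk = (node: Parser.SyntaxNode): void => {
    if (boundaries.has(node.type)) {
      chunks.push(source.slice(node.startIndex, node.endIndex));
      return; // the whole function/class is one chunk; don't descend
    }
    node.namedChildren.forEach(walk);
  };
  walk(tree.rootNode);
  return chunks;
}
```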
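
The SDK surface can be sketched from the method names above, but only the names are documented here — option shapes, return types, and whether createStore() is async are all assumptions to verify against the package’s typings.

```typescript
// Hedged SDK sketch. Only createStore/addCollection/search/searchLex are
// named in the docs above; every option and return shape here is assumed.
import { createStore } from "@tobilu/qmd";

const store = await createStore({
  // inline config; a YAML path or database-only reopen is also accepted
});

// Register a collection (CLI equivalent: qmd collection add wiki ~/wiki --pattern "**/*.md").
await store.addCollection("wiki", { path: "~/wiki", pattern: "**/*.md" }); // option names assumed

const hybrid = await store.search("how does position-aware blending work?"); // full pipeline
const exact = await store.searchLex("mcp__qmd__qmd_query");                  // BM25 only
console.log(hybrid, exact);
```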

Why this article exists — wiki retrieval substrate

This wiki has used QMD as its primary retrieval layer since 2026-05-04. The wiki’s CLAUDE.md § Wiki Retrieval enforces:

  1. qmd query "<question>" first. QMD is installed as an MCP server (mcp__qmd__* tools) and indexes the wiki/ directory only — collection name karpathy-wiki, 297 docs / 1,597 chunks as of 2026-05-04. Hybrid BM25+vector+LLM-rerank, fully local, free.
  2. Decompose multi-entity questions BEFORE retrieval. Split into 1-N focused sub-queries (cap=4) and run them in parallel via mcp__qmd__qmd_query. (An example call follows this list.)
  3. Fall back to Grep only if QMD returns nothing relevant or for exact-string matches.
  4. Never grep raw/ or ai-research/ for general questions. QMD does not index those layers.
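
A hedged example of what rule 2 looks like as a single structured mcp__qmd__qmd_query call. The searches array with per-sub-query type is documented in the Key Takeaways, and a per-search collections field is mentioned under Open Questions; any other argument names would be assumptions.

```typescript
// One multi-entity question ("How do QMD and Graphify differ on chunking?")
// decomposed into focused sub-queries, per CLAUDE.md rule 2 (cap = 4).
// Only searches/type/query/collections come from the docs; treat the rest
// of the shape as an assumption.
const args = {
  searches: [
    { type: "lex", query: "tree-sitter AST chunking", collections: ["karpathy-wiki"] },
    { type: "vec", query: "how QMD splits code files into chunks", collections: ["karpathy-wiki"] },
    { type: "vec", query: "Graphify graph construction vs chunking", collections: ["karpathy-wiki"] },
  ],
};
// `args` is passed as the arguments of an mcp__qmd__qmd_query tool call.
```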

The article exists to canonicalize that dependency — until 2026-05-09 the wiki’s primary retrieval layer was referenced in CLAUDE.md and memory but had no article of its own, which is exactly the kind of “missing concept” lint check #4 is supposed to catch.

The wiki’s Karpathy Techniques for Claude Code article frames the retrieval-tradeoff thesis: links over similarity, tokens-only cost, scale ceiling at hundreds of articles. QMD raises the ceiling — at 297 docs / 1,597 chunks today (and projected to push past 500), grep over wiki/ returns too many false-positive matches per query, burning context and degrading synthesis. QMD trades that grep-noise problem for a model-startup cost (~2GB on disk, ~2-3 seconds per query warm).

How it compares to adjacent tools

| Dimension | QMD (tobi) | GitNexus | Graphify |
|---|---|---|---|
| Domain | Markdown text retrieval | Code structural intelligence | Code + docs structural graph |
| Geometry | Hybrid scoring → ranked-list retrieval | Knowledge graph + Graph RAG agent | Knowledge graph + viewer report |
| Stack | TypeScript + node-llama-cpp + GGUF + SQLite/FTS5 | TypeScript + LadybugDB + WebAssembly | Python + tree-sitter + LLM (your assistant’s model) |
| Local vs API | Fully local (auto-download GGUF models) | Fully local (browser-side WASM or CLI) | Code = local (tree-sitter); docs/PDFs/images = LLM API |
| MCP tools | 4 (query, get, multi_get, status) | 16 (impact analysis, multi-file rename, etc.) | 4 (query_graph, get_node, get_neighbors, shortest_path) |
| Output | Ranked snippets (JSON or file paths) | Graph + 16 MCP queries | 3 files (HTML viewer + Markdown report + JSON) |
| License | MIT | PolyForm-Noncommercial | MIT |
| Stars / age | 24,467 / 5 months | 37,048 / 9 months | 45,493 / 5 weeks |
| Author | Tobias Lütke (Shopify CEO) | Abhigyan Patwari (akonlabs.com) | Safi Shamsi (graphifylabs.ai / Penpax) |
| Best fit | Markdown knowledge bases (notes, wikis, docs) | Code-impact analysis, refactoring | Cross-modal corpora (code + docs + media) |

Practical decision rule for this user:

  • Karpathy wiki retrieval (markdown): QMD is the canonical answer. Already deployed.
  • ~/Auto1111/hermes-agent/ codebase analysis: Graphify (MIT, multi-language, includes docs alongside code).
  • Targeted code-impact / refactoring agent: GitNexus’s 16 MCP tools win for this specific use case (license-permitting).
  • Cross-tool stack: QMD for the wiki + Graphify for the codebase + grep for exact-string lookups. They compose; they don’t compete.

Where this fits in the wiki

  • Substrate layer — QMD is underneath every Query operation in CLAUDE.md. The wiki’s synthadoc-borrowed query-decomposition pattern (split multi-entity questions into 1-4 sub-queries, run in parallel) presupposes QMD is doing the per-sub-query retrieval.
  • Adjacent to Graphify and GitNexus — three tools, three retrieval geometries, three domains. QMD = text retrieval over markdown. Graphify = entity-relationship graph over code+docs. GitNexus = code-graph + Graph RAG agent. Same article-cluster, different jobs.
  • Cross-listed pattern with synthadoc — both projects encode the “indexed knowledge structure + agent query layer” pattern. QMD is generic and tool-agnostic (any markdown vault); synthadoc is a Python engine + Obsidian plugin (5-pass IngestAgent, status-frontmatter, query decomposition, SQLite audit DB, 7 LLM providers) tightly bound to its own ingest discipline.
  • Hermes ecosystem cross-reference: Hermes Agent (Nous Research / Teknium) ships an official optional/research/qmd skill (v1.0.0, MIT). The Hermes wrapper exposes 5 MCP tool names (mcp_qmd_search, mcp_qmd_vsearch, mcp_qmd_deep_search, mcp_qmd_get, mcp_qmd_status) — slightly different naming than the bundled mcp__qmd__* Claude Code wrapper. Composes with Hermes deployments.
  • Composes with Claude Managed Agents — a Managed Agent could call QMD’s MCP server for retrieval inside long-running workflows. The HTTP-transport mode is specifically designed for shared, long-lived model loading across requests.
  • Pairs with The Expanding Toolkit (Lucas) — Anthropic’s “scaffolding moves into the model” thesis. QMD is local-scaffolding-not-in-the-model; the model gets the retrieved context as input and decides what to do with it.
  • Sibling pattern to last30days-skill — both are measurement-first tools (last30days quantifies ranked engagement; QMD ships its own bench subcommand for retrieval-quality eval). Both ship CLI-first with agent integration as a wrapper, validating the Printing Press thesis (CLI tier 1 / API tier 2 / MCP tier 3).
  • Calibrates Karpathy’s wiki-vs-semantic-RAG tradeoff — Karpathy’s pattern starts with grep-and-links; the wiki adopted QMD as the scale-ceiling fix without abandoning the wikilink layer. Both layers are load-bearing in production: wikilinks for navigation + curated cross-references; QMD for question-answering and gap-detection.

Implementation

  • Tool/Service: QMD (tobi/qmd v2.1.0) — local hybrid-search MCP for markdown.
  • Setup:
    • Install (npm): npm install -g @tobilu/qmd (note: package name is @tobilu/qmd, not qmd).
    • Install (Bun): bun install -g @tobilu/qmd (per Raycast extension prerequisites).
    • Install (Homebrew): brew install qmd (this user’s path; /opt/homebrew/bin/qmd).
    • Prereq (macOS): brew install sqlite — Apple’s bundled SQLite lacks FTS5 extension support.
    • Prereq (Node): Node 22+ or Bun 1.0+.
    • First-run model download: ~2GB pulled from HuggingFace, cached locally. One-time.
    • Add a collection: qmd collection add <name> <path> --pattern "**/*.md". Multiple collections are first-class.
    • Generate embeddings: qmd embed (or qmd embed -f to force re-embed). Once per collection unless content changes.
    • Re-index after edits: qmd update (or qmd update --pull to git-pull first). Wire into a post-commit hook for live freshness.
    • MCP wiring (Claude Code): in .mcp.json:
      {
        "mcpServers": {
          "qmd": {
            "type": "stdio",
            "command": "qmd",
            "args": ["mcp"],
            "env": {}
          }
        }
      }
      Then restart Claude Code to activate.
  • Cost:
    • Tool: free, MIT, no commercial licensing friction.
    • Disk: ~2GB for models + roughly 60MB per few-hundred-doc collection (this user’s karpathy-wiki index is 62.1MB across 411 docs / 8,497 vectors).
    • Compute: none ongoing — all local. CPU/GPU only during query and embedding.
    • No API tokens. Zero LLM cost regardless of query volume.
  • Integration notes:
    • Single-file SQLite index at ~/.cache/qmd/index.sqlite (per-user). Trivially backup / sync.
    • AST chunking active for ts/tsx/js/python/go/rust files — code identifiers indexed at function/class granularity, not paragraph-cut.
    • Daemon mode (HTTP) keeps models warm in memory across queries — drops latency from cold-start ~10s to warm ~2-3s. Recommended for any session that issues 5+ queries.
    • Multi-collection queries target by name (collections: ["wiki", "weomarketly-wiki"]) — agents can scope retrieval per question.
    • Structured query JSON lets agents specify retrieval mode per sub-query: {type: "lex", query: "..."} for exact strings, {type: "vec", query: "..."} for semantic, {type: "hyde", query: "..."} for hypothetical-document expansion.
    • qmd context add — attach a human-written summary to a collection or path. Improves reranker calibration. Cheap to populate; high marginal value.
    • qmd bench — search-quality fixture testing. Use it after model swaps (e.g., when changing QMD_EMBED_MODEL for multilingual).
    • qmd cleanup — clear caches, vacuum DB. Run quarterly or after deleting many docs.
    • Models customizable via env vars: QMD_EMBED_MODEL for embeddings, QMD_RERANKER_MODEL for the reranker. For multilingual vaults, swap to Qwen3-Embedding (119 languages including CJK).

Open Questions

  • Reranker quality on technical / code-heavy content. qwen3-reranker-0.6b is a small reranker. For dense technical content (API references, code-with-prose mix), does it outperform the no-rerank baseline by enough to justify the latency? qmd bench is the answer mechanism but no public benchmark numbers ship with the README.
  • Index re-build cost at scale. The user’s wiki is 411 docs / 62MB indexed. At 5,000 docs how long does qmd update take, and is update truly incremental (delta only) or does it re-tokenize the whole collection?
  • HTTP daemon stability. Long-lived process holding 2GB of models in memory — what’s the failure mode at 24-hour uptime? Is there a built-in restart heuristic or does it leak / OOM?
  • Multilingual rerank quality. Swapping QMD_EMBED_MODEL to Qwen3-Embedding extends embeddings to 119 languages, but the reranker stays on Qwen3-Reranker-0.6B. Does the reranker handle non-English content well, or is that a known weak point?
  • Comparison vs cloud retrieval (e.g., Voyage-3, Cohere Rerank, OpenAI text-embedding-3). Honest local-vs-cloud quality numbers — most of the local-RAG space waves at this. QMD’s local-only architecture means zero ongoing cost but the quality ceiling is what it is. A qmd bench fixture run against the same dataset with OpenAI text-embedding-3-large would settle the question.
  • Tobi’s roadmap intent. Solo-maintained side-project from a CEO with a day job. What’s the maintenance trajectory? Per the HN thread the author is “working on finetuning better models for query extension and reranking (finetune branch)” — but cadence post-launch is harder to commit to long-term.
  • Multi-collection cross-query relevance. When a query spans collections, does the reranker preserve per-collection context or flatten into one ranked list? The Hermes wrapper’s structured JSON has a collections: ["..."] field per-search; whether the reranker treats them differently isn’t documented.

Try It

  1. One-command install + smoke test. brew install qmd && qmd collection add notes ~/notes --pattern "**/*.md" && qmd embed && qmd query "<question about your notes>". Total time including the ~2GB model download: ~10 minutes on a decent connection.
  2. Wire it into a Claude Code project. Drop the .mcp.json snippet from the Implementation section into your project root, restart Claude Code, then ask the assistant a question that requires retrieval. The MCP qmd server will appear in tool discovery; the assistant uses mcp__qmd__qmd_query automatically when relevant.
  3. Compare grep vs qmd query on the same question. Pick a vague question against a 100+-doc markdown vault. Run grep -ri "..." and qmd query "..." side-by-side. Note where each wins. (Spoiler: grep is unbeatable for exact strings; QMD wins for “I know I wrote about this but can’t remember the words.”)
  4. Add a context to a collection. qmd context add <collection> --description "Personal AI engineering notes; mixes Claude Code tooling, prompt engineering, and applied case studies". Re-run a query and note whether reranker scores shift.
  5. Run the bench harness. Build a 20-question fixture file ({question, expected_paths}) and run qmd bench fixture.json. Record the score baseline before and after swapping QMD_EMBED_MODEL to gauge model-swap impact. (A hypothetical fixture sketch follows this list.)
  6. Daemon mode for active sessions. Start the HTTP daemon (qmd mcp --transport http --port 7799), point your client at it. Models stay warm; query latency drops to ~2-3s consistently. Worth it once your session issues more than ~5 queries.
  7. For Hermes Agent users: hermes skills install official/research/qmd and use the wrapper’s mcp_qmd_deep_search for hybrid queries. Same engine, slightly different tool naming.
  8. For the karpathy wiki specifically: the index already exists. Refresh after every ingest with qmd update && qmd embed (or use bin/post-ingest which wraps it). Live status via qmd status. Per-query examples live in this wiki’s CLAUDE.md § Wiki Retrieval and § Query operation decomposition pattern.
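
For step 5, a hypothetical fixture matching the {question, expected_paths} shape mentioned there — the exact schema qmd bench expects is not documented in the README, so validate this guess against qmd bench --help before relying on it.

```typescript
// Writes a minimal 2-entry bench fixture. The shape is an assumption based
// on the {question, expected_paths} hint in step 5, not a documented schema.
import { writeFileSync } from "node:fs";

const fixture = [
  { question: "what failure mode does position-aware blending prevent?", expected_paths: ["qmd.md"] },
  { question: "which env var swaps the embedding model?", expected_paths: ["qmd.md"] },
];
writeFileSync("fixture.json", JSON.stringify(fixture, null, 2));
// Then run: qmd bench fixture.json
```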