RAG and Vector Retrieval for Agents

Source: ai-research/nvidia-agentic-rag-vs-traditional-rag-2026.md, ai-research/mlm-agentic-rag-3-levels-2026.md, ai-research/weaviate-rag-chunking-strategies-2026.md, ai-research/tds-hybrid-search-reranking-production-rag-2026.md, ai-research/redgate-rag-hallucination-failure-points-2026.md, ai-research/snorkel-rag-failure-modes-2026.md, ai-research/digitalapplied-agentic-rag-patterns-2026.md, ai-research/atlan-what-is-rag-taxonomy-2026.md, ai-research/redhat-tool-rag-agent-scaling-2026.md; wiki case study: qmd-hybrid-search

RAG (retrieval-augmented generation) is the pattern of giving a language model access to an external knowledge store at answer time, instead of relying only on what it memorized during training. This wiki uses the term constantly — QMD runs a hybrid BM25+vector+rerank pipeline as this very wiki’s retrieval layer, and several other articles argue that compiled, curated knowledge beats raw RAG at this wiki’s scale — but no article has defined the term end to end. This one does: what RAG actually is, the building blocks that make it work (embeddings, chunking, vector stores, hybrid search), where it breaks in production, and how agent architectures use retrieval differently than a plain RAG chatbot does.

Key Takeaways

RAG = retriever + generator, nothing more exotic than that. A retriever finds relevant text from an external store; a generator (an LLM) turns that retrieved text plus the user’s question into an answer. Formalized by Lewis et al. at Facebook AI Research (now Meta AI) in a 2020 paper; by 2026 it’s the default architecture for grounding an LLM in data it wasn’t trained on.
Chunking is the highest-leverage, most underinvested step in the whole pipeline. How documents get split before embedding determines what can ever be retrieved — no amount of model quality fixes a bad chunk boundary downstream.
Hybrid search (BM25 + vector + rerank) consistently beats either retrieval method alone. This wiki runs exactly this pattern today: QMD is a live, concrete instance, not a hypothetical.
Most production RAG failures are retrieval failures wearing a hallucination costume. The model usually does exactly what it should with the context it was handed — the context itself was wrong, incomplete, or misleadingly similar.
Agentic RAG turns retrieval into a tool the agent calls repeatedly, not a preprocessing step run once — at 3-10x the token cost of single-pass RAG, so it’s an escalation path, not a default.
Tool RAG is a distinctly agent-only extension with no chatbot equivalent — retrieving the right tool definitions from a large registry instead of the right documents, because a huge tool list overloads context the same way a huge document corpus does.
This wiki already argues RAG isn’t always the right layer — see Pinecone Nexus and Karpathy’s Techniques. This article is the missing foundation those arguments assume the reader already has.

What RAG Actually Is

The plain-language version: instead of asking a language model to answer purely from what it memorized during training (its “parametric memory”), a RAG system first searches an external knowledge source for relevant material, then hands that material to the model as extra context before it generates an answer.
Origin: the term comes from a 2020 paper, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al., Facebook AI Research, now Meta AI, with University College London and NYU), which paired a pretrained sequence-to-sequence model with a dense passage retriever.
Why it exists — three problems it solves more cheaply than fine-tuning or ever-bigger context windows:
- Freshness. Model weights are frozen at training time. A retrieval index can be updated continuously without retraining anything.
- Privacy and specificity. Internal docs, a private codebase, a client’s data were never in the model’s training set and shouldn’t need to be — RAG lets the model reason over them without a training run touching them.
- Auditability. A RAG answer can cite the specific chunk it came from. A parametric answer can’t — there’s nothing to point to.
RAG is not semantic search, and it’s not fine-tuning. Semantic search is one component RAG uses (find the relevant text); RAG is the full loop that feeds that text to a generator to produce a synthesized answer. Fine-tuning bakes knowledge into model weights; RAG keeps knowledge external and swappable. The practical tell: if updating your knowledge base means retraining a model, that isn’t RAG.
RAG and long context are complementary, not competing, by 2026. Stuffing an entire corpus into a huge context window is expensive per call, still suffers the “lost in the middle” problem (material buried mid-prompt gets under-attended), and doesn’t help when the answer needs data that changed after the context was assembled. The common production shape pairs them: RAG narrows a large corpus down to a small relevant candidate set, and the full retrieved documents (not truncated chunks) go into a moderate-length context window from there. ^[inferred]
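
The retrieve-then-generate loop described in this section can be sketched in a few lines. This is a minimal illustration, not any particular library’s API: the word-overlap retriever stands in for a real search index, and `generate` is a hypothetical callable standing in for whatever LLM client is in use.

```python
import re
from typing import Callable, List

NO_CONTEXT = "No relevant passage found in the knowledge store."


def tokenize(text: str) -> set:
    """Lowercase and split into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank passages by how many words they share with the query."""
    query_terms = tokenize(query)
    scored = [(len(query_terms & tokenize(p)), p) for p in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for overlap, passage in scored[:k] if overlap > 0]


def build_prompt(query: str, passages: List[str]) -> str:
    """Number the passages so the generated answer can cite its source."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below and cite passage numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


def rag_answer(query: str, corpus: List[str],
               generate: Callable[[str], str], k: int = 2) -> str:
    """Retrieve first, then hand the retrieved text plus the question to the generator."""
    passages = retrieve(query, corpus, k)
    if not passages:
        # Saying so is preferable to letting the model fill the gap from memory.
        return NO_CONTEXT
    return generate(build_prompt(query, passages))
```

Because the knowledge lives in `corpus` rather than in model weights, updating an answer means editing a passage, not retraining anything.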

The Building Blocks

Embeddings

An embedding model turns a chunk of text into a fixed-length vector of numbers, positioned so that texts with similar meaning end up geometrically close together.
This is a compression, and compression loses information. A bi-encoder squeezes an entire chunk’s meaning into one vector — which is exactly why a chunk about “exponential backoff” and a chunk about “dead-letter queue threshold” can end up close together in vector space even though a human reader would never confuse them (see Failure Modes below).
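Embedding models have hard token limits (historically ~8,191 tokens for a model like text-embedding-ada-002). Depending on the API or library, oversized chunks are either rejected outright or silently truncated, and the silent case is the dangerous one: the tail of the chunk is simply never embedded.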
Domain-specific or fine-tuned embedding models capture domain vocabulary better than general-purpose ones. QMD, this wiki’s own retrieval layer, defaults to embeddinggemma-300M and can swap to Qwen3-Embedding for multilingual (119-language) coverage via an environment variable — see QMD.
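
One cheap guard against the truncation problem is to flag oversized chunks before they reach the embedding model. The sketch below is an assumption-heavy approximation: it counts whitespace-separated words as a stand-in for real tokens (actual tokenizers usually produce more tokens than words), so it applies a safety margin. The function names and the margin value are illustrative, not taken from any library.

```python
from typing import List


def approx_token_count(text: str) -> int:
    """Rough proxy: whitespace-separated words. Real tokenizers usually count higher."""
    return len(text.split())


def oversized_chunks(chunks: List[str], max_tokens: int = 8191,
                     safety_margin: float = 0.75) -> List[int]:
    """Return the indexes of chunks likely to exceed the embedding model's input limit.

    The budget is max_tokens scaled down by safety_margin, because a word
    count underestimates the true token count.
    """
    budget = int(max_tokens * safety_margin)
    return [i for i, chunk in enumerate(chunks)
            if approx_token_count(chunk) > budget]
```

In a real pipeline the word count would be replaced by the embedding model’s own tokenizer, and flagged chunks would be re-split rather than embedded as-is.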

Chunking

Chunking is splitting documents into indexable pieces before embedding them — the step most teams under-invest in relative to its impact on retrieval quality.
The core tradeoff: chunks too large dilute the embedding (the compression problem described under Embeddings, where everything gets averaged into one vector); chunks too small lose the surrounding context a reader, human or model, needs to interpret them correctly.
A useful gut-check: if a chunk makes sense to you, read in isolation, it’ll make sense to the LLM too. If it doesn’t, no retrieval algorithm downstream will fix that.
Common strategies, roughly in order of sophistication: fixed-size (character or token count, with overlap), recursive (respect paragraph and sentence boundaries first), document-structure-aware (split on Markdown headings or HTML tags), and semantic (cluster by meaning rather than position).
Metadata and parent-child retrieval mitigate the tradeoff rather than eliminating it: attach title, section, and date metadata to each chunk, and store a link from a small retrieved chunk back to its larger parent section so the system can expand context on demand (“small-to-big” retrieval) instead of guessing the right chunk size up front.
QMD’s concrete answer: boundary-aware chunking at roughly 900 tokens with 15% overlap for markdown (headings, code fences, paragraphs), but full tree-sitter AST parsing for code files, chunking at function and class boundaries instead of arbitrary character counts. This is the same AST-chunking choice GitNexus and Graphify make, applied to retrieval instead of graph topology.
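
The simplest strategy in the list above, fixed-size windows with overlap, can be sketched as follows. The defaults borrow the 900-token and 15% figures quoted for QMD, but this is not QMD’s implementation: it does no boundary detection at all, and it assumes the text has already been split into a list of tokens by some tokenizer.

```python
from typing import List


def chunk_with_overlap(tokens: List[str], chunk_size: int = 900,
                       overlap_ratio: float = 0.15) -> List[List[str]]:
    """Split a token list into fixed-size windows that share a trailing overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 because overlap < chunk_size

    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks
```

The overlap exists so that a sentence cut by one window boundary still appears whole in the neighboring chunk. A boundary-aware chunker would move each cut to the nearest heading or paragraph break instead of cutting at an exact count.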

Vector Stores

A vector database (or a vector index inside a general-purpose database) stores every chunk’s embedding and finds the nearest neighbors to a query embedding, usually by cosine similarity.
Cosine similarity measures topical proximity, not correctness. It’s a proxy for “these are about the same kind of thing,” not “this is the specific fact you need.” A query for “Product Version 3.2” can retrieve a highly-scored chunk about “Version 3.1” purely because the two passages are 95% identical: a specifically wrong retrieval, not a random one.
At this wiki’s scale (hundreds of markdown files), the vector store is a single SQLite file with an FTS5 virtual table for the keyword side and vectors stored alongside — QMD’s entire index is one portable, git-trackable file. Enterprise-scale RAG reaches for dedicated vector databases (Pinecone, Weaviate, Qdrant, pgvector) instead, but the underlying nearest-neighbor-search idea is identical.
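
At its core, the nearest-neighbor idea is a similarity function plus a sort. The brute-force sketch below uses plain Python lists as vectors; a real vector database replaces the linear scan with an approximate index, and the three-dimensional example vectors in the usage note are made up for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_chunks(query_vec: List[float], index: Dict[str, List[float]],
                   k: int = 3) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the top k."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With hypothetical vectors such as `{"v3.1": [0.9, 0.1, 0.0], "v3.2": [0.88, 0.12, 0.0]}`, a query vector near either one scores both above 0.99. That is the version-mismatch problem in miniature: the scores alone cannot tell the two chunks apart.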

Hybrid Search — BM25 + Vector + Rerank

This is the part of the stack this wiki runs in production, so it earns the most detail.

BM25 (keyword/sparse search) scores a document against a query using term frequency, inverse document frequency (rare terms count more), and length normalization. It’s exact-match-only — it can’t see that “configuration override” and “custom settings” mean the same thing — but that rigidity is also its strength: it never loses an exact technical term the way a compressed embedding can.
Dense vector search (semantic) finds conceptually related text even with zero keyword overlap — it catches “escalation” when the source document says “severity triage.” It’s the complementary failure mode to BM25’s rigidity.
Fusion combines both ranked lists, most commonly via Reciprocal Rank Fusion (RRF), which merges by rank position rather than raw score. That matters because BM25 scores are unbounded while cosine similarity is bounded (between -1 and 1, often rescaled to 0–1), so the raw scores aren’t directly comparable.
Reranking scores the fused candidates again, this time with a cross-encoder that looks at the full query and each candidate together, rather than as two independently-compressed vectors — far more accurate, far more expensive, which is why it only runs on the small top-N survivors of fusion rather than the whole corpus.
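
The sparse-scoring and fusion stages described above are small enough to sketch directly. The BM25 variant below uses the non-negative IDF form with commonly used defaults (`k1=1.5`, `b=0.75`), and the RRF constant `k=60` is the conventional value. None of these parameters come from this article’s sources, so treat them as assumptions to tune.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query_terms: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query terms with BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # exact-match only: no credit for synonyms
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists by summing 1 / (k + rank); raw scores are never compared."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)
```

Because RRF only looks at positions, a document ranked second by both retrievers can beat one ranked first by a single retriever and missing from the other, which is the behavior hybrid search relies on.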
Measured impact, one production case study: hybrid search over dense-only search raised Context Recall from 0.74 to 0.83 (BM25 pulling in exact-term matches the dense model had ranked too low); adding a reranker on top raised Context Precision from 0.71 to 0.79 by pushing near-miss chunks out of the results actually passed to the model. Recall and precision are different failure modes, fixed at different pipeline stages — this is why the two-stage design exists instead of picking one technique.
QMD is this exact pattern, running inside this wiki right now: (1) LLM query expansion generates 2 extra phrasings of the question; (2) BM25 and vector search run in parallel across all 3 query variants; (3) Reciprocal Rank Fusion merges the 6 resulting ranked lists; (4) an LLM reranker (qwen3-reranker-0.6b) re-scores the top candidates; (5) position-aware blending weights RRF more heavily at the very top of the results (75% RRF / 25% reranker for the top 3) and the reranker more heavily further down — a calibration that stops the reranker from burying an exact keyword hit under something merely similar. Full detail and setup instructions: QMD — Local Hybrid-Search MCP.
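
Step (5), position-aware blending, can be illustrated with a small weighting table. Only the first tier (75% RRF / 25% reranker for the top 3) comes from the description above; the lower tiers, the function names, and the assumption that both scores are already normalized to a 0–1 range are illustrative guesses, not QMD’s actual code.

```python
from typing import Dict, List, Sequence, Tuple

# (last_rank_in_tier, rrf_weight). Only the first tier is stated in the
# article; the second tier and the tail weight are hypothetical values.
DEFAULT_TIERS: Sequence[Tuple[int, float]] = ((3, 0.75), (10, 0.60))
DEFAULT_TAIL_WEIGHT = 0.40


def rrf_weight_for_rank(rank: int,
                        tiers: Sequence[Tuple[int, float]] = DEFAULT_TIERS,
                        tail_weight: float = DEFAULT_TAIL_WEIGHT) -> float:
    """Return how much weight the RRF score gets at a given 1-based fused rank."""
    for last_rank, weight in tiers:
        if rank <= last_rank:
            return weight
    return tail_weight


def blend(fused_order: List[str], rrf_score: Dict[str, float],
          rerank_score: Dict[str, float]) -> List[Tuple[str, float]]:
    """Mix RRF and reranker scores, trusting RRF more near the top of the list."""
    blended = []
    for rank, doc_id in enumerate(fused_order, start=1):
        w = rrf_weight_for_rank(rank)
        blended.append((doc_id, w * rrf_score[doc_id] + (1 - w) * rerank_score[doc_id]))
    blended.sort(key=lambda pair: pair[1], reverse=True)
    return blended
```

The effect is that a document at the top of the fused list needs a much worse reranker score to be displaced than one further down, which protects exact keyword hits from being buried.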

Agentic RAG — How Agents Use Retrieval Differently Than Chatbots

Traditional (naive) RAG is a single pass: retrieve once, generate once, done. Fixed pipeline — query in, ranked chunks out, answer out. Fast, cheap, and fine for simple factual lookups against a well-scoped corpus.
Agentic RAG makes retrieval a tool the agent decides to call, evaluate, and re-call — not a preprocessing step that runs automatically before generation. The agent reads what it retrieved, judges whether it actually answers the question, and chooses: retry with a refined query, decompose into sub-questions, pull from a different corpus, or commit to an answer.
Five canonical agentic-RAG patterns, useful as a checklist: iterative retrieval (retry with a better query), query decomposition (split a compound question into parts, retrieve each separately), hypothesis-driven retrieval (form a candidate answer, then retrieve to confirm or refute it), cross-corpus triangulation (check the same claim against multiple independent sources), and evidence-weighted synthesis (weigh conflicting retrieved evidence rather than averaging it).
This wiki already runs a version of pattern #2. The vault schema’s Query operation decomposes any question containing “vs,” “compared to,” or a conjunction joining 2+ named entities into up to 4 focused sub-queries, runs them in parallel against QMD, and merges the results — agentic RAG in miniature, without calling it that. ^[inferred]
The cost is real: agentic RAG burns roughly 3–10x more tokens than single-pass RAG. Treat it as an escalation path for queries that classic RAG demonstrably fails on, not a wholesale replacement — the same “start with the simplest pattern that solves your problem” discipline 12-Factor Agents argues for generally.
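
The retrieve, judge, and retry loop can be sketched with three pluggable callables. Everything here is hypothetical scaffolding: `search`, `is_sufficient`, and `refine` stand in for a retrieval tool and two LLM calls, and `max_rounds` is the budget cap that keeps the token multiplier bounded.

```python
from typing import Callable, List, Tuple


def agentic_retrieve(
    question: str,
    search: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    refine: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[List[str], int]:
    """Retrieve, judge the evidence, and retry with a refined query until satisfied.

    Returns the accumulated evidence and the number of rounds actually used.
    """
    query = question
    evidence: List[str] = []
    for round_number in range(1, max_rounds + 1):
        for chunk in search(query):
            if chunk not in evidence:  # keep evidence deduplicated across rounds
                evidence.append(chunk)
        if is_sufficient(question, evidence):
            return evidence, round_number
        # Not enough yet: ask for a better query, informed by what was found.
        query = refine(question, evidence)
    return evidence, max_rounds
```

With `max_rounds=1` this degenerates into single-pass RAG, which makes it straightforward to log how often the extra rounds were actually needed before paying for them everywhere.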
Tool RAG is a distinct, newer, agent-only extension of the same idea, and it has no equivalent in a plain RAG chatbot. Instead of retrieving relevant knowledge, the agent retrieves the relevant tool definitions out of a large tool registry before deciding what to call: the same embedding, hybrid-retrieval, and reranking mechanics applied to an agent’s own capability surface instead of a document corpus. The problem is real: giving a model dozens or hundreds of tools at once overloads its context the same way a huge document corpus would, and it starts picking the wrong tool or hallucinating an answer instead of acting. The RAG-MCP research (an independent paper built on Anthropic’s Model Context Protocol) showed a basic tool-retrieval strategy boosting tool-selection accuracy from 13% to 43% on a large toolset while dramatically cutting prompt size; other reported results claim roughly triple tool-invocation accuracy at half the prompt length. This is still early, with mostly research-lab prototypes (COLT, Graph RAG-Tool Fusion, Tool2vec) rather than plug-and-play production software as of this writing, but the direction matters for any agent whose tool count is growing past what fits comfortably in a system prompt.
GraphRAG is the relationship-aware sibling pattern, retrieving connected entities across a knowledge graph instead of flat text similarity. It is the right tool when an answer depends on traversing relationships (who reports to whom, what depends on what) rather than finding topically similar passages. This wiki covers Graph RAG in depth via GitNexus and Graphify rather than re-explaining it here.
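
Tool RAG, described earlier in this section, reduces to running retrieval over tool descriptions instead of documents. The sketch below uses word overlap as a stand-in for the embedding and reranking stack. The registry contents, the stopword list, and the function names are all made up for illustration.

```python
import re
from typing import Dict, List

STOPWORDS = {"a", "an", "the", "to", "for", "this", "their", "of"}


def content_terms(text: str) -> set:
    """Lowercased alphanumeric tokens, minus a tiny illustrative stopword list."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def select_tools(request: str, registry: Dict[str, str], k: int = 3) -> List[str]:
    """Return the names of the k tools whose name and description best match the request.

    Only these definitions would be placed in the prompt; the rest of the
    registry stays out of the context window.
    """
    request_terms = content_terms(request)
    scored = []
    for name, description in registry.items():
        overlap = len(request_terms & content_terms(name + " " + description))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]
```

For example, given a registry with `create_invoice`, `send_email`, and `get_weather`, a request like "email this customer an invoice" selects the first two and leaves the weather tool out of the prompt entirely.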

Common Failure Modes

Grounded in two independent practitioner write-ups — Red Gate’s six failure points and Snorkel AI’s retrieval/generation split (itself citing Barnett et al.’s “Seven Failure Points When Engineering a RAG System”). Most of these are retrieval failures that present as hallucinations: the model did exactly what it should with the context it was handed.

Missing content. The answer simply isn’t anywhere in the indexed corpus. The model either says so (good) or fills the gap from its own training data — a hallucination that looks like a retrieval success from the outside.
Chunking damage. A conditional clause split from the condition it depends on (for example, “…if the transaction exceeds €10M”) becomes a flatly wrong statement once retrieved in isolation. This is a chunking-time failure, not a generation-time one — fixing the prompt won’t fix it.
Cosine similarity’s blindness to precision. High embedding-space similarity is not the same claim as “this is the right answer.” A query about “Version 3.2” can retrieve “Version 3.1” content at the top of the ranked list because the two passages are 95% identical: confidently, specifically wrong, not randomly wrong.
Embedding-space mismatch after a model upgrade. Swap embedding models mid-corpus without re-indexing everything, and old chunks become vectors in a different mathematical space than new queries — a silent, hard-to-diagnose retrieval degradation.
Lost in the middle. Raising top-k “to be safe” backfires: models systematically over-attend to the start and end of a prompt and under-attend to the middle, so a critical chunk sitting at position 5 of 10 can be functionally invisible even though it was successfully retrieved.
Over-retrieval and noise flood. Pulling 20 near-duplicate chunks instead of the 3 that matter forces the model to reason over noise, which measurably dilutes accuracy and slows responses — more context is not a substitute for correct context.
Retrieved but not extracted. The answer is genuinely present in the retrieved chunks, but the model fails to pull it out cleanly because of surrounding noise, ambiguity, or a contradicting nearby chunk.
The practical fix most teams skip: measure retrieval precision (does the retrieved chunk actually contain the answer?) as a metric completely separate from generation quality or cosine similarity score. Most RAG debugging time goes into prompt tweaking when the bug is one layer down, in what got retrieved in the first place.
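
Measuring retrieval on its own needs nothing more than a labeled set of which chunks actually contain the answer. The functions below are simplified, set-based versions of precision and recall; evaluation frameworks often use rank-aware variants, so the numbers will not match those tools exactly. The chunk IDs in the usage note are hypothetical.

```python
from typing import List, Set


def retrieval_precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of retrieved chunks that actually contain the answer."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def retrieval_recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of answer-bearing chunks that made it into the retrieved set."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved) & relevant)
    return hits / len(relevant)
```

If retrieval returns `["c1", "c7", "c9", "c4"]` and the labeled answer chunks are `{"c1", "c4", "c5"}`, precision is 0.5 and recall is about 0.67. Low precision points at noise flooding the prompt; low recall points at chunking or search missing the evidence entirely.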

Where This Wiki Pushes Back — RAG vs. Compiled Knowledge

Several articles in this wiki use “RAG” as a foil rather than a recommendation — this primer is the definition those arguments assume. Read without it, shorthand like “no fancy RAG required,” “raw-source RAG,” or “agentic RAG spends ~85% of its effort on retrieval” is doing a lot of unexplained work.

Karpathy’s LLM-Wiki Techniques makes the core claim directly: at the scale of “hundreds of articles,” a curated wiki of links and indexes outperforms a semantic-RAG pipeline on cost, since it spends tokens only, with no embeddings, vector database, or chunking pipeline to maintain. Semantic RAG wins at larger scale, on the order of hundreds of thousands of documents.
Pinecone Nexus covers the vector-database industry’s own admission of this dynamic: agentic RAG spends roughly 85% of its effort on retrieval, task completion plateaus around 50-60%, and a compiled knowledge layer — reasoning done once at ingest time instead of on every query — beat agentic RAG 100% vs. 98.7% task completion at roughly 8x lower token cost in Pinecone’s own benchmark.
Agent Wikis has the sharpest head-to-head number: same model, same token budget, a compiled wiki hit 89% correct / 7% hallucination versus raw-source RAG’s 63% / 26% on the same questions. The lever was curation, not retrieval method.
GBrain makes the same case for retrieval architecture rather than compilation: a self-wiring knowledge graph beat a vector-RAG baseline by 38 points of precision@5 on relationship-shaped questions (“who works at Acme?”) that cosine similarity structurally can’t answer.
The honest synthesis, not a contradiction: none of these articles claim RAG is obsolete. RAG is still the right default when a corpus is too large or too fast-changing to compile, and Pinecone Nexus’s own verdict is that agentic RAG remains the easier-to-wire choice for exploratory, long-tail questions. What this wiki argues is narrower: at bounded scale, with repeatable question shapes, moving the expensive reasoning from query-time to ingest-time — compiling, curating, or graph-structuring — beats leaving an agent to rediscover the same retrieval strategy from scratch on every query.

Try It

Feel the difference between keyword and hybrid retrieval firsthand. If QMD or a similar tool is available, run the same vague question through pure BM25 (qmd search), pure vector (qmd vsearch), and the full hybrid pipeline (qmd query), and compare what comes back — see QMD’s own Try It section for the exact commands.
Run the chunk gut-check. Pull one real chunk out of any RAG pipeline in use and read it with no other context. If it doesn’t make sense standalone, it won’t make sense to the model either — that’s a chunking fix, not a prompt fix.
Tune the hybrid-search blend empirically, not by guessing. If a vector store supports a BM25-vs-vector blend parameter (often called alpha), build a small labeled set of 20-50 real queries with known correct chunks, and measure precision and recall at a few blend values before picking a default.
Before building an agentic-RAG loop, prove single-pass RAG actually fails first. Log the cases where one-shot retrieval gives a wrong or incomplete answer, and add iteration or decomposition for that failure class specifically — it’s cheaper, and the value is provable rather than assumed.
Separate the evals. When debugging a RAG system that “sounds confident but is wrong,” check retrieval precision before touching the prompt — retrieval and generation are different failure modes that need different fixes, and most debugging time gets spent on the wrong one.
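
The blend-tuning exercise above can be run offline once each labeled query has its BM25 and vector scores recorded. This sketch assumes the convention that `alpha=1.0` means pure vector and `alpha=0.0` means pure keyword, and it min-max normalizes each score list before mixing. The data layout (dicts with `bm25`, `vector`, and `relevant` keys) is a made-up format for illustration, and only recall at k is computed; precision can be added the same way.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to 0..1; if all scores are equal, every document gets 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def blended_top_k(bm25: Dict[str, float], vector: Dict[str, float],
                  alpha: float, k: int) -> List[str]:
    """Rank by alpha * vector + (1 - alpha) * bm25 on normalized scores."""
    b, v = min_max(bm25), min_max(vector)
    combined = {doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * b.get(doc_id, 0.0)
                for doc_id in set(b) | set(v)}
    return sorted(combined, key=lambda doc_id: (-combined[doc_id], doc_id))[:k]


def sweep_alpha(labeled: List[dict], alphas: List[float], k: int = 3) -> Dict[float, float]:
    """Mean recall at k across the labeled queries, for each candidate alpha."""
    results: Dict[float, float] = {}
    for alpha in alphas:
        recalls = []
        for query in labeled:
            top = blended_top_k(query["bm25"], query["vector"], alpha, k)
            relevant = query["relevant"]
            recalls.append(len(set(top) & relevant) / len(relevant))
        results[alpha] = sum(recalls) / len(recalls)
    return results
```

Running `sweep_alpha(labeled, [0.0, 0.25, 0.5, 0.75, 1.0])` over 20 to 50 real queries gives a small table to choose a default from, instead of a guess.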

Open Questions

QMD’s own reranker-quality benchmarks against cloud rerankers (Voyage-3, Cohere Rerank) remain unresolved per its own Open Questions — this primer inherits that gap rather than resolving it.
This wiki has deep Graph RAG coverage (GitNexus, Graphify) and deep hybrid-search coverage (QMD), but no single article with a decision framework for choosing between hybrid search, GraphRAG, and compiled knowledge for a new project — a natural follow-up primer.
Tool RAG is flagged in the source material as still mostly research-prototype-stage as of late 2025/early 2026 (COLT, Graph RAG-Tool Fusion, Tool2vec) rather than a mature, production-ready category — worth a dedicated revisit once first-party production case studies exist.
None of the sources this article draws on report standardized numbers for how agentic-RAG token cost scales with corpus size specifically — the “3-10x” multiplier is reported as a general range, not tied to any particular retrieval-corpus size.

Related

QMD — Local Hybrid-Search MCP for Markdown Knowledge Bases — this wiki’s own production hybrid-search implementation; the concrete worked example throughout this article
Karpathy’s LLM-Wiki Techniques for Claude Code — the “wiki vs. semantic-RAG” tradeoff thesis this primer underpins
Pinecone Nexus — A Compiled Knowledge Layer — the vector-database industry’s own critique of agentic RAG’s retrieval overhead
Agent Wikis — head-to-head accuracy and hallucination benchmark: compiled wiki vs. raw-source RAG
GBrain — Garry Tan’s Open-Source AI Brain — knowledge-graph retrieval outperforming vector RAG on relationship questions, with a full ablation benchmark
12-Factor Agents — HumanLayer’s Framework for Reliable LLM Applications — the context-engineering discipline (own your context window; retrieval over stuffing) that agentic RAG is one instance of
crawl4ai — Open-Source LLM-Friendly Web Crawler & Scraper — the ingestion side of a RAG pipeline: turning arbitrary web pages into the clean markdown a retriever can index
GitNexus — Zero-Server Code Intelligence Engine with Graph RAG — GraphRAG in production, the relationship-aware sibling pattern to hybrid search

Jonathon's AI Wiki
