crawl4ai — Open-Source LLM-Friendly Web Crawler & Scraper

Source: ai-research/crawl4ai-github-readme-2026-06-16.md (https://github.com/unclecode/crawl4ai)

crawl4ai (by unclecode) is the most-starred web crawler on GitHub (~68.7k stars, 2026-06-16) — an Apache-2.0 Python library that turns arbitrary web pages into clean, LLM-ready Markdown for RAG, agents, and data pipelines. Unlike hosted scraping APIs it runs fully local with no API keys, accounts, or rate limits, and ships a Docker server exposing a REST API and an MCP server so Claude Code and other agents can call it directly.

Key Takeaways

“Turns the web into clean, LLM-ready Markdown.” The core output preserves headings, tables, and code blocks, plus a “Fit Markdown” variant that strips boilerplate/nav noise via PruningContentFilter or BM25ContentFilter — cutting tokens before content reaches an LLM.
Two extraction paths, one cost lever. JsonCssExtractionStrategy pulls structured data with CSS/XPath selectors and no LLM call (fast, free, deterministic); LLMExtractionStrategy uses a model (any OSS or proprietary provider) for schema-based extraction when rules aren’t enough. Prefer the CSS schema to avoid token spend; reach for the LLM strategy only for messy pages.
Real browser control. Built on Playwright: JS execution with dynamic-content waits, session persistence/auth, proxy support, custom headers/cookies/UA, Shadow-DOM flattening, lazy-load + full-page scroll, screenshots and PDF capture, iframe extraction, and an async browser pool for concurrent crawling.
Three deploy shapes. (1) pip install crawl4ai Python library; (2) Docker server (unclecode/crawl4ai) with a REST API, monitoring dashboard, interactive playground, browser pooling, and JWT auth; (3) MCP server on the Docker deployment for direct connection to Claude Code and other AI tools.
Free and unmetered. Apache-2.0, no keys or rate limits — you own the runtime. A commercial Cloud API is in closed beta (pitched as “drastically more cost-effective” than existing solutions).
The self-hosted OSS counterpart to managed fetch/scrape APIs like TinyFish Fetch, Browserbase, and Firecrawl — trade managed reliability for zero cost and full control.

Why It Matters for Agents & RAG

Markdown is what LLMs parse natively. crawl4ai is the client-side inverse of the agent-readable-web thesis: instead of waiting for sites to publish agent-friendly formats, it converts the existing human web into the markdown payload agents read best — on the fly, without the site’s cooperation.
Token economy is a first-class feature. Fit Markdown, content filters, and CSS-schema extraction all exist to reduce what you feed a model, which is the dominant cost in any scrape-to-LLM pipeline.
Drop-in for Claude Code. The Docker MCP server lets an agent crawl-to-markdown as a tool call rather than shelling out to a custom script.

Implementation

Tool/Service: crawl4ai — github.com/unclecode/crawl4ai · Apache-2.0 · Python · v0.8.9 (2026-06-04) · ~68.7k★

Setup (Python library):

pip install -U crawl4ai
crawl4ai-setup        # installs/configures the Playwright browsers

import asyncio
from crawl4ai import AsyncWebCrawler
 
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)        # clean, LLM-ready markdown
 
asyncio.run(main())

Setup (Docker server + REST API + MCP):

docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
# dashboard:  http://localhost:11235/dashboard
# REST API:   HTML extract, screenshot, PDF, JS execution (JWT-gated)
# MCP server: connects to Claude Code and other AI tools

Cost: Free / open-source; no API keys or rate limits. Commercial Cloud API in closed beta.

Integration notes:

Extraction strategies: JsonCssExtractionStrategy (CSS/XPath, no LLM), LLMExtractionStrategy (provider-configurable), LLMTableExtraction (chunks large tables for LLM parsing).
Fit Markdown is produced by configuring a content filter — PruningContentFilter or BM25ContentFilter. Chunking is topic / regex / sentence based, with cosine-similarity semantic selection.
Recent releases: v0.8.9 SSRF/Docker-API security patches; v0.8.5 anti-bot with proxy escalation + Shadow-DOM flattening; v0.8.0 deep-crawl crash recovery (resume_state / on_state_change) + prefetch=True (5-10× faster URL discovery).
--shm-size=1g matters: Chromium needs shared memory and crashes under concurrency without it.

Try It

Smoke test in 60 seconds: pip install -U crawl4ai && crawl4ai-setup, then run the AsyncWebCrawler snippet against a content-heavy URL. Add a PruningContentFilter and re-run to watch the Fit Markdown variant shrink the token count.
Wire it into Claude Code as a tool: run the Docker server and point an MCP client at the :11235 MCP endpoint — crawl-to-markdown becomes a callable tool instead of a bespoke script.
Skip the LLM where you can: for a known site, write a JsonCssExtractionStrategy schema (CSS/XPath) to pull structured fields deterministically at zero token cost; fall back to LLMExtractionStrategy only for unstructured pages.
Benchmark it against your hosted scraper: crawl the same 20 URLs through crawl4ai and your current TinyFish/Firecrawl path; compare token counts (Fit Markdown vs raw) and per-page cost before committing.

RAG and Vector Retrieval for Agents — A Practical Primer — the RAG pipeline crawl4ai’s Fit Markdown output feeds into: what happens next at chunking, embedding, and hybrid-search retrieval time.
TinyFish — Web Infra APIs for AI Agents — managed Search/Fetch/Browser; crawl4ai is the self-hosted OSS counterpart (TinyFish’s launch names Firecrawl as a competitor).
Firecrawl — the closest hosted competitor; genuinely overlapping on scrape/crawl, but Firecrawl’s /monitor (scheduled, AI-judged change and discovery detection) has no crawl4ai equivalent.
Browserbase Autobrowse — managed browser harness whose “probe with fetch first, escalate only if dynamic/gated” lesson maps directly onto crawl4ai’s fetch-first crawl.
CloakBrowser — Playwright-replacement focused on anti-bot fingerprint patches; crawl4ai covers the same stealth surface (proxy escalation, Shadow-DOM flattening) inside a full crawler.
Shopify Review Scraper — free, self-hosted, single-platform-deep Playwright scraper; same local-OSS posture, narrower scope.
ScrapeCreators — paid social-platform-deep scraping API; the managed counterpart for sites a generic crawler can’t easily reach.
The Agent-Readable Web — crawl4ai is the client-side inverse: it makes the human web agent-readable without the site re-plumbing itself.
Cowork + Apify Scraping — per-actor marketplace alternative to a self-hosted crawler.

Open Questions

Cloud API pricing and GA date are unannounced (closed beta; “apply for early access”).
Anti-bot posture head-to-head with dedicated stealth tools like CloakBrowser’s source-level Chromium patches is untested — crawl4ai advertises proxy escalation + Shadow-DOM flattening but publishes no bot-detection-suite pass rate.
Dual-use caveat: the anti-bot/proxy features carry the same per-target ToS and ethics considerations flagged for CloakBrowser; respect robots/ToS and crawl only what you’re permitted to.

Jonathon's AI Wiki

Explorer

crawl4ai — Open-Source LLM-Friendly Web Crawler & Scraper

Key Takeaways

Why It Matters for Agents & RAG

Implementation

Try It

Open Questions

Graph View

Table of Contents

Backlinks

Jonathon's AI Wiki

Explorer

crawl4ai — Open-Source LLM-Friendly Web Crawler & Scraper

Key Takeaways

Why It Matters for Agents & RAG

Implementation

Try It

Related

Open Questions

Graph View

Table of Contents

Backlinks