Source: ai-research/crawl4ai-github-readme-2026-06-16.md (https://github.com/unclecode/crawl4ai)
crawl4ai (by unclecode) is the most-starred web crawler on GitHub (~68.7k stars, 2026-06-16) — an Apache-2.0 Python library that turns arbitrary web pages into clean, LLM-ready Markdown for RAG, agents, and data pipelines. Unlike hosted scraping APIs it runs fully local with no API keys, accounts, or rate limits, and ships a Docker server exposing a REST API and an MCP server so Claude Code and other agents can call it directly.
Key Takeaways
- “Turns the web into clean, LLM-ready Markdown.” The core output preserves headings, tables, and code blocks, plus a “Fit Markdown” variant that strips boilerplate/nav noise via
PruningContentFilterorBM25ContentFilter— cutting tokens before content reaches an LLM. - Two extraction paths, one cost lever.
JsonCssExtractionStrategypulls structured data with CSS/XPath selectors and no LLM call (fast, free, deterministic);LLMExtractionStrategyuses a model (any OSS or proprietary provider) for schema-based extraction when rules aren’t enough. Prefer the CSS schema to avoid token spend; reach for the LLM strategy only for messy pages. ^[inferred] - Real browser control. Built on Playwright: JS execution with dynamic-content waits, session persistence/auth, proxy support, custom headers/cookies/UA, Shadow-DOM flattening, lazy-load + full-page scroll, screenshots and PDF capture, iframe extraction, and an async browser pool for concurrent crawling.
- Three deploy shapes. (1)
pip install crawl4aiPython library; (2) Docker server (unclecode/crawl4ai) with a REST API, monitoring dashboard, interactive playground, browser pooling, and JWT auth; (3) MCP server on the Docker deployment for direct connection to Claude Code and other AI tools. - Free and unmetered. Apache-2.0, no keys or rate limits — you own the runtime. A commercial Cloud API is in closed beta (pitched as “drastically more cost-effective” than existing solutions).
- The self-hosted OSS counterpart to managed fetch/scrape APIs like TinyFish Fetch, Browserbase, and Firecrawl — trade managed reliability for zero cost and full control. ^[inferred]
Why It Matters for Agents & RAG
- Markdown is what LLMs parse natively. crawl4ai is the client-side inverse of the agent-readable-web thesis: instead of waiting for sites to publish agent-friendly formats, it converts the existing human web into the markdown payload agents read best — on the fly, without the site’s cooperation. ^[inferred]
- Token economy is a first-class feature. Fit Markdown, content filters, and CSS-schema extraction all exist to reduce what you feed a model, which is the dominant cost in any scrape-to-LLM pipeline. ^[inferred]
- Drop-in for Claude Code. The Docker MCP server lets an agent crawl-to-markdown as a tool call rather than shelling out to a custom script.
Implementation
Tool/Service: crawl4ai — github.com/unclecode/crawl4ai · Apache-2.0 · Python · v0.8.9 (2026-06-04) · ~68.7k★
Setup (Python library):
pip install -U crawl4ai
crawl4ai-setup # installs/configures the Playwright browsersimport asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://www.nbcnews.com/business")
print(result.markdown) # clean, LLM-ready markdown
asyncio.run(main())Setup (Docker server + REST API + MCP):
docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
# dashboard: http://localhost:11235/dashboard
# REST API: HTML extract, screenshot, PDF, JS execution (JWT-gated)
# MCP server: connects to Claude Code and other AI toolsCost: Free / open-source; no API keys or rate limits. Commercial Cloud API in closed beta.
Integration notes:
- Extraction strategies:
JsonCssExtractionStrategy(CSS/XPath, no LLM),LLMExtractionStrategy(provider-configurable),LLMTableExtraction(chunks large tables for LLM parsing). - Fit Markdown is produced by configuring a content filter —
PruningContentFilterorBM25ContentFilter. Chunking is topic / regex / sentence based, with cosine-similarity semantic selection. - Recent releases: v0.8.9 SSRF/Docker-API security patches; v0.8.5 anti-bot with proxy escalation + Shadow-DOM flattening; v0.8.0 deep-crawl crash recovery (
resume_state/on_state_change) +prefetch=True(5-10× faster URL discovery). --shm-size=1gmatters: Chromium needs shared memory and crashes under concurrency without it.
Try It
- Smoke test in 60 seconds:
pip install -U crawl4ai && crawl4ai-setup, then run theAsyncWebCrawlersnippet against a content-heavy URL. Add aPruningContentFilterand re-run to watch the Fit Markdown variant shrink the token count. - Wire it into Claude Code as a tool: run the Docker server and point an MCP client at the
:11235MCP endpoint — crawl-to-markdown becomes a callable tool instead of a bespoke script. - Skip the LLM where you can: for a known site, write a
JsonCssExtractionStrategyschema (CSS/XPath) to pull structured fields deterministically at zero token cost; fall back toLLMExtractionStrategyonly for unstructured pages. - Benchmark it against your hosted scraper: crawl the same 20 URLs through crawl4ai and your current TinyFish/Firecrawl path; compare token counts (Fit Markdown vs raw) and per-page cost before committing.
Related
- TinyFish — Web Infra APIs for AI Agents — managed Search/Fetch/Browser; crawl4ai is the self-hosted OSS counterpart (TinyFish’s launch names Firecrawl as a competitor).
- Browserbase Autobrowse — managed browser harness whose “probe with
fetchfirst, escalate only if dynamic/gated” lesson maps directly onto crawl4ai’s fetch-first crawl. - CloakBrowser — Playwright-replacement focused on anti-bot fingerprint patches; crawl4ai covers the same stealth surface (proxy escalation, Shadow-DOM flattening) inside a full crawler.
- Shopify Review Scraper — free, self-hosted, single-platform-deep Playwright scraper; same local-OSS posture, narrower scope.
- ScrapeCreators — paid social-platform-deep scraping API; the managed counterpart for sites a generic crawler can’t easily reach.
- The Agent-Readable Web — crawl4ai is the client-side inverse: it makes the human web agent-readable without the site re-plumbing itself.
- Cowork + Apify Scraping — per-actor marketplace alternative to a self-hosted crawler.
Open Questions
- Cloud API pricing and GA date are unannounced (closed beta; “apply for early access”).
- Anti-bot posture head-to-head with dedicated stealth tools like CloakBrowser’s source-level Chromium patches is untested — crawl4ai advertises proxy escalation + Shadow-DOM flattening but publishes no bot-detection-suite pass rate. ^[inferred]
- Dual-use caveat: the anti-bot/proxy features carry the same per-target ToS and ethics considerations flagged for CloakBrowser; respect robots/ToS and crawl only what you’re permitted to. ^[inferred]