Source: ai-research/chopratejas-headroom-readme-2026-06-13.md (the chopratejas/headroom README, fetched 2026-06-13) — discovered via Matthew Berman’s OSS-projects video (raw/You_NEED_to_try_these_open-source_AI_projects_RIGHT_NOW.md). Benchmarks below are creator-reported (confidence medium); the repo ships a reproduction harness.

Headroom (chopratejas/headroom, ~25.9K stars, Apache-2.0) is a local-first context-compression layer that sits between an AI agent and the LLM. It compresses everything the agent reads — tool outputs, logs, RAG chunks, files, and conversation history — before it reaches the model, claiming 60–95% token reduction “with the same answers.” It is the most-starred entrant in the crowded token-optimizer field this wiki tracks, and differentiates on three axes: it runs locally (data stays on your machine), it is reversible (originals are cached for on-demand retrieval), and it offers cross-agent shared memory.

Key Takeaways

  • Five ways to adopt it, from zero-code to inline. A proxy (headroom proxy --port 8787, any language, no code changes); an agent wrap (headroom wrap claude|codex|cursor|aider|copilot — starts the proxy and points the tool at it); a library (compress(messages) in Python or TypeScript); an MCP server (headroom_compress / headroom_retrieve / headroom_stats); and SDK middleware for Anthropic/OpenAI, Vercel AI SDK, LiteLLM, LangChain, Agno, and Strands.
  • Content-aware compression, not blunt truncation. A ContentRouter detects content type and routes to the right compressor: SmartCrusher for JSON, CodeCompressor (AST-aware) for Python/JS/Go/Rust/Java/C++, and Kompress-base (a HuggingFace model trained on agentic traces) for prose. A CacheAligner stabilizes prompt prefixes so Anthropic/OpenAI KV caches still hit after compression — important, since naive compression breaks prompt caching.
  • Reversible by design (CCR). Compressed-out originals are cached locally; if the model needs the full text it calls headroom_retrieve. This is the safety valve that separates it from lossy context trimming — the agent can recover detail on demand within a configured TTL.
  • Cross-agent memory + failure mining. headroom wrap claude --memory gives a shared, project-scoped, user-isolated memory store across Claude/Codex/Gemini with auto-dedup. headroom learn mines failed sessions and writes corrections back into CLAUDE.md / AGENTS.md — a self-improvement loop adjacent to the AIOS “fold learnings back into the system” dimension.
  • Creator-reported numbers (verify before quoting). Workload savings: code search 17,765 → 1,408 tokens (92%), SRE incident debugging 65,694 → 5,118 (92%), GitHub issue triage 54,174 → 14,761 (73%), codebase exploration 78,502 → 41,254 (47%). Accuracy held on standard benchmarks (N=100): GSM8K ±0.000, TruthfulQA +0.030, SQuAD v2 97% @ 19% compression, BFCL tools 97% @ 32% compression. The repo has CI + codecov + a python -m headroom.evals suite --tier 1 reproduction path.^[ambiguous]
  • When to skip (per the README). If you only use a single provider’s native compaction and don’t need cross-agent memory, or you run in a sandbox where local processes can’t start.
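
The reversibility idea in the takeaways above (compress, cache the original locally, recover it on demand within a TTL) can be sketched in a few lines. This is a toy illustration, not Headroom's implementation: ReversibleStore, compress_log, the head/tail truncation used as a stand-in compressor, and the marker format are all hypothetical names and choices introduced here.

```python
import hashlib
import time


class ReversibleStore:
    """Toy cache of originals keyed by content hash, with a time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock      # injectable so expiry can be exercised in tests
        self._entries = {}      # key -> (stored_at, original_text)

    def put(self, original):
        """Cache the original text and return a short retrieval key."""
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:12]
        self._entries[key] = (self.clock(), original)
        return key

    def retrieve(self, key):
        """Return the original text, or None if unknown or past its TTL."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, original = entry
        if self.clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired: the detail is no longer recoverable
            return None
        return original


def compress_log(text, store, keep=2):
    """Stand-in compressor: keep the first and last `keep` lines of the text,
    park the full original in the store, and leave a retrieval marker."""
    lines = text.splitlines()
    if len(lines) <= 2 * keep:
        return text                 # too short to be worth compressing
    key = store.put(text)
    omitted = len(lines) - 2 * keep
    marker = f"[{omitted} lines omitted; retrieve with key {key}]"
    return "\n".join(lines[:keep] + [marker] + lines[-keep:])
```

In this sketch the model would see only the shortened text plus the marker, and a retrieval call with the key returns the full original until the TTL lapses. That expiry is the property worth testing in practice: once it passes, the compression is effectively lossy.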

Implementation

Tool/Service: Headroom (chopratejas/headroom), Apache-2.0. Packages: headroom-ai on PyPI, headroom-ai on npm; HuggingFace model chopratejas/kompress-v2-base.

Setup: pip install "headroom-ai[all]" (Python 3.10+) or npm install headroom-ai, then headroom wrap claude (or headroom proxy --port 8787). headroom perf reports the savings. Granular extras: [proxy], [mcp], [ml], [code], [memory], [evals].

Cost: Free/OSS. It runs locally, so the only running cost is local CPU/RAM for the embedder (Apple-GPU offload is available via HEADROOM_EMBEDDER_RUNTIME=pytorch_mps); the token savings are the payoff.

Integration notes: headroom wrap claude supports --memory and --code-graph; Codex shares the memory store with Claude; OpenClaw installs it as a ContextEngine plugin. Any OpenAI-compatible client works through the proxy.
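
Because the proxy is described as working with any OpenAI-compatible client, pointing a client at it should amount to changing the base URL. The sketch below builds (but does not send) such a request using only the standard library. The /v1 prefix and the /chat/completions path follow the usual OpenAI-style layout and are assumptions here, not something the README excerpt confirms; build_chat_request is a hypothetical helper name.

```python
import json
import urllib.request

# Assumption: the proxy started with `headroom proxy --port 8787` exposes an
# OpenAI-style API under /v1. Check the README for the actual base path.
PROXY_BASE = "http://localhost:8787/v1"


def build_chat_request(messages, model, api_key, base=PROXY_BASE):
    """Build (but do not send) an OpenAI-style chat completion request
    addressed to the local proxy instead of the provider."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

With the proxy running, passing the returned object to urllib.request.urlopen would send it. An existing SDK client would typically need only its base URL option changed to the same address.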

Try It

  • Wrap Claude Code for one session and measure. Run headroom wrap claude, then headroom perf, and compare token consumption against an unwrapped session on the same task. The honest test is whether answer quality holds, not just the token delta.
  • Stress the reversibility. Run a task where the agent genuinely needs a detail that got compressed away, and confirm headroom_retrieve recovers it. If it can’t, the compression is lossier than advertised for your workload.
  • Compare against native compaction. Headroom is one of 12+ token optimizers catalogued in Claude Code Token Optimization; benchmark it against Claude Code’s own auto-compaction before adding a moving part. The backdrop is the field-test debate in the Fable 5 article (“token-hungry is contested — it one-shots more often”): compression matters most for long agentic loops, less for one-shots.
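
For the first experiment above, the token delta is plain arithmetic: reduction = (before − after) / before. The sketch below applies it to the creator-reported workload figures quoted in Key Takeaways as a sanity check; swap in token counts from your own wrapped and unwrapped sessions. reduction_pct and WORKLOADS are illustrative names, not part of Headroom.

```python
def reduction_pct(tokens_before, tokens_after):
    """Percentage of tokens removed relative to the uncompressed count."""
    if tokens_before <= 0:
        raise ValueError("tokens_before must be positive")
    return 100.0 * (tokens_before - tokens_after) / tokens_before


# Creator-reported (before, after) token counts from the README summary.
WORKLOADS = {
    "code search": (17_765, 1_408),
    "SRE incident debugging": (65_694, 5_118),
    "GitHub issue triage": (54_174, 14_761),
    "codebase exploration": (78_502, 41_254),
}

for name, (before, after) in WORKLOADS.items():
    print(f"{name}: {reduction_pct(before, after):.0f}% fewer tokens")
```

The rounded results match the quoted 92%, 92%, 73%, and 47%. The last of these sits below the headline 60–95% range, which is a reminder to measure on your own workload rather than rely on the summary claim.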