Source: Ollama + Claude Code = 99% Cheaper (YouTube video, https://youtu.be/sboNwYmH3AY)
Claude Code is the harness; the model is the engine. By default that engine is Opus, Sonnet, or Haiku via Anthropic’s API — which is what generates token charges and rate limits. This source walks through swapping the engine for an open-weight model running either locally via Ollama or routed through OpenRouter’s free tier, keeping the Claude Code agent harness intact while collapsing per-token cost. Anthropic ToS allows it because you keep using their harness; you just stop hitting their inference endpoint.
Key Takeaways
- Two routes to “free” Claude Code: local Ollama models on your own hardware, or OpenRouter’s free model tier in the cloud. Both swap the engine without changing the harness.
- Open-weight models have caught up to mid-tier closed models. On SWE-bench Verified, top open models now beat Sonnet 3.7. Opus 4.6 still leads for high-stakes work.
- Hardware constrains local quality. A 9B Qwen 3.5 (~6.6 GB) runs on a typical laptop but feels significantly slower than Anthropic’s API; larger models require more RAM/GPU.
- Context windows are smaller and lie by default. Ollama models often advertise 200k context but ship with a much smaller default — explicit `Modelfile` overrides (e.g., 64k) are required to make Claude Code’s system prompt fit.
- Tool-call visibility degrades on smaller models. A 9B local model “spins forever then responds”; larger cloud-hosted open models (e.g., MiniMax M 2.7 via Ollama Cloud) restore the streaming tool-call view.
- OpenRouter’s free tier requires the four-variable override. Setting only `ANTHROPIC_MODEL` silently leaves Haiku/Sonnet on Anthropic’s paid endpoints for tool calls — you must override every model slot.
- Cost framing: even when not strictly “free,” Gemma 4 31B via OpenRouter is ~$5/M — roughly 35x cheaper per token, which the source frames as the more realistic win.
Implementation
Tool/Service: Ollama (local) + Claude Code, with optional Ollama Cloud and OpenRouter as alternates.
Setup:
Local-only path (free, slow, private):
- Download Ollama from `ollama.com` for your OS.
- `ollama pull qwen3.5:9b` (or pick a size your RAM/GPU can handle).
- Optional but recommended — create a context-extended variant via a `Modelfile` so Claude Code’s system prompt fits (default Ollama context is often <8k even when the model card advertises 200k).
- Launch Claude Code through Ollama’s helper: `ollama launch claude`, then pick the local model.
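The context-extension step can be sketched from the shell. This is a sketch, not the source’s exact commands: the model tag follows the source’s example (`qwen3.5:9b`), the variant name `qwen3.5-9b-64k` is a placeholder of mine, and `num_ctx 64000` is the override size the source suggests.

```shell
# Write a Modelfile that raises the context window, then build a named
# variant from it. FROM and PARAMETER num_ctx are standard Ollama
# Modelfile directives; substitute the tag you actually pulled.
cat > Modelfile <<'EOF'
FROM qwen3.5:9b
PARAMETER num_ctx 64000
EOF

# Building and using the variant requires a local Ollama install
# (variant name is a placeholder):
#   ollama create qwen3.5-9b-64k -f Modelfile
#   ollama launch claude        # then pick qwen3.5-9b-64k
echo "Modelfile written: $(grep num_ctx Modelfile)"
```

If Claude Code still reports context errors, raise `num_ctx` further and rebuild — the variant, not the base model, is what you select at launch.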
Cloud-hosted open model via Ollama Cloud (faster, paid above free tier):
- From Ollama, use the same `ollama launch claude` flow but pick a cloud-only model (e.g., MiniMax M 2.7) — sign in at `ollama.com` to authorize.
OpenRouter path (cheapest cloud, free tier with rate limits):
- Create an OpenRouter account at `openrouter.ai`. Add ~$10 to lift the free-model rate limit from 50/day to 1,000/day.
- Create an API key.
- Edit `.claude/settings.local.json` and override the API base + every model slot:

```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://openrouter.ai/api/v1",
    "ANTHROPIC_AUTH_TOKEN": "<openrouter-key>",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_MODEL": "qwen/qwen3.5-coder:free",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen/qwen3.5-coder:free",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "qwen/qwen3.5-coder:free",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "qwen/qwen3.5-coder:free"
  }
}
```

- Launch `claude` — the header should read “OpenRouter free, API billing usage.”
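Because one missed variable silently routes tool calls to Anthropic’s paid endpoints, a pre-launch check is cheap. A sketch, assuming the settings file lives at `.claude/settings.local.json`; the placeholder-file line exists only so the snippet runs standalone:

```shell
# Verify that all four model slots appear in the override file; any slot
# left unset falls back to Anthropic's paid Haiku/Sonnet endpoints.
SETTINGS=.claude/settings.local.json
mkdir -p .claude
[ -f "$SETTINGS" ] || echo '{"env":{}}' > "$SETTINGS"  # placeholder if absent

for var in ANTHROPIC_MODEL ANTHROPIC_SMALL_FAST_MODEL \
           ANTHROPIC_DEFAULT_HAIKU_MODEL ANTHROPIC_DEFAULT_SONNET_MODEL; do
  if grep -q "\"$var\"" "$SETTINGS"; then
    echo "ok:      $var"
  else
    echo "MISSING: $var"
  fi
done
```

Any `MISSING:` line means Haiku/Sonnet tool calls will still bill Anthropic.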
Cost: Local Ollama: $0/token (hardware and electricity only). OpenRouter free models: $0/token within 50 requests/day with no credits, or 1,000/day after a ~$10 credit purchase; paid open models run roughly $0.14/M input up to ~$5/M. The credit purchase activates the API key, but the credit itself is never consumed if every model slot routes to a free model.
Integration notes: Anthropic ToS explicitly allows this — you’re using the agent harness, just pointing it at a different inference endpoint. Tool calling is the main fragility: smaller open models may not match Claude’s expected JSON tool-call protocol, and web-search tools that ship as native Claude capabilities (not MCP) often fail on swapped engines — fall back to Brave/Tavily/Perplexity MCP servers explicitly.
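For the web-search fallback, Claude Code reads MCP server definitions from a project-level `.mcp.json`. A sketch using the Brave Search server as one of the suggested fallbacks: the package name and `BRAVE_API_KEY` variable follow the reference MCP servers repo (confirm against current docs), and the key value is a placeholder.

```shell
# Register a Brave Search MCP server so a swapped engine keeps web search
# even when Claude's native (non-MCP) search tools fail.
cat > .mcp.json <<'EOF'
{
  "mcpServers": {
    "brave-search": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-brave-search"],
      "env": { "BRAVE_API_KEY": "<brave-key>" }
    }
  }
}
EOF
```

Tavily and Perplexity publish equivalent MCP servers; swap the package and API-key variable accordingly.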
Try It
- Pick a tier: local for privacy + zero ongoing cost, OpenRouter for speed at near-zero cost.
- For local, `ollama pull` a model sized to your RAM (rule of thumb from the source: ask Claude Code itself “here are my specs, which model size fits?”).
- If context errors appear, build a `Modelfile` variant with `PARAMETER num_ctx 64000` (or larger) and re-launch — Ollama’s default context is the silent failure point.
- Route low-stakes operations to the cheap engine: file reads, grep-style searches, scaffolding, classification, triage. Keep Opus for high-stakes architectural decisions and security-sensitive code.
- When using OpenRouter, double-check your usage logs after the first session — if Haiku charges appear, you missed one of the four model env vars.