Source: Ollama + Claude Code = 99% Cheaper (YouTube video, https://youtu.be/sboNwYmH3AY)

Claude Code is the harness; the model is the engine. By default that engine is Opus, Sonnet, or Haiku via Anthropic’s API — which is what generates token charges and rate limits. This source walks through swapping the engine for an open-weight model running either locally via Ollama or routed through OpenRouter’s free tier, keeping the Claude Code agent harness intact while collapsing per-token cost. Anthropic ToS allows it because you keep using their harness; you just stop hitting their inference endpoint.

Key Takeaways

  • Two routes to “free” Claude Code: local Ollama models on your own hardware, or OpenRouter’s free model tier in the cloud. Both swap the engine without changing the harness.
  • Open-weight models have caught up to mid-tier closed models. On SWE-bench Verified, top open models now beat Sonnet 3.7. Opus 4.6 still leads for high-stakes work.
  • Hardware constrains local quality. A 9B Qwen 3.5 (~6.6 GB) runs on a typical laptop but feels significantly slower than Anthropic’s API; larger models require more RAM/GPU.
  • Context windows are smaller and lie by default. Ollama models often advertise 200k context but ship with a much smaller default — explicit Modelfile overrides (e.g., 64k) are required to make Claude Code’s system prompt fit.
  • Tool-call visibility degrades on smaller models. A 9B local model “spins forever then responds”; larger cloud-hosted open models (e.g., MiniMax M 2.7 via Ollama Cloud) restore the streaming tool-call view.
  • OpenRouter’s free tier requires the four-variable override. Setting only ANTHROPIC_MODEL silently leaves Haiku/Sonnet on Anthropic’s paid endpoints for tool calls — you must override every model slot.
  • Cost framing: even when not strictly “free,” Gemma 4 31B via OpenRouter runs ~$0.14/M input versus Opus at ~$5/M, roughly 35x cheaper per token, which the source frames as the more realistic win.

Implementation

Tool/Service: Ollama (local) + Claude Code, with optional Ollama Cloud and OpenRouter as alternates.

Setup:

Local-only path (free, slow, private):

  1. Download Ollama from ollama.com for your OS.
  2. ollama pull qwen3.5:9b (or pick a size your RAM/GPU can handle).
  3. Optional but recommended: create a context-extended variant via a Modelfile so Claude Code’s system prompt fits (default Ollama context is often <8k even when the model card advertises 200k). A sketch follows this list.
  4. Launch Claude Code through Ollama’s helper: ollama launch claude then pick the local model.
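
A minimal sketch of step 3, assuming the qwen3.5:9b tag from step 2 (FROM and PARAMETER num_ctx are standard Ollama Modelfile directives; the 64k value matches the source, and the variant name is illustrative):

    # Modelfile: raise the context ceiling so Claude Code’s system prompt fits
    FROM qwen3.5:9b
    PARAMETER num_ctx 64000

Then build and select the variant:

    ollama create qwen3.5-9b-64k -f Modelfile
    ollama launch claude   # pick qwen3.5-9b-64k from the model list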

Cloud-hosted open model via Ollama Cloud (faster, paid above free tier):

  1. From Ollama, use the same ollama launch claude flow but pick a cloud-only model (e.g., MiniMax M 2.7) — sign in at ollama.com to authorize.
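
A sketch of that flow for a machine not yet signed in (ollama signin is Ollama’s account-auth command; the model choice mirrors the source):

    # Authorize this machine against ollama.com, then launch with a cloud model
    ollama signin
    ollama launch claude   # select a cloud-hosted model, e.g. MiniMax M 2.7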

OpenRouter path (cheapest cloud, free tier with rate limits):

  1. Create an OpenRouter account at openrouter.ai. Add ~$10 to lift the free-model rate limit from 50/day to 1,000/day.
  2. Create an API key.
  3. Edit .claude/settings.local.json and override the API base + every model slot:
    {
      "env": {
        "ANTHROPIC_BASE_URL": "https://openrouter.ai/api/v1",
        "ANTHROPIC_AUTH_TOKEN": "<openrouter-key>",
        "ANTHROPIC_API_KEY": "",
        "ANTHROPIC_MODEL": "qwen/qwen3.5-coder:free",
        "ANTHROPIC_SMALL_FAST_MODEL": "qwen/qwen3.5-coder:free",
        "ANTHROPIC_DEFAULT_HAIKU_MODEL": "qwen/qwen3.5-coder:free",
        "ANTHROPIC_DEFAULT_SONNET_MODEL": "qwen/qwen3.5-coder:free"
      }
    }
  4. Launch claude; the header should read “OpenRouter free, API billing usage.” A quick key sanity check is sketched below.
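
One hedged way to verify the key before a full session: send a minimal request to OpenRouter’s OpenAI-compatible completions endpoint (the model slug matches the config above; substitute your real key):

    curl -s https://openrouter.ai/api/v1/chat/completions \
      -H "Authorization: Bearer <openrouter-key>" \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen/qwen3.5-coder:free", "messages": [{"role": "user", "content": "ping"}]}'

A JSON completion back means the key and free-tier quota are live; an error here is cheaper to debug than a silent mid-session fallback to Anthropic billing.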

Cost: Local Ollama: $0/token. OpenRouter: $0/token within the free tier’s 50 requests/day (no credits) or 1,000 requests/day after the ~$10 credit purchase; paid open models run ~$0.14/M input versus ~$5/M for Opus. Anthropic itself requires a $5 credit purchase to activate a raw API key, but that credit is never consumed if every model slot routes elsewhere.

Integration notes: Anthropic ToS explicitly allows this — you’re using the agent harness, just pointing it at a different inference endpoint. Tool calling is the main fragility: smaller open models may not match Claude’s expected JSON tool-call protocol, and web-search tools that ship as native Claude capabilities (not MCP) often fail on swapped engines — fall back to Brave/Tavily/Perplexity MCP servers explicitly.
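
If native web search breaks after the swap, one way to wire the fallback (claude mcp add is Claude Code’s MCP registration command; the Brave server package name is the reference implementation and may have moved, so treat it as an assumption):

    # Register a Brave Search MCP server so web search survives the engine swap
    claude mcp add brave-search \
      -e BRAVE_API_KEY=<your-brave-key> \
      -- npx -y @modelcontextprotocol/server-brave-search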

Try It

  • Pick a tier: local for privacy + zero ongoing cost, OpenRouter for speed at near-zero cost.
  • For local, ollama pull a model sized to your RAM (rule of thumb from the source: ask Claude Code itself “here are my specs, which model size fits?”).
  • If context errors appear, build a Modelfile variant with PARAMETER num_ctx 64000 (or larger) and re-launch — Ollama’s default context is the silent failure point.
  • Route low-stakes operations to the cheap engine: file reads, grep-style searches, scaffolding, classification, triage. Keep Opus for high-stakes architectural decisions and security-sensitive code; one way to split sessions is sketched after this list.
  • When using OpenRouter, double-check your usage logs after the first session — if Haiku charges appear, you missed one of the four model env vars.
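
A minimal session-splitting sketch, leaning on the fact that .claude/settings.local.json is project-scoped (the directory names are hypothetical; --model and its opus alias are Claude Code CLI features):

    # Cheap session: this project’s .claude/settings.local.json routes every
    # model slot to OpenRouter, so the whole run is near-free
    cd ~/work/scaffold-tasks && claude

    # High-stakes session: a project without the override falls back to
    # Anthropic’s endpoint; pin Opus explicitly
    cd ~/work/auth-service && claude --model opus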