Source: Ollama + Claude Code = 99% Cheaper (YouTube video, https://youtu.be/sboNwYmH3AY)

Claude Code is the harness; the model is the engine. By default that engine is Opus, Sonnet, or Haiku via Anthropic’s API — which is what generates token charges and rate limits. This source walks through swapping the engine for an open-weight model running either locally via Ollama or routed through OpenRouter’s free tier, keeping the Claude Code agent harness intact while collapsing per-token cost. Anthropic ToS allows it because you keep using their harness; you just stop hitting their inference endpoint.

Key Takeaways

  • Two routes to “free” Claude Code: local Ollama models on your own hardware, or OpenRouter’s free model tier in the cloud. Both swap the engine without changing the harness.
  • Open-weight models have caught up to mid-tier closed models. On SWE-bench Verified, top open models now beat Sonnet 3.7. Opus 4.6 still leads for high-stakes work.
  • Hardware constrains local quality. A 9B Qwen 3.5 (~6.6 GB) runs on a typical laptop but feels significantly slower than Anthropic’s API; larger models require more RAM/GPU.
  • Context windows are smaller and lie by default. Ollama models often advertise 200k context but ship with a much smaller default — explicit Modelfile overrides (e.g., 64k) are required to make Claude Code’s system prompt fit.
  • Tool-call visibility degrades on smaller models. A 9B local model “spins forever then responds”; larger cloud-hosted open models (e.g., MiniMax M 2.7 via Ollama Cloud) restore the streaming tool-call view.
  • OpenRouter’s free tier requires the four-variable override. Setting only ANTHROPIC_MODEL silently leaves Haiku/Sonnet on Anthropic’s paid endpoints for tool calls — you must override every model slot.
  • Cost framing: even when not strictly “free,” Gemma 4 31B via OpenRouter runs ~$0.14/M input versus Opus at ~$5/M, roughly 35x cheaper per token, which the source frames as the more realistic win.

Implementation

Tool/Service: Ollama (local) + Claude Code, with optional Ollama Cloud and OpenRouter as alternates.

Setup:

Local-only path (free, slow, private):

  1. Download Ollama from ollama.com for your OS.
  2. ollama pull qwen3.5:9b (or pick a size your RAM/GPU can handle).
  3. Optional but recommended: create a context-extended variant via a Modelfile so Claude Code’s system prompt fits (default Ollama context is often <8k even when the model card advertises 200k). A sketch follows this list.
  4. Launch Claude Code through Ollama’s helper: ollama launch claude then pick the local model.
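
A minimal sketch of step 3, assuming the qwen3.5:9b tag from step 2 (FROM and PARAMETER num_ctx are standard Ollama Modelfile directives; the 64k value matches the source, and the variant name is illustrative):

    # Modelfile: raise the context ceiling so Claude Code’s system prompt fits
    FROM qwen3.5:9b
    PARAMETER num_ctx 64000

Then build and select the variant:

    ollama create qwen3.5-9b-64k -f Modelfile
    ollama launch claude   # pick qwen3.5-9b-64k from the model list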

Cloud-hosted open model via Ollama Cloud (faster, paid above free tier):

  1. From Ollama, use the same ollama launch claude flow but pick a cloud-only model (e.g., MiniMax M 2.7) — sign in at ollama.com to authorize.
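
A sketch of that flow for a machine not yet signed in (ollama signin is Ollama’s account-auth command; the model choice mirrors the source):

    # Authorize this machine against ollama.com, then launch with a cloud model
    ollama signin
    ollama launch claude   # select a cloud-hosted model, e.g. MiniMax M 2.7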

OpenRouter path (cheapest cloud, free tier with rate limits):

  1. Create an OpenRouter account at openrouter.ai. Add ~$10 to lift the free-model rate limit from 50/day to 1,000/day.
  2. Create an API key.
  3. Edit .claude/settings.local.json and override the API base + every model slot:
    {
      "env": {
        "ANTHROPIC_BASE_URL": "https://openrouter.ai/api/v1",
        "ANTHROPIC_AUTH_TOKEN": "<openrouter-key>",
        "ANTHROPIC_API_KEY": "",
        "ANTHROPIC_MODEL": "qwen/qwen3.5-coder:free",
        "ANTHROPIC_SMALL_FAST_MODEL": "qwen/qwen3.5-coder:free",
        "ANTHROPIC_DEFAULT_HAIKU_MODEL": "qwen/qwen3.5-coder:free",
        "ANTHROPIC_DEFAULT_SONNET_MODEL": "qwen/qwen3.5-coder:free"
      }
    }
  4. Launch claude; the header should read “OpenRouter free, API billing usage.” A quick key sanity check is sketched below.
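
One hedged way to verify the key before a full session: send a minimal request to OpenRouter’s OpenAI-compatible completions endpoint (the model slug matches the config above; substitute your real key):

    curl -s https://openrouter.ai/api/v1/chat/completions \
      -H "Authorization: Bearer <openrouter-key>" \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen/qwen3.5-coder:free", "messages": [{"role": "user", "content": "ping"}]}'

A JSON completion back means the key and free-tier quota are live; an error here is cheaper to debug than a silent mid-session fallback to Anthropic billing.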

Cost: Local Ollama: $0/token. OpenRouter: $0/token within the free tier’s 50 requests/day (no credits) or 1,000 requests/day after the ~$10 credit purchase; paid open models run ~$0.14/M input versus ~$5/M for Opus. Anthropic itself requires a $5 credit purchase to activate a raw API key, but that credit is never consumed if every model slot routes elsewhere.

Integration notes: Anthropic ToS explicitly allows this — you’re using the agent harness, just pointing it at a different inference endpoint. Tool calling is the main fragility: smaller open models may not match Claude’s expected JSON tool-call protocol, and web-search tools that ship as native Claude capabilities (not MCP) often fail on swapped engines — fall back to Brave/Tavily/Perplexity MCP servers explicitly.
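
If native web search breaks after the swap, one way to wire the fallback (claude mcp add is Claude Code’s MCP registration command; the Brave server package name is the reference implementation and may have moved, so treat it as an assumption):

    # Register a Brave Search MCP server so web search survives the engine swap
    claude mcp add brave-search \
      -e BRAVE_API_KEY=<your-brave-key> \
      -- npx -y @modelcontextprotocol/server-brave-search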

Try It

  • Pick a tier: local for privacy + zero ongoing cost, OpenRouter for speed at near-zero cost.
  • For local, ollama pull a model sized to your RAM (rule of thumb from the source: ask Claude Code itself “here are my specs, which model size fits?”).
  • If context errors appear, build a Modelfile variant with PARAMETER num_ctx 64000 (or larger) and re-launch — Ollama’s default context is the silent failure point.
  • Route low-stakes operations to the cheap engine: file reads, grep-style searches, scaffolding, classification, triage. Keep Opus for high-stakes architectural decisions and security-sensitive code; one way to split sessions is sketched after this list.
  • When using OpenRouter, double-check your usage logs after the first session — if Haiku charges appear, you missed one of the four model env vars.
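
A minimal session-splitting sketch, leaning on the fact that .claude/settings.local.json is project-scoped (the directory names are hypothetical; --model and its opus alias are Claude Code CLI features):

    # Cheap session: this project’s .claude/settings.local.json routes every
    # model slot to OpenRouter, so the whole run is near-free
    cd ~/work/scaffold-tasks && claude

    # High-stakes session: a project without the override falls back to
    # Anthropic’s endpoint; pin Opus explicitly
    cd ~/work/auth-service && claude --model opus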