Source: ai-research/prompt-caching-anthropic-docs-2026-04-27.md, ai-research/prompt-caching-anthropic-blog-announcement-2026-04-27.md, ai-research/prompt-caching-anthropic-pricing-2026-04-27.md, ai-research/prompt-caching-anthropic-cookbook-notebook-2026-04-27.md
Agencies pay the same-prefix tax over and over: the same brand voice doc, the same skill bundle, the same CLAUDE.md, the same do-not-say list, fed to Claude at full price on every prompt. Prompt caching lets you pay full price once, then read the same prefix back for ten cents on the dollar. If you are running real client work (Smile Springs Family Dental's blog calendar, fifty FLUQs scoring runs, a multi-agent pipeline), caching is not an optimization. It is the difference between an API bill that scales linearly with prompt count and one that scales with cleverness.
Key Takeaways
- Cache writes cost 1.25x base input (5-minute TTL) or 2x (1-hour). Cache reads cost 0.1x — a 90% discount on every repeated token.
- A 5-min cache pays for itself after one hit. A 1-hour cache pays for itself after two hits.
- Sonnet 4.6 cache reads cost $0.30/MTok against a $3.00/MTok base input rate. Opus 4.7 cache reads cost $0.50/MTok against a $5.00 base.
- Minimum cacheable prefix is 2,048 tokens for Sonnet 4.6 and 4,096 for Opus 4.7. Below that, caching is silently skipped.
- Anthropic’s cookbook shows a 187K-token prefix going from 4.89s baseline to 1.48s on a hit — 3.3x faster.
- Caching is destroyed by anything that mutates the cached prefix: timestamps, dynamic ordering, mid-prefix file inserts.
How Prompt Caching Works
You mark part of a prompt with `cache_control: {"type": "ephemeral"}`. The first request processes the prefix, charges you 1.25x the base input rate to write it to cache, and serves the response. The next request that starts with the same prefix, arriving within five minutes, reads the prefix from cache at 0.1x base input. The output is identical to a non-cached call; the savings are entirely on the input side.
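A minimal sketch with the Anthropic Python SDK. The model id, file path, and prompt text are placeholders; the usage fields are the ones the Messages API actually returns:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical static prefix: brand voice doc + skill bundle + do-not-say list.
SYSTEM_PREFIX = open("brand_voice_and_skills.md").read()

def ask(question: str):
    return client.messages.create(
        model="claude-sonnet-4-6",  # placeholder id for the Sonnet 4.6 tier
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PREFIX,
                # Everything up to and including this block is the cacheable prefix.
                "cache_control": {"type": "ephemeral"},  # 5-minute TTL by default
            }
        ],
        messages=[{"role": "user", "content": question}],
    )

first = ask("Draft a May blog topic list.")
second = ask("Draft a June blog topic list.")  # same prefix, different user turn

# Call one pays the 1.25x write; call two reads the prefix back at 0.1x.
print(first.usage.cache_creation_input_tokens)  # ~prefix length on call one
print(second.usage.cache_read_input_tokens)     # ~prefix length on call two
```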
Two TTLs: 5 minutes (default, 1.25x write) and 1 hour (beta, 2x write). Every cache hit refreshes the TTL, so a busy session keeps the cache warm essentially for free. Up to four explicit breakpoints per request, or use automatic caching — one cache_control field at the top level — and Claude moves the breakpoint forward as conversations grow.
A cache miss is the expensive default. Hits require the prefix to be byte-identical to a previously cached entry. Change one token, even an invisible one, and you eat a fresh cache write plus the full input rate for everything after. The system looks back up to 20 blocks from your breakpoint to find earlier matches; past 20 blocks it stops checking, so a longer prefix needs another breakpoint to stay findable.
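Breakpoints stack down the tools → system → messages hierarchy, and each one caches the full prefix above it. A sketch with two explicit breakpoints; the model id, tool definition, and prompt contents are illustrative:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model id
    max_tokens=1024,
    tools=[
        {
            "name": "keyword_lookup",  # illustrative tool
            "description": "Look up dental SEO keyword volume.",
            "input_schema": {
                "type": "object",
                "properties": {"term": {"type": "string"}},
                "required": ["term"],
            },
            # Breakpoint 1: caches every tool definition up to here.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    system=[
        {"type": "text", "text": "<brand voice doc + skill bundle + do-not-say list>"},
        {
            "type": "text",
            "text": "<persona profile>",
            # Breakpoint 2: caches the tools plus all system blocks above.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Outline this week's posts."}],
)
print(response.usage)
```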
The Math for Agencies
Smile Springs Family Dental scenario: WEO Marketly runs a content calendar workflow on Sonnet 4.6. The system prompt — brand voice doc, do-not-say list, dental SEO skill bundle, persona profile — is 10,000 tokens. The team runs 50 prompts in a one-hour planning session.
Without caching, prefix cost only: 50 prompts × 10,000 tokens × $3.00/MTok = $1.50
With 5-min caching (refreshed by hits):
- 1 cache write: 10,000 tokens × $3.75/MTok (1.25x) = $0.0375
- 49 cache reads: 49 × 10,000 tokens × $0.30/MTok (0.1x) = $0.147
- Total: $0.1845
That is 87.7% off the prefix cost, about $1.32 saved per session, before output and per-prompt input. On Opus 4.7 ($5.00 base) the same session costs $0.31 instead of $2.50, a $2.19 saving per session; caching matters more the more expensive the model.
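The arithmetic generalizes to any prefix size, session length, and model rate. A back-of-envelope helper using the 1.25x write and 0.1x read multipliers from above; the Sonnet call reproduces the session math:

```python
def prefix_cost(prefix_tokens: int, prompts: int, base_per_mtok: float,
                write_mult: float = 1.25, read_mult: float = 0.10):
    """Prefix-only cost for one warm session: (uncached, cached) in dollars."""
    mtok = prefix_tokens / 1_000_000
    uncached = prompts * mtok * base_per_mtok
    # One write at write_mult, then every remaining prompt reads at read_mult.
    cached = mtok * base_per_mtok * (write_mult + (prompts - 1) * read_mult)
    return uncached, cached

uncached, cached = prefix_cost(10_000, 50, base_per_mtok=3.00)
print(f"${uncached:.4f} uncached vs ${cached:.4f} cached")  # $1.5000 vs $0.1845
```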
When Caching Backfires
- Timestamps inside the cached block. “Today is 2026-04-27” injected into the system prompt makes the prefix unique every day, so caching never hits (see the sketch after this list).
- Dynamic file uploads before the cached content. Cache is positional. If you put a per-request PDF before the static skill bundle, the bundle never caches.
- Reordering tools or system blocks. Tool definitions sit at the top of the cache hierarchy — touch them and everything after invalidates.
- Micro-edits to the prefix. Fixing a typo in the system prompt costs you the entire warm cache across every session running that prompt.
- Prefix below the minimum. A 1,500-token system prompt on Sonnet 4.6 is too short to cache (2,048-token floor). No error, no warning — just silent full-price billing.
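The timestamp failure mode in code. The only difference between the busted and cacheable versions is where the date lives; model id and prompt contents are placeholders:

```python
import datetime

import anthropic

client = anthropic.Anthropic()
today = datetime.date.today().isoformat()

# Busted: the date makes the prefix unique every day, so it never matches
# yesterday's cache entry and every call pays the 1.25x write.
busted_system = [{
    "type": "text",
    "text": f"Today is {today}. <brand voice doc + skill bundle>",
    "cache_control": {"type": "ephemeral"},
}]

# Cacheable: the prefix stays byte-identical; the date rides in the
# user turn, after the breakpoint.
cacheable_system = [{
    "type": "text",
    "text": "<brand voice doc + skill bundle>",
    "cache_control": {"type": "ephemeral"},
}]

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model id
    max_tokens=1024,
    system=cacheable_system,
    messages=[{"role": "user", "content": f"Today is {today}. Draft this week's posts."}],
)
print(response.usage.cache_read_input_tokens)  # nonzero once the cache is warm
```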
Agency Patterns That Win
- Stack the static stuff at the front. Brand voice, skill bundles, persona, examples: all before any per-request content. Put your `cache_control` on the last unchanging block.
- Use the 1-hour TTL for batch runs. A blog generation job hitting Claude 30+ times in 90 minutes, with calls spaced more than five minutes apart, pays the 2x write once and 0.1x for the rest. Cheaper than re-warming with 18 separate 5-minute writes.
- Move dynamic data into the user message, not the system prompt. The client brief, today's date, the specific URL: all of those go after the cached prefix.
- Watch `cache_read_input_tokens` in the response usage. If it is zero on call two, your cache is missing; diagnose before you burn through a billing cycle paying full input.
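A guardrail sketch for that last pattern. The helper and its fail-fast behavior are assumptions, not SDK features; `cache_read_input_tokens` is the real usage field:

```python
def assert_cache_hit(response, expected_prefix_tokens: int) -> None:
    """Fail fast when a call that should be warm comes back cold."""
    read = response.usage.cache_read_input_tokens
    if read == 0:
        # Zero reads on call two means the prefix mutated, the TTL expired,
        # or the prefix is below the model's minimum cacheable length.
        raise RuntimeError("cache miss; diagnose before the next billing cycle")
    # A healthy hit reads back roughly the full cached prefix.
    print(f"cache hit: {read} of ~{expected_prefix_tokens} prefix tokens")
```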
Related
- Cost & Intelligence Levers
- Extended Thinking
- Opus 4.7 Best Practices
- Claude Prompting Best Practices
- Prompt Engineering
- Cross-Topic Connections
Try It
- Audit one production prompt. Count tokens in the static prefix (skill bundle, system, examples). If it’s over 2,048 on Sonnet or 4,096 on Opus, you have a caching candidate.
- Add `cache_control: {"type": "ephemeral"}` to the last static block. Run the prompt twice within 5 minutes and diff `cache_read_input_tokens` between call one and call two; on call two it should land near your prefix token count.
- For any workflow that runs more than 5 prompts in an hour, switch to `ttl: "1h"`. The 2x write pays back after the second hit and you stop re-warming caches between sessions.
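A 1-hour TTL sketch. The longer TTL is beta-gated; the header value below is the 2025-era flag and may have changed, so treat it and the model id as assumptions:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model id
    max_tokens=1024,
    # Beta header required for the extended TTL at the time of writing.
    extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
    system=[{
        "type": "text",
        "text": "<skill bundle + brand voice prefix>",
        # 2x write once, then 0.1x reads for the whole batch window.
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }],
    messages=[{"role": "user", "content": "Generate blog post 1 of 30."}],
)
```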