Source: ai-research/prompt-caching-anthropic-docs-2026-04-27.md, ai-research/prompt-caching-anthropic-blog-announcement-2026-04-27.md, ai-research/prompt-caching-anthropic-pricing-2026-04-27.md, ai-research/prompt-caching-anthropic-cookbook-notebook-2026-04-27.md
Agencies pay the same-prefix tax over and over: the same brand voice doc, the same skill bundle, the same CLAUDE.md, the same do-not-say list, fed to Claude at full price on every prompt. Prompt caching lets you pay full price once, then read the same prefix back for ten cents on the dollar. If you are running real client work (Smile Springs Family Dental's blog calendar, fifty FLUQs scoring runs, a multi-agent pipeline), caching is not an optimization. It is the difference between an API bill that scales linearly with prompt count and one that scales with cleverness.
Key Takeaways
- Cache writes cost 1.25x base input (5-minute TTL) or 2x (1-hour). Cache reads cost 0.1x — a 90% discount on every repeated token.
- A 5-min cache pays for itself after one hit. A 1-hour cache pays for itself after two hits.
- Sonnet 4.6 cache reads cost $0.30/MTok against a $3.00/MTok base input rate. Opus 4.7 cache reads cost $0.50/MTok against a $5.00 base.
- Minimum cacheable prefix is 2,048 tokens for Sonnet 4.6 and 4,096 for Opus 4.7. Below that, caching is silently skipped.
- Anthropic’s cookbook shows a 187K-token prefix going from 4.89s baseline to 1.48s on a hit — 3.3x faster.
- Caching is destroyed by anything that mutates the cached prefix: timestamps, dynamic ordering, mid-prefix file inserts.
How Prompt Caching Works
You mark part of a prompt with `cache_control: {"type": "ephemeral"}`. The first request processes the prefix, charges you 1.25x the base input rate to write it to cache, and serves the response. The next request that starts with the same prefix, arriving within five minutes, reads the prefix from cache at 0.1x base input. The output is identical to a non-cached call; the savings are entirely on the input side.
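A minimal sketch with the Anthropic Python SDK. The model id, file path, and prompt text are placeholders; the usage fields are the ones the Messages API actually returns:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical static prefix: brand voice doc + skill bundle + do-not-say list.
SYSTEM_PREFIX = open("brand_voice_and_skills.md").read()

def ask(question: str):
    return client.messages.create(
        model="claude-sonnet-4-6",  # placeholder id for the Sonnet 4.6 tier
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PREFIX,
                # Everything up to and including this block is the cacheable prefix.
                "cache_control": {"type": "ephemeral"},  # 5-minute TTL by default
            }
        ],
        messages=[{"role": "user", "content": question}],
    )

first = ask("Draft a May blog topic list.")
second = ask("Draft a June blog topic list.")  # same prefix, different user turn

# Call one pays the 1.25x write; call two reads the prefix back at 0.1x.
print(first.usage.cache_creation_input_tokens)  # ~prefix length on call one
print(second.usage.cache_read_input_tokens)     # ~prefix length on call two
```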
Two TTLs: 5 minutes (default, 1.25x write) and 1 hour (beta, 2x write). Every cache hit refreshes the TTL, so a busy session keeps the cache warm essentially for free. Up to four explicit breakpoints per request, or use automatic caching — one cache_control field at the top level — and Claude moves the breakpoint forward as conversations grow.
A cache miss is the expensive default. Hits require the prefix to be byte-identical to a previously cached entry. Change one token, even an invisible one, and you eat a fresh cache write plus the full input rate for everything after. The system looks back up to 20 blocks from your breakpoint to find earlier matches; past 20 blocks it stops checking, so a longer prefix needs another breakpoint to stay findable.
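Breakpoints stack down the tools → system → messages hierarchy, and each one caches the full prefix above it. A sketch with two explicit breakpoints; the model id, tool definition, and prompt contents are illustrative:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model id
    max_tokens=1024,
    tools=[
        {
            "name": "keyword_lookup",  # illustrative tool
            "description": "Look up dental SEO keyword volume.",
            "input_schema": {
                "type": "object",
                "properties": {"term": {"type": "string"}},
                "required": ["term"],
            },
            # Breakpoint 1: caches every tool definition up to here.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    system=[
        {"type": "text", "text": "<brand voice doc + skill bundle + do-not-say list>"},
        {
            "type": "text",
            "text": "<persona profile>",
            # Breakpoint 2: caches the tools plus all system blocks above.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Outline this week's posts."}],
)
print(response.usage)
```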
The Math for Agencies
Smile Springs Family Dental scenario: WEO Marketly runs a content calendar workflow on Sonnet 4.6. The system prompt — brand voice doc, do-not-say list, dental SEO skill bundle, persona profile — is 10,000 tokens. The team runs 50 prompts in a one-hour planning session.
Without caching, prefix cost only: 50 prompts × 10,000 tokens × $3.00/MTok = $1.50
With 5-min caching (refreshed by hits):
- 1 cache write: 10,000 tokens × $3.75/MTok (1.25x) = $0.0375
- 49 cache reads: 49 × 10,000 tokens × $0.30/MTok (0.1x) = $0.147
- Total: $0.1845
That is 87.7% off the prefix cost, about $1.32 saved per session, before output and per-prompt input. On Opus 4.7 ($5.00 base) the same session costs $0.31 instead of $2.50, a $2.19 saving per session; caching matters more the more expensive the model.
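The arithmetic generalizes to any prefix size, session length, and model rate. A back-of-envelope helper using the 1.25x write and 0.1x read multipliers from above; the Sonnet call reproduces the session math:

```python
def prefix_cost(prefix_tokens: int, prompts: int, base_per_mtok: float,
                write_mult: float = 1.25, read_mult: float = 0.10):
    """Prefix-only cost for one warm session: (uncached, cached) in dollars."""
    mtok = prefix_tokens / 1_000_000
    uncached = prompts * mtok * base_per_mtok
    # One write at write_mult, then every remaining prompt reads at read_mult.
    cached = mtok * base_per_mtok * (write_mult + (prompts - 1) * read_mult)
    return uncached, cached

uncached, cached = prefix_cost(10_000, 50, base_per_mtok=3.00)
print(f"${uncached:.4f} uncached vs ${cached:.4f} cached")  # $1.5000 vs $0.1845
```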
When Caching Backfires
- Timestamps inside the cached block. “Today is 2026-04-27” injected into the system prompt makes the prefix unique every day, so caching never hits (see the sketch after this list).
- Dynamic file uploads before the cached content. Cache is positional. If you put a per-request PDF before the static skill bundle, the bundle never caches.
- Reordering tools or system blocks. Tool definitions sit at the top of the cache hierarchy — touch them and everything after invalidates.
- Micro-edits to the prefix. Fixing a typo in the system prompt costs you the entire warm cache across every session running that prompt.
- Prefix below the minimum. A 1,500-token system prompt on Sonnet 4.6 is too short to cache (2,048-token floor). No error, no warning — just silent full-price billing.
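The timestamp failure mode in code. The only difference between the busted and cacheable versions is where the date lives; model id and prompt contents are placeholders:

```python
import datetime

import anthropic

client = anthropic.Anthropic()
today = datetime.date.today().isoformat()

# Busted: the date makes the prefix unique every day, so it never matches
# yesterday's cache entry and every call pays the 1.25x write.
busted_system = [{
    "type": "text",
    "text": f"Today is {today}. <brand voice doc + skill bundle>",
    "cache_control": {"type": "ephemeral"},
}]

# Cacheable: the prefix stays byte-identical; the date rides in the
# user turn, after the breakpoint.
cacheable_system = [{
    "type": "text",
    "text": "<brand voice doc + skill bundle>",
    "cache_control": {"type": "ephemeral"},
}]

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model id
    max_tokens=1024,
    system=cacheable_system,
    messages=[{"role": "user", "content": f"Today is {today}. Draft this week's posts."}],
)
print(response.usage.cache_read_input_tokens)  # nonzero once the cache is warm
```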
Agency Patterns That Win
- Stack the static stuff at the front. Brand voice, skill bundles, persona, examples: all before any per-request content. Put your `cache_control` on the last unchanging block.
- Use the 1-hour TTL for batch runs. A blog generation job hitting Claude 30+ times in 90 minutes, with calls spaced more than five minutes apart, pays the 2x write once and 0.1x for the rest. Cheaper than re-warming with 18 separate 5-minute writes.
- Move dynamic data into the user message, not the system prompt. The client brief, today's date, the specific URL: all of those go after the cached prefix.
- Watch `cache_read_input_tokens` in the response usage. If it is zero on call two, your cache is missing; diagnose before you burn through a billing cycle paying full input.
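A guardrail sketch for that last pattern. The helper and its fail-fast behavior are assumptions, not SDK features; `cache_read_input_tokens` is the real usage field:

```python
def assert_cache_hit(response, expected_prefix_tokens: int) -> None:
    """Fail fast when a call that should be warm comes back cold."""
    read = response.usage.cache_read_input_tokens
    if read == 0:
        # Zero reads on call two means the prefix mutated, the TTL expired,
        # or the prefix is below the model's minimum cacheable length.
        raise RuntimeError("cache miss; diagnose before the next billing cycle")
    # A healthy hit reads back roughly the full cached prefix.
    print(f"cache hit: {read} of ~{expected_prefix_tokens} prefix tokens")
```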
Related
- Cost & Intelligence Levers
- Extended Thinking
- Opus 4.7 Best Practices
- Claude Prompting Best Practices
- Prompt Engineering
- Cross-Topic Connections
Try It
- Audit one production prompt. Count tokens in the static prefix (skill bundle, system, examples). If it’s over 2,048 on Sonnet or 4,096 on Opus, you have a caching candidate.
- Add `cache_control: {"type": "ephemeral"}` to the last static block. Run the prompt twice within 5 minutes and diff `cache_read_input_tokens` between call one and call two; on call two it should land near your prefix token count.
- For any workflow that runs more than 5 prompts in an hour, switch to `ttl: "1h"`. The 2x write pays back after the second hit and you stop re-warming caches between sessions.
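A 1-hour TTL sketch. The longer TTL is beta-gated; the header value below is the 2025-era flag and may have changed, so treat it and the model id as assumptions:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model id
    max_tokens=1024,
    # Beta header required for the extended TTL at the time of writing.
    extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
    system=[{
        "type": "text",
        "text": "<skill bundle + brand voice prefix>",
        # 2x write once, then 0.1x reads for the whole batch window.
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }],
    messages=[{"role": "user", "content": "Generate blog post 1 of 30."}],
)
```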