Source: wiki synthesis: Prompt Caching for Agencies, Extended Thinking (API Reference), Module 1 — Prompts as Reusable Artifacts, CCA-F Technical Reference, Claude Prompting Best Practices
Four structural decisions recur across production prompt design — cache or don’t, force extended thinking or leave it adaptive, enforce a schema or leave output loose, put content in the system prompt or the user turn — and the wiki had answers to all four scattered across claude-ai/, connections/, and the onboarding course, but nothing pulling them together for prompt-engineering/. This article is that consolidation.
Key Takeaways
- Caching breaks the moment anything in the cached prefix changes — timestamps, dynamic file order, tool-definition reordering, or even a one-character typo fix all force a full-price re-write. The fix is structural: stack everything static (brand voice, skill bundle, examples) at the front, put per-request content after the cache breakpoint, and switch to 1-hour TTL once a workflow crosses roughly 5 calls/hour.
- Extended/adaptive thinking helps on reasoning-heavy, high-stakes, or validation work and actively hurts on creative or routine fill-in-the-blank tasks — it doesn’t make Claude more creative, just slower and more expensive on tasks that don’t need deliberation.
- Enforce structured output when something downstream parses it; leave it loose for anything a human reads as prose.
tool_use/JSON-schema output eliminates syntax errors (malformed JSON, wrong types) but does not catch semantic errors (a total that doesn’t sum, a value in the wrong field) — schema conformance and correctness are different guarantees. - System prompt = stable identity/role/rules; user turn = anything that changes per request — partly a caching decision (dynamic content in the system prompt kills your cache hit rate) and partly a clarity decision (the system prompt is where “who Claude is for this task” lives; the user turn is where “what to do right now” lives). For RAG specifically, retrieved documents belong in the user turn, positioned above the query, because they change every request and would otherwise invalidate the cached system prompt on every call. ^[inferred — this specific RAG placement rule combines two separately-sourced facts (long-context positioning + cache-invalidation-by-dynamic-content) rather than being stated as a single rule in either source.]
1. Prompt Caching Strategy — When the TTL Breaks Down and How to Structure For It
Two TTLs exist: 5-minute (default, 1.25x write cost) and 1-hour (beta, 2x write cost). Every cache hit refreshes the TTL, so a busy session stays warm essentially for free — the failure mode isn’t the clock running out mid-session, it’s the prefix silently stopping being byte-identical. Five concrete ways caching breaks, all of which produce no error or warning — just silent full-price billing:
- Timestamps inside the cached block — “Today is 2026-07-02” in the system prompt makes the prefix unique every day.
- Dynamic file uploads placed before the cached content — cache is positional; a per-request PDF ahead of the static skill bundle means the bundle never caches.
- Reordering tools or system blocks — tool definitions sit at the top of the cache hierarchy; touching them invalidates everything after.
- Micro-edits to the prefix — fixing a single typo in a system prompt costs the entire warm cache across every session running that prompt.
- Prefix below the minimum cacheable length — 2,048 tokens (Sonnet 4.6) / 4,096 tokens (Opus 4.7). Below that, caching is silently skipped, not degraded.
Structural pattern that wins in practice: stack all static content (brand voice doc, skill bundle, persona, examples) at the front with the cache_control breakpoint on the last unchanging block; put dynamic, per-request content (the client brief, today’s date, the specific document) after the breakpoint, in the user turn. Switch from 5-minute to 1-hour TTL once a workflow crosses roughly 5 calls in an hour — the 2x write pays back after the second hit and stops the between-session cache-rewarming cost. Watch cache_read_input_tokens in the response usage field; if it’s zero on the second call, the cache is missing and needs diagnosis before the billing cycle closes. A real production example: a Meta-API DM-response agent that must re-send a full multi-location menu on every message achieves a 97% measured cache hit rate by keeping the menu/rules block completely static and only varying the live conversation turn — see Prompt Caching for Agencies for the full write-up and cost math.
2. Extended Thinking — When It Helps vs. When It Wastes Tokens
Extended thinking (manual budget_tokens, deprecated but functional on 4.6-era models) and adaptive thinking (the Opus 4.7+/Fable 5 default, tuned via the effort parameter) both add latency and cost in exchange for deliberation. The task-category breakdown that answers “when does it help vs. hurt”:
Reach for explicit/forced thinking on:
- Reasoning-heavy artifacts — audits, teardowns, multi-criteria comparisons — anything where Claude needs to weigh evidence rather than retrieve it.
- High-stakes one-shots, where getting it wrong on the first try costs more editing time than the extra latency costs.
- Validation steps specifically — checking a draft against an enumerated rule set is a careful, step-by-step task that benefits from deliberation.
Skip it on:
- Creative work (headlines, hero copy, ad variants) — extended thinking doesn’t make output more creative, only slower and more expensive.
- Routine fill-in-the-blank tasks where a well-built reusable prompt scaffold already does the reasoning work structurally.
- Anything in a fast iteration loop, where the added latency actively hurts the workflow.
Mechanically: adaptive thinking is the right default for general work because the model decides per-prompt whether deliberation is warranted based on perceived difficulty. The reason to override it with an explicit thinking budget is when you don’t trust the model’s own difficulty read for a specific artifact — a teardown or audit that’s “supposed” to be hard shouldn’t get skipped because the model judges it easy on a quick read. One hard constraint regardless of category: changing thinking parameters (enabled/disabled, budget, or display mode) between calls invalidates prompt-cache message breakpoints — see Extended Thinking for the full interaction matrix. On Opus 4.7 and later, manual budget_tokens returns a 400 error outright; the lever is effort (low/medium/high/xhigh/max), not a token count.
3. Structured Output vs. Free-Form + Post-Hoc Parsing
Enforce a schema (JSON, tool_use, strict output format) when:
- The output feeds directly into another tool — a spreadsheet import, a CMS, a downstream Claude call.
- The reader is scanning for specific fields (a comparison matrix, a brief spec) rather than reading prose.
- Completeness needs to be programmatically verifiable — every row must have all N columns, and an empty field should be detectable as an error rather than silently missing.
Leave output loose when:
- The output is creative work — forcing a strict schema onto headlines or ad copy flattens voice, because Claude starts treating each field as a slot to fill rather than a piece of writing.
- The right shape isn’t known yet — early prompt iterations benefit from letting Claude experiment with structure before locking one in.
- A human is going to read it as prose.
The failure mode enforcement doesn’t fix: tool_use with a JSON schema eliminates syntax errors (malformed JSON, wrong type, missing required field) but does not prevent semantic errors — a stated total that doesn’t match the sum of line items, a value placed in a plausible-but-wrong field. Schema conformance is a syntactic guarantee, not a correctness guarantee; catching semantic errors needs the self-correction pattern (ask for both a stated_total and a calculated_total, flag conflict_detected if they differ) documented in Prompt Engineering Essentials, layered on top of, not instead of, the schema.
4. System Prompt vs. User-Turn Placement
System prompt — role and identity (“You are a senior dental marketing copywriter…”), stable rules and constraints, banned-phrase lists, anything that should be true for every call in a session. This is also the highest-value target for prompt caching, since it’s the part most likely to be byte-identical across repeated calls.
User turn — the specific task/query, per-request variable data, and (critically) anything with a timestamp, a today’s-date reference, or content that changes on every call. Putting dynamic data in the system prompt is one of the five documented ways to silently kill a cache hit rate (see Section 1); putting it in the user turn instead is simultaneously the cache-correct move and — per Claude Prompting Best Practices’s long-context guidance — the quality-correct move, since long-form data belongs near the top of a prompt, above the query, and quality on complex multi-document prompts improves roughly 30% with that positioning.
For RAG specifically ^[inferred]: retrieved documents should go in the user turn (not the system prompt), wrapped in <document>/<document_content>/<source> tags, positioned above the actual query within that turn. Two independently-sourced facts combine to this rule: retrieved content changes on every query (so it can’t live in the cacheable system prompt without breaking the cache on every single call), and long-form context performs better positioned before the query rather than after it. A RAG system that tries to cache retrieved documents by stuffing them into the system prompt will pay a full cache-write cost on every request — the exact “dynamic file uploads before the cached content” failure mode from Section 1, just relocated from “PDF” to “retrieved chunk.”
Related
- Prompt Caching for Agencies — the full cost math and failure-mode catalog Section 1 summarizes.
- Extended Thinking (API Reference) — the authoritative API reference Section 2 summarizes, including the full per-model compatibility matrix.
- Module 1 — Prompts as Reusable Artifacts — source of the structured-output enforce/leave-loose framework and the Reasoning Controls section.
- CCA-F Technical Reference — source of the syntax-vs-semantic-error distinction in Section 3.
- Claude Prompting Best Practices — the general reference; source of the long-context positioning guidance underlying Section 4.
- Prompt Engineering Essentials — the self-correction pattern that catches what schema enforcement alone misses.
- Prompt Engineering — topic index.
Try It
- Audit one production prompt against the five cache-breaking patterns in Section 1. Timestamps, dynamic content before the cache breakpoint, reordered tools, prefix micro-edits, and below-minimum prefix length are all silent — check
cache_read_input_tokenson a repeated call to confirm you’re actually hitting. - Pick one prompt currently forcing extended thinking and one currently on default/adaptive; swap them. If the forced-thinking prompt is creative work, try adaptive and measure whether quality actually drops (often it doesn’t, and latency improves). If the adaptive prompt is a teardown/audit, try forcing deliberation and compare output depth.
- For your next RAG integration, confirm retrieved documents are landing in the user turn, above the query, and not in the system prompt — this is the single most common RAG performance-and-cost mistake per the combined caching and long-context guidance above.
Open Questions
- No published numbers exist for the token-cost delta of adding the standard verbosity-reducing snippet, or for extended-thinking task-category quality deltas specifically — the guidance in Section 2 is a synthesis of onboarding-course framing plus the API reference, not a benchmarked A/B result. Worth a dedicated hands-on benchmark if this becomes a recurring optimization target.