Source: Getting more out of the Claude Platform (YouTube QIriO1-vHYw), Puneet Shah (Product Manager, Anthropic platform team), Code with Claude London 2026 (uploaded 2026-05-22, last session of the day). Transcript via local Whisper fallback (no YouTube captions).
The closing session of Code with Claude London 2026, delivered by the Anthropic PM who shipped the 1M context window, fast mode, and prompt-caching improvements. Puneet frames the Claude platform as “the layer on top of the models” — the production-grade features that turn raw intelligence into a real business. The talk uses a live HeroCorp demo (a fictional superheroes-for-hire company with a dashboard built on Claude) to walk through five compounding cost-and-quality optimizations: prompt caching, three context-engineering techniques (tool search, programmatic tool calling, compaction), and the advisor strategy. The HeroCorp dashboard cost drops from over 100 pounds per load to ~11 pounds across the demo while preserving Opus-level intelligence on the critical-path decisions.
Key Takeaways
- “If you remember nothing else, think about prompt caching.” Caching delivers three compounding benefits: 90% input-token discount, rate-limit boost (cached tokens don’t count — 80% cache hit ≈ 5× effective rate limit in practice), and lower time-to-first-token. Target 80%+ hit rate; named customers Replit, Cursor, Perplexity, and Claude Code all hit 90%+.
- First question to ask: what is your cache hit rate? New analytics pages live in console.anthropic.com next to cost and usage. As of “yesterday” (May 21), Anthropic shipped a feature showing why your cache broke — most common failure is putting a timestamp into the system prompt (the prefix changes every call, invalidating the cache).
- Starting from 0% cache hit rate is fine. Two on-ramps: one-line auto-caching, or the Claude API skill built into Claude Code (and many other coding agents) that diagnoses and rewrites your prompt ordering to maximize hits.
- Context engineering is the discipline of curating what Claude sees. Three techniques covered in order — pruning tools, pruning tool results, summarizing conversation history.
- Tool search tool. Define all tools up front, but pass Claude only a meta-tool (“call this when you need a tool”). Claude calls it, gets a tool list, then only the chosen tool’s full schema enters context. Lovable cut overall token consumption 10% and improved model performance with this single change — less gunk in context = better answers.
- Programmatic tool calling. Claude writes a short Python script that calls the same tools, curates the response, and streams only what’s relevant back to the conversation. Quora used this to strip irrelevant HTML from scraped pages. Pattern works because Claude is “really good at writing code.”
- Compaction. When you hit your context threshold, Claude summarizes the conversation, ships key facts forward, drops irrelevant turns, and keeps going — “almost unlimited feeling context.” You set the threshold (often 400K–500K, not the full 1M) and provide your own summary prompt to guide what gets preserved. Hex uses this in production.
- Advisor strategy is the Pareto-optimal cost/quality lever. Run the agent loop with a cheaper executor (Sonnet or Haiku) that calls an Opus “advisor” only for hard, oddly-shaped decisions. Mental model: junior engineer paired with a senior — junior does most of the keyboard work, senior reviews code, helps with architecture. Bolt uses this for better architectural decisions on complex tasks with no overhead on simple ones.
- “Watermelon” example from the demo. Sonnet read a renewal-deal transcript and marked it green. Opus, called as advisor, caught that the customer specifically wanted a superhero who would be unavailable on the event date — green outside, red inside. Sonnet-with-Opus-advisor recovered Opus-quality judgment at Sonnet pricing.
- “Look at the transcript of your agents.” Puneet repeats this multiple times. Most production wins come from reading what the model actually sees and asking whether it’s relevant.
- Threshold tuning is per-scenario. 1M context is great when you need it, but for many production agents 400K–500K hits the right intelligence/cost/latency mix. Start there.
- Recent launches called out in closing slide. Automatic prompt caching (one-line implementation) and “the Cloud platform on AWS” — the full platform feature set now available where many customers already run Anthropic models on AWS.
- Last 24 hours of shipping. Puneet flags that some features in the talk launched literally during the conference; the platform is evolving fast enough that the slide could not fit everything shipped in 2026.
The five-step optimization stack
The HeroCorp demo applies these in order, compounding each on the last:
- Prompt caching — zero cost change to intelligence; brings the dashboard from 0% cache hit to 58% on first pass (~50% cost reduction).
- Tool search tool — only definition of the called tool enters context (hero retention metrics tool is 14K tokens; hero list 6.1K; another 9.3K — all kept out of context until needed).
- Programmatic tool calling — Claude writes a Python script that loops through Gong sales-call transcripts, extracts aggregate sentiment only, drops the 30–60 minutes of verbatim audio commentary.
- Compaction — threshold set at 400K; conversation gets summarized down once it hits, then continues. User provides the summary prompt to guide what’s preserved.
- Advisor strategy — switch executor from Opus 4.7 to Sonnet 4.6 with Opus 4.7 advisor. Sonnet calls Opus only when it encounters odd-shaped decisions (renewal evaluations, architectural choices); simple tasks run cheap and fast.
End state from the demo: from “over 10× cost” to ~11 pounds per dashboard load, with no loss of intelligence on critical decisions (Opus still catches the watermelon).
Try It
- Open console.anthropic.com → caching analytics. Check your current prompt cache hit rate. If it’s under 80%, look at the new “why did my cache break” view — the most common culprit is dynamic content (timestamps, IDs, “what day is it”) leaking into the system prompt.
- Install the Claude API skill in Claude Code. Ask it to improve your application’s cache hit rate. It rewrites prompt ordering automatically. (See Claude skills ecosystem for the skill model and CLI vs MCP tool selection for when to reach for it.)
- Audit your agent’s tool schemas. Add up the token cost of every tool definition you’re passing. If it exceeds 5–10K combined and most tools are unused per turn, switch to tool search tool — Lovable cut 10% of total tokens with this alone.
- For long-running agents, set a compaction threshold below 1M. Try 400K first. Write your own summary prompt (what facts must survive compaction for the agent to stay on track?) and watch the transcript across the first compaction event.
- A/B-test advisor strategy on your most expensive agent run. Swap executor from Opus to Sonnet (or Haiku) and add an Opus advisor tool that the executor can call when it’s uncertain. Measure cost + quality vs. all-Opus baseline. Expect Pareto improvement on tasks with mixed complexity.
Related
- Advisor Strategy — separate deep-dive on the senior/junior model-pairing pattern Puneet calls “Pareto optimal.”
- Context Management in Claude Code — adjacent take on the same three context-engineering primitives applied inside Claude Code.
- Code with Claude London 2026 — Opening Keynote — the keynote that opened the conference Puneet closed; same platform thesis seen from Anthropic’s leadership angle.
- Managed Agents: Self-Hosted Sandboxes + MCP Tunnels — the other platform-feature announcement from the same week (same conference).
- Managed Agents in Production (Jess Ann + Lance Martin) — primitive-level walkthrough of the Anthropic-hosted agent stack; complements Puneet’s per-feature optimization framing.
- Anthropic SDK Releases — May 2026 — the SDK plumbing for the features Puneet demos (caching headers, sandbox config, compaction).
- Agent SDK — primitive overview that frames where Puneet’s features sit in the platform stack.
- Claude Skills Ecosystem — context for the “Claude API skill” Puneet recommends as the on-ramp to caching.
Open Questions
- Exact GA / rollout status of the analytics “why did my cache break” view — Puneet said “launched yesterday,” confirmable against console release notes.
- Concrete numbers on the HeroCorp demo are illustrative (30+ pounds → 11 pounds across the run) but Puneet does not break out per-step deltas in dollars — only that the final state is “about a third” of the starting cost after context engineering, before advisor strategy adds more.
- Bolt, Lovable, Quora, Hex customer-result numbers are spoken claims without published case studies linked in the talk; verify against Anthropic customer-story pages before citing externally.