Source: The thinking lever (Anthropic / Matt Bleifer — Research PM, Code with Claude 2026, May 7 2026 conference talk, YouTube OXJO4LldSnc)
Matt Bleifer (Research PM, Anthropic) walks through how Claude leverages compute at inference time — what test-time compute is, the three token types Claude spends (thinking, tool calling, text), the user-side levers that shape that spend (effort + task budgets), and the evolution of thinking modes (sequential thinking → interleaved thinking → adaptive thinking, the post-Opus-4.6 default). Anchored on a memorable traffic-simulation demo: the same Opus 4.7 model run at low / high / max effort produced strictly better results at roughly 2× and 10× the token cost of the low-effort baseline. The talk closes with three actionable items: enable thinking whenever possible, use effort + budgets to modulate, and if you’re not going to run evals, default to “extra high” effort for software engineering tasks.
Key Takeaways
- Test-time compute is real and scales like training compute. Just as bigger models trained longer get better, the same model spending more time per problem gets better. Lift is consistent across agentic coding, agentic search, computer use, PhD-level academic reasoning. Model intelligence + token budget compound — bigger model + more time = best results.
- The traffic-simulation demo is the talk’s anchor. Bleifer ran the same prompt (“create a realistic simulation of cars going down a one-way street at a traffic light”) on Opus 4.7 at three effort levels:
- Low effort — ~50 seconds, ~4,600 output tokens. Functional: cars do go down a one-way street, do stop at the light, but graphics are basic, traffic flow is basic, and Claude oddly placed the traffic light in the middle of the road.
- High effort — 2× the time, 2× the tokens. Cars of different types, traffic light correctly placed roadside, what Claude called an “intelligent driver model” where each car uniquely responds to the dynamics around it. Visibly better simulation.
- Max effort — 10× the time, 10× the tokens. Best graphics, Bleifer’s “favorite traffic light” of the three, realistic driving patterns. 10× the spend, qualitatively the best output.
- The three token types Claude spends. All three count toward both cost and waiting time:
- Thinking tokens — Claude’s internal monologue. Step-by-step reasoning, chain-of-thought, considering options, working through scratch-pad logic before acting.
- Tool calling tokens — Claude’s interface to the world. Search calls, file reads/writes, “really millions of different tools.”
- Text tokens — Claude’s interface to the user. Updates while working, summary at the end, simple-question responses.
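One concrete way to see the three token types is to inspect the content blocks of a Messages API response: with extended thinking enabled, blocks come back typed as thinking, text, or tool_use. A minimal sketch with the Python SDK follows (the model ID is a placeholder; per-block token counts are not exposed, so character counts stand in as a rough proxy for relative spend):

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Model ID and thinking budget are placeholders; use whatever you actually run.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Plan a refactor of a parser module."}],
)

# Content blocks come back typed as "thinking", "text", or (when tools are
# passed) "tool_use". Character length is only a rough proxy for spend.
for block in response.content:
    if block.type == "thinking":
        size = len(block.thinking)
    elif block.type == "text":
        size = len(block.text)
    else:
        size = len(str(getattr(block, "input", "")))
    print(f"{block.type:>9}: ~{size} chars")

print("total output tokens, all types combined:", response.usage.output_tokens)
```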
- Two user-side levers to shape Claude’s spend.
- Effort dial — preference signal (low → medium → high → max → “extra high”). “How much should Claude trade off time, cost, and quality?” Claude takes this as a soft preference, allocates tokens accordingly. All Anthropic benchmarks since Opus 4.6 run on adaptive thinking with effort levels.
- Task budgets — hard upper bound. “Don’t spend more than 100,000 tokens before stopping and checking in.” Budgets can be tokens, time, or cost. Increasingly load-bearing as agents start running for days/weeks/months.
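Of the two levers, the task budget maps most directly onto the existing extended-thinking API parameter, which hard-caps thinking spend. The effort dial is described in the talk as a softer preference knob and no parameter name is given, so the effort field sketched in the comment below is purely hypothetical. A minimal sketch with the Python SDK:

```python
from anthropic import Anthropic

client = Anthropic()

# Task budget as a hard upper bound: budget_tokens caps thinking spend and
# must be >= 1024 and smaller than max_tokens. Model ID is a placeholder.
response = client.messages.create(
    model="claude-opus-4-1",
    max_tokens=16_000,
    thinking={"type": "enabled", "budget_tokens": 8_000},
    messages=[{
        "role": "user",
        "content": "Create a realistic simulation of cars going down a "
                   "one-way street at a traffic light.",
    }],
)
print(response.usage.output_tokens, "output tokens spent")

# The effort dial ("low" ... "extra high") is a preference signal, not a cap.
# The talk names no API field for it, so this line is purely hypothetical:
# response = client.messages.create(..., effort="extra_high", ...)
```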
- Adaptive thinking is the latest evolution — and the default.
- First-gen reasoning models — sequential pattern: think first, then tool-call, then text. Mechanical and brittle.
- Interleaved thinking — Claude could think between tool calls, deciding based on results before continuing. A real improvement.
- Adaptive thinking (post-Opus 4.6 default) — Claude is free to think whenever appropriate. No constraint on when, how much, or in what order. Can lead with text to acknowledge the request, stop to call a tool, think about that result, respond to the user, continue calling tools, etc. Can also choose not to think at all for simple queries.
- In practice, Claude thinks more on higher effort, less on lower — but it’s prompt-dependent, not effort-dictated. “What’s 2+2?” gets minimal thinking even on max effort; a sophisticated research task scales differently.
- Adaptive thinking is NOT a router or auto-toggle. It does not classify the query and pick a thinking-vs-non-thinking model variant. It’s the difference between “you must spend at least one thinking token at the start” and “you can spend thinking tokens whenever and however needed.” Claude chooses inline.
- Performance result. Adaptive thinking is Anthropic’s intelligence-maximizing setting since Opus 4.6 — performance at “parity or better with interleaved thinking while delivering a better user experience.”
- Effort vs model selection. Bleifer’s framing for the model-vs-effort trade-off:
- Bigger model + lower effort > smaller model + max effort, on intelligence-demanding tasks. Opus 4.7 on low effort spent about the same tokens as Haiku 4.5 on max — and produced a much better result.
- Smaller models for cost on bulk simple tasks. Classification, information extraction, basic summarization — Haiku 4.5 saves significantly when peak intelligence isn’t needed.
- Smaller models for time-to-first-token. Smaller models produce tokens sooner. Use small models for fast time-to-first-token; use bigger models at lower effort for fast time-to-last-token.
- Wherever possible: build eval curves across models AND effort levels, look at trade-offs for your use case.
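A bare-bones harness for the eval-curve advice could look like the sketch below. run_task and score are hypothetical stand-ins for however you invoke Claude and grade outputs for your task; the model and effort labels are taken from the talk. The point is the two sweeps and keeping score and token spend side by side:

```python
from itertools import product

MODELS = ["opus-4.7", "sonnet-4.6", "haiku-4.5"]          # labels from the talk
EFFORTS = ["low", "medium", "high", "max", "extra_high"]

def run_task(model: str, effort: str) -> dict:
    # Hypothetical stand-in: call your Claude setup with this model/effort and
    # return the output, the tokens spent, and the full transcript.
    return {"output": f"{model}/{effort} output", "tokens": 1_000, "transcript": "..."}

def score(output: str) -> float:
    # Hypothetical stand-in: your task-specific quality grade in [0, 1].
    return 0.0

results = []
for model, effort in product(MODELS, EFFORTS):
    run = run_task(model, effort)
    results.append({"model": model, "effort": effort,
                    "score": score(run["output"]), "tokens": run["tokens"]})

# Chart score vs. tokens per model to see which curve dominates for your task,
# and read the transcripts, not just the scores.
for r in results:
    print(f"{r['model']:>10} @ {r['effort']:<10} score={r['score']:.2f} tokens={r['tokens']}")
```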
- The three actionable closing items.
- Enable thinking whenever possible. Thinking is core to how Claude works — give it the inner monologue space.
- Use evals if you have them. Chart curves across effort levels and model types. Always read the transcripts, don’t just look at scores.
- If you can’t / won’t do evals, default to “extra high” effort for software engineering work. Bleifer’s explicit “go-to” recommendation: extra high gives “great bang for your buck while delivering great intelligence.”
- The northstar. “Claude allocates compute incredibly well when asked for it — you set a quality bar and a budget, Claude figures out the rest and gives you the best performance for your use case.” Adaptive thinking + effort + budgets are “a step in this direction. They’re really just the beginning. There’s a lot more to come.”
Where it fits in the wiki
- Sister deep-dive to The Expanding Toolkit (Lucas) and Memory and Dreaming (Mahes) — three Code with Claude 2026 Anthropic deep-dives stacked. Lucas covers the agentic capabilities (tool use, context, code execution, computer use); Bleifer covers the inference-compute lever underneath them; Mahes covers Memory + Dreaming which sit on top.
- Canonical 2026-state explainer for Extended Thinking / thinking primitives. The existing extended-thinking article documents the API parameter; this talk explains the why, the user-side controls, and the adaptive-thinking model. Should be linked as the conceptual companion.
- Anchors the Opus 4.7 Best Practices “use adaptive thinking” guidance. Bleifer’s traffic-simulation demo concretely justifies why low-effort-on-bigger-model often beats max-effort-on-smaller-model.
- Reframes the Cost and Intelligence Levers connection article. That article frames the model-routing trade-off; this talk’s “use small models for first-token speed; use big models at low effort for last-token speed” is a sharper, more concrete framing.
- Connects to Claude Code Token Optimization techniques — Bleifer’s “evals + effort + budget triad” maps directly to the token-optimization playbook.
- The Karpathy “Vibe Coding to Agentic Engineering” thesis intersection. Karpathy’s “ghost vs animal” framing of agents → Bleifer’s “northstar of Claude allocating compute well when asked for it” — same direction, different vocabulary.
Implementation
- Tool/Service: Anthropic API + Claude Code, Opus 4.7 / Sonnet 4.6 / Haiku 4.5.
- Setup:
- Effort dial — set via API parameter, Claude Code config, or `--effort` flag. Levels: low / medium / high / max / “extra high” (Bleifer’s recommended default for SWE).
- Task budgets — set via API. Forms: tokens, time, cost. “Stop and check in when you hit X.”
- Adaptive thinking — default since Opus 4.6. No setup required.
- Interleaved thinking / sequential thinking — older modes; only opt into them for compatibility reasons.
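For the CLI path, a scripted effort sweep can be tiny. In the sketch below, `claude -p` is Claude Code's non-interactive print mode; the `--effort` flag and the level spellings are taken from the talk and are assumptions about what your installed version actually accepts:

```python
import subprocess

PROMPT = ("create a realistic simulation of cars going down a one-way street "
          "at a traffic light")

# Flag name and level spellings are assumptions based on the talk.
for effort in ["low", "high", "max"]:
    result = subprocess.run(
        ["claude", "-p", PROMPT, "--effort", effort],
        capture_output=True, text=True,
    )
    print(f"--- effort={effort} ---")
    print(result.stdout[:500])
```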
- Cost:
- Direct: tokens spent on thinking + tool-calling + text, billed at the model’s headline rate.
- Indirect: latency. Higher effort = longer wait.
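Back-of-the-envelope, using the demo's token counts: the per-million-token rate below is a placeholder, not Anthropic pricing, so substitute the headline rate of whichever model you run.

```python
# Demo output-token counts: low ~4,600, high ~2x, max ~10x (from the talk).
RATE_PER_MTOK = 15.00  # placeholder USD per million output tokens, NOT real pricing
runs = {"low": 4_600, "high": 4_600 * 2, "max": 4_600 * 10}
for effort, tokens in runs.items():
    print(f"{effort:>4}: {tokens:>6} tokens ≈ ${tokens / 1e6 * RATE_PER_MTOK:.2f}")
```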
- Integration notes:
- The traffic-simulation demo’s 10× cost ratio (low → max effort) at 10× quality is illustrative, not a guaranteed shape. Some tasks scale steeper, some flatter.
- Adaptive thinking is not a router. Don’t expect to predict whether Claude will think on a given query just from the query’s surface features.
- Use evals to find the right effort/model setting for your use case; don’t blindly trust general benchmarks.
- Read the transcripts. Bleifer says this twice. Token spend without transcript inspection is debugging blind.
- For software engineering work specifically, extra high effort is the recommended default if you don’t have time to eval.
Open Questions
- Effort-level taxonomy. “Extra high” is mentioned as Bleifer’s recommendation but not always listed alongside low/medium/high/max in API docs. Is it a stable level or a transient setting? What’s the canonical level list?
- Budget enforcement semantics. When a task budget is hit (tokens, time, or cost), does Claude check in mid-task or hard-stop? Graceful interruption with summary, or abrupt halt?
- Token-cost-per-effort scaling curve. Bleifer’s traffic demo gives one data point (low → high = 2×, low → max = 10×). What’s the curve shape across other tasks? Linear, sub-linear, super-linear?
- Adaptive thinking + multi-agent. When Managed Agents runs multiple agents in parallel, does each agent independently decide how much to think, or is there a shared budget? Open.
- Effort + Dreaming. Mahes’ Dreaming process is itself a compute-spend decision. Should Dreaming run on max effort to extract the most pattern detection, or low effort to amortize cost across many memory updates? Likely a knob in the Dreaming API; not addressed here.
- The “intelligent driver model” the simulation generated — Bleifer notes Claude named its own behavior pattern. Is that stable across runs and reproducible, or a one-off coincidence in his demo? Worth probing how reliably high-effort outputs invent their own internal taxonomy.
Try It
- Watch the talk (YouTube OXJO4LldSnc) — the traffic-simulation demo is short, concrete, and stays with you.
- Run Bleifer’s demo yourself. Pick a creative-coding prompt with subjective quality bands (“create a realistic X simulation”). Run it on Opus 4.7 at low / high / max. Look at the gap. Read the transcripts.
- Set a task budget on a real Claude Code session. Pick a task you’ve previously had run away cost-wise. Bound it to N tokens or M minutes. Watch the check-in behavior.
- Default to “extra high” effort for SWE work for a week. Compare your outcomes against your prior default. Bleifer’s “if you’re not going to eval, just go extra high” is worth taking literally.
- Build one eval curve for a task you do regularly. Sweep effort levels (low / medium / high / max / extra high) on one model. Then sweep models (Opus 4.7 / Sonnet 4.6 / Haiku 4.5) at one effort level. Pick the trade-off that fits your use case. Bleifer’s “build eval curves” is the actionable version of this advice.
- Use the time-to-first-token framing. If your application is conversational and latency-sensitive, switch to Haiku 4.5 for the first token and use Opus only for tool-execution depth. Bleifer’s framing: “small models for fast time-to-first-token; big models at lower effort for fast time-to-last-token.”
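To put numbers on the first-token vs. last-token distinction, stream the response and timestamp both ends. A sketch with the Python SDK (model ID is a placeholder):

```python
import time
from anthropic import Anthropic

client = Anthropic()
start = time.monotonic()
first = None

with client.messages.stream(
    model="claude-haiku-4-5",   # placeholder model ID
    max_tokens=512,
    messages=[{"role": "user",
               "content": "Summarize the trade-off between latency and quality in two sentences."}],
) as stream:
    for _ in stream.text_stream:
        if first is None:
            first = time.monotonic() - start   # time-to-first-token
last = time.monotonic() - start                # time-to-last-token

print(f"time-to-first-token: {first:.2f}s   time-to-last-token: {last:.2f}s")
```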
Related
- Code with Claude 2026 — Opening Keynote — umbrella talk
- The Expanding Toolkit (Lucas) — agentic primitives
- Memory and Dreaming (Mahes) — top of the stack
- Asana AI Teammates (Ara) — customer-side companion
- Extended Thinking API reference
- Opus 4.7 Best Practices
- Claude Code Token Optimization
- Ollama + Claude Code Cost Savings
- Cost and Intelligence Levers
- Karpathy — Vibe Coding to Agentic Engineering