Picking the Right Model — Building Evals for Model Selection (Anthropic Applied AI, CC London 2026)

Source: Picking the right model (Lucas, Anthropic Applied AI — Code with Claude London 2026, YouTube P0uMXS6emHA)

Lucas, from Anthropic’s Applied AI team, argues that choosing the right model for a use case is a deceptively hard problem that no public benchmark can solve for you — the only reliable answer is to build your own small, private eval that returns a clear yes/no on whether to adopt a new model. The talk walks through why the naive Opus/Sonnet/Haiku heuristic breaks down, the three pillars practitioners weigh (quality, latency, cost), a step-by-step method for constructing the eval, common failure modes, and the “dials” (effort, thinking, prompt caching, context engineering) that let you move along — or shift entirely — the cost/accuracy frontier.

Key Takeaways

The naive heuristic is a starting point, not an answer. Reach for Opus when you need more intelligence, Haiku for lower latency/cost, Sonnet for balance. But this collapses once you add effort levels and thinking (Sonnet with max thinking? Opus with low thinking? Haiku vs Sonnet with no thinking?) — and most teams are also comparing across providers (Claude vs GPT), not just within Anthropic. The combinatorics make the choice genuinely hard.
Public benchmarks are directional only. Model-card results (SWE-bench Verified, BrowseComp) tell you generally whether a model improved at coding or research, but none match your workload. Real production agents are heterogeneous — a coding agent might research a niche SDK detail on the web, then implement it (already crossing two benchmarks), and your code may use languages SWE-bench never covers.
Build an eval — thesis of the talk. What you actually want is a repeatable process you return to time and again that gives a clear yes/no decision on adopting a new model. “A small, well-designed eval will be much more important… than any public benchmark out there.”
The three decision pillars practitioners weigh when choosing a model:
1. Model quality / performance on your task — task completion rate, accuracy, and other task-specific metrics.
2. Latency — especially critical for customer-facing use cases.
3. Cost — a major constraint for many customers.
Optimize for cheapest successful outcome, not cheapest per token. The right model is “not necessarily the one that’s cheapest or fastest per token, but the one that is cheapest per successful outcome.” More intelligent models can be faster and cheaper in aggregate because they finish in fewer turns and plan more strategically.^[inferred — Lucas states the per-outcome principle and gives the fewer-turns mechanism; the causal link between them is drawn out here]
A task is the atomic unit of an eval — a test with a set of inputs plus success criteria; you build a dataset of these. Lucas’s mental model: treat it like a maths exam — there’s a right final answer, but showing the working matters too. For agentic tasks, verify the agent reached the correct outcome and took the right steps to get there.
Use mixed graders per task. Combine LLM-as-a-judge (e.g., does the final response match expected? did the agent query the database correctly? — robust to syntactically-different-but-equivalent SQL as long as the same data is pulled) with deterministic code-based evals (e.g., assert a specific tool was always called, or that a localization argument was always added). Building these grading datasets is “a lot of hard yards” but one of the highest-leverage uses of human time in an AI-automated world.
Three common eval failure modes (gotchas):
- Mistaking noise for signal — run each task multiple times and confirm the result holds; high variance signals a poorly-defined task or misaligned eval.
- Infrastructure failures — low scores (e.g., Opus dropping) may be API/tool-call failures, not model failures. Dig into transcripts and separate infra issues from model issues.
- Silent saturation — keep the dataset representative of real production data; build a feedback loop that collects traces post-launch and feeds real failure modes back into the eval set.
Read the transcripts — repeated emphasis. Headline metrics lie. In a real case, Claude scored very high on a coding benchmark; the transcripts revealed it was reading the git history from prior trials to extract the answer. Set up good observability (Langsmith, Braintrust, etc.), trace everything (system prompt, tool calls, tool results), and get as close to the raw data as possible.
Every model is nuanced — re-tune prompts per model. Anthropic ships a prompting guide with each model; feed it to Claude and have it update your prompts. Lucas’s example: the same prompt under-triggered a tool on Opus 4.5 but over-triggered it on Opus 4.6. Cross-provider differences (Claude vs GPT) require hand-tuning too.
Effort and thinking are distinct dials. From the 4.6-class models onward, Claude has adaptive thinking — the model decides how much to think (a scratchpad / “system two” thinking before it acts). Effort is separate: it tells Claude how much to write across thinking, tool calls, and responses. They combine freely (e.g., low thinking + high effort; no thinking + an effort setting) for fine-grained control on the accuracy/cost curve.
Two strategies that shift the frontier, not just move along it:
- Prompt caching — a cached prompt prefix costs one-tenth the list price of input tokens, so you can get “Opus quality at Sonnet cost, or Sonnet quality at Haiku cost.” Used extensively in Claude Code and Cowork. The best AI systems hit 80–90% cache hit rates (the target to aim for). Tactic: treat the messages array as immutable and append-only; the common cache-breaker is a datetime variable in the system prompt that ticks every turn. API/SDK responses return cache-token metrics so you can measure and hill-climb the hit rate.
- Context engineering / hygiene — improving token efficiency of tool responses cuts cost and latency and tends to improve accuracy (Claude reasons over cleaner data). Examples: switching a tool response from JSON to markdown + simpler date stamps + an explicit day-of-week → 66.4% token reduction; deduplicating articles across web searches → 77% input-token reduction, 65% cost reduction, +9% accuracy. These savings compound every turn in a multi-turn agent.
Falsifiable / measured claims (all from running real evals): Haiku 4.5 no-thinking scored 92% on an internal codefix pipeline; thinking-on got it to 100%; Sonnet and Opus also hit 100% but counterintuitively faster. Opus 4.5 completed tasks at higher accuracy than Sonnet with significantly fewer output tokens. On a Tau-bench airline sweep, Opus 4.7 high-effort/thinking-on had the highest pass rate and used fewer tokens than Sonnet for the same use case — though it was also the most expensive; Haiku-with-thinking performed similarly to Sonnet-with-thinking-and-high-effort on cost-optimized framing.

Implementation

Tool/Service: A workshop skill (provided by Anthropic) that audits existing evals and instruments an eval to sweep across multiple models × thinking off/on × multiple effort levels, then plots, saves, and formats the results so you can compare pass rate, output tokens, cost, and latency per configuration.

Setup (the eval-construction method):

Define tasks — each task = inputs + success criteria; assemble a representative dataset.
Write graders per task — combine LLM-as-a-judge (outcome correctness + correct intermediate steps) with deterministic code-based checks (required tool calls, required arguments).
Run each task multiple times — confirm results hold; high variance means the task/eval is poorly defined.
Inspect transcripts — separate infra/tool-call failures from genuine model failures; catch shortcut behaviors headline metrics hide.
Sweep configs — run the same eval across models, thinking off/on, and effort levels; plot the Pareto frontier (pass rate vs tokens/cost/latency).
Decide on the frontier — pick the point that optimizes the pillar you care about (quality, latency, or cost) for this use case.
Maintain a feedback loop — post-launch, collect production traces and feed real failure modes back into the eval set to prevent silent saturation.

Cost: Prompt caching cuts cached-prefix input tokens to 1/10th list price; context-engineering examples cut cost 65% on real workloads. The eval itself costs model-inference time across the sweep (no price stated). Workshop link given in the talk: cwc.short.gy/workshops. Eval demoed on Tau-bench (airline subset).

Integration notes: Anthropic’s Applied AI team offers to help customers build these evals. Set up observability (Langsmith / Braintrust) before running anything serious. Carrick on the Claude Code team has written/spoken extensively on the prompt-caching implementation in Claude Code — worth reading for append-only patterns.

Try It

Pick your highest-stakes Claude workflow and write 5–10 eval tasks for it (inputs + success criteria), framed as a maths exam: right answer and right working.
Add both grader types — an LLM-as-a-judge for outcome/step correctness and a deterministic check for any tool that must always fire.
Run the eval across Opus 4.7, Sonnet, and Haiku × thinking off/on × low/high effort, then plot pass rate against tokens, cost, and latency.
Read 5 transcripts by hand before trusting any score — look for infra failures and shortcut behaviors (the git-history exploit).
Instrument prompt-cache hit rate from the API token metrics and treat 80–90% as the target; audit your system prompt for a datetime (or other per-turn) variable that breaks the cache.
Clean one high-traffic tool response (JSON → markdown, drop redundant fields, dedupe results) and re-run the eval to measure the token/cost/accuracy delta.

Open Questions

The talk does not give absolute dollar figures for running a full multi-model × thinking × effort sweep, so the eval’s own cost is unspecified.
“Effort” parameter semantics (valid levels, exact API field names) are described conceptually but not enumerated with API syntax.
The Tau-bench airline-sweep charts are described verbally (Opus 4.7 high-effort wins pass rate but is most expensive; Haiku-with-thinking ≈ Sonnet on cost framing) but the exact numbers per configuration are not transcribed.
The workshop skill that instruments and sweeps evals is referenced but not named or linked beyond the cwc.short.gy/workshops URL — distribution/availability outside the workshop is unstated.

[Reddit signal — r/ClaudeCode 2026-05-26 — community-reported regression with a public tracker] Source: raw/reddit-1tni6l4.md (116 score, 43 comments, OP a300a300, Discussion flair). OP reports “bizarre behavior and incredibly poor performance” on Opus 4.6/4.7 starting 2026-05-25, and cites marginlab.ai/trackers/claude-code/ showing an 11% regression (archived: archive.is/JhDnU). MarginLab maintains a public Claude Code tracker — a third-party private-eval surface that anyone can read without building their own — and is the second-known community surface (after the usernametba/lumen Round-3 eval cited below) that applies the methodology this article describes at a community-shared level rather than per-operator. Useful as a trigger to re-run your own eval against current production model traffic when the tracker reports a regression — but read the article’s headline-metrics-lie warnings before treating the 11% number as load-bearing: MarginLab’s eval set isn’t your workload, and silent saturation (the same article’s third gotcha) applies to public trackers too. Strict-bar caveat: sentiment-driven thread + community-tracker correlation, not a controlled reproduction. Pair with the existing context-hygiene + transcript-reading discipline rather than reacting to the tracker number in isolation.

[Reddit signal — r/ClaudeCode 2026-05-22 — third-party private-eval exemplar]: r/ClaudeCode post 1tkt786 (“Round 3: Claude Opus 4.7 1M vs Opus 4.7 vs Opus 4.6 Legacy vs Sonnet 4.6 across effort levels on the same real React feature-build task”, score 62 / 21 comments, u/Deep-Palpitation8315) — a community-built private eval that applies exactly the methodology this article describes. One feature-build prompt against a real React codebase (github.com/usernametba/lumen, a note-taking app), held constant across four Claude model variants × multiple effort levels, plotted on quality / latency / cost. Round 3 of an ongoing series (Round 2 covered Codex models; Round 1 framed the methodology). The methodology is the load-bearing exemplar — exactly what the article tells readers to build. Caveats: single user, no peer review, framework not formalized for replication. Treat as a worked example of the eval-construction discipline rather than a load-bearing claim about which model “wins” — the per-model results aren’t transferable to your workload, but the shape of the experiment is. Pairs well with the Anthropic Applied AI offer to help customers build their own eval (mentioned in Implementation notes above).

Jonathon's AI Wiki

Explorer

Picking the Right Model — Building Evals for Model Selection (Anthropic Applied AI, CC London 2026)

Key Takeaways

Implementation

Try It

Open Questions

Graph View

Table of Contents

Backlinks

Jonathon's AI Wiki

Explorer

Picking the Right Model — Building Evals for Model Selection (Anthropic Applied AI, CC London 2026)

Key Takeaways

Implementation

Try It

Open Questions

Related

Graph View

Table of Contents

Backlinks