Source: The prompting playbook (YouTube G2B0YWuJUgI), Margot Vanlar (Applied AI engineer, Anthropic London), Code with Claude London 2026 (breakout-room session, last of the day).

Margot Vanlar walks through prompt engineering as it actually happens at customer engagements — not a list of dos and don’ts, but a live debugging loop on two scenarios she sees most in her day-to-day: (1) maintaining a production prompt that suddenly regresses after a model migration, and (2) building a new agentic use case from zero. Both scenarios are anchored in concrete worked examples — a Meridian Mobile customer-support bot and a retail-staff scheduling agent — with a vibecoded eval harness running live on stage. The unifying thesis: evals come first, then prompting fixes failure modes one at a time, and “instructions don’t add capability” — when the model can’t do the task, give it a tool, not louder words.

Key Takeaways

  • Prompting is still a critical skill. “Prompting is arguably one of the first skills, if not the first skill that we had to learn as engineers when we first started to work with LLMs. And even now it continues to be one of the most critical skills to building effective AI systems.”
  • Two scenarios cover most production prompt work: (1) maintaining a prompt that regresses on model migration, (2) building a new agentic use case from scratch.
  • Evals are the starting point, not the prompt. Without an eval suite, you can’t tell whether a prompt change correlated to a performance improvement. Two failure modes need testing: capable-but-different-behavior (fixable with prompting) vs. not-capable-enough (no amount of prompting will fix that).
  • Three eval case types are mandatory: a control case (unambiguous, should always pass), edge cases (where the model has failed before, so the patch sticks), and capability boundary cases (where the model should hand off to a human or refuse).
  • General hygiene before targeted fixes. Before debugging specific failure modes, apply Prompting 101: structure with XML tags, separate role / policy / guidelines / tone / data, remove redundant copy-pasted website content (hero images, cookie banners), kill the “you are a human” lies.
  • Rule of thumb for prompt structure: “If you’re reading a prompt and you can’t tell guidelines from policy, from data, most likely the model isn’t able to either.”
  • Output contracts where format matters. Define an output format (e.g., XML tags) AND back it up at the harness layer with a stop sequence or structured outputs — don’t rely on the prompt alone for consistency on nested-JSON-style outputs.
  • Defensive patches from old models become liabilities on new ones. A “never give the customer the wrong plan details — point them to the URL” patch designed to suppress hallucinations on an older model caused Claude to withhold information it had access to on the new model. Newer models are better at instruction-following, so old defensive patches get overfitted to.
  • Version-control your defensive prompts. When you introduce a patch for a specific model’s behavior, track why you introduced it so you can backtrack on model migrations.
  • Instructions don’t add capability. Telling the model “critical: always calculate prorated amounts correctly” doesn’t make it better at mental math. The fix is to give it a calculate_proration tool — schema + implementation — so it offloads execution to deterministic code.
  • State both sides of trade-offs. Telling the model “escalating costs $8” without also stating “getting it wrong costs a refund plus customer trust” causes overfitting to the cost side. Smarter models are better at making trade-offs themselves — but only if you give them both sides.
  • When building from scratch, vary three axes: prompt, model, harness. Don’t just iterate the prompt — also try larger models, adaptive thinking, and agentic loops. Margot’s scheduling example walked through Sonnet 4.6 → Opus 4.7 → Opus 4.7 + adaptive thinking → Sonnet 4.6 + better prompt → 3-prompt agentic loop.
  • The generate-evaluate-repair loop beat single-prompt approaches on the scheduling task — lower token cost, lower latency, all five test cases passing. Bonus benefit: soft constraints (e.g., “Harry doesn’t like working with Sally”) can be added to the evaluation prompt at runtime without changing the Python checker.

The two scenarios

Margot frames the talk around the two production prompt situations she sees most often:

  1. Maintaining and migrating a prompt. “You have an existing prompt in production that you’ve been maintaining for some time and possibly you’re migrating it to a new model or making a change to the architecture and for some reason it’s no longer working as well.”
  2. Building from zero to one. “We’re building an entirely new agentic use case from the ground up and we need to build the prompt from zero to one.”

Most of the talk uses scenario 1 to teach the prompting fundamentals, then scenario 2 demonstrates how the same fundamentals plus model/harness choices compose into a complete agent.

Scenario 1 — Debugging the Meridian Mobile support bot

The worked example is a customer-support bot for a fictional telco called Meridian Mobile. The starting prompt is the kind of artifact she sees at customers regularly: multiple authors over time, no clear owner, policy / tone / process / patches-for-old-models all mashed into one paragraph, with a “you are a human” lie at the top and stray hero-image references and cookie banner text copied verbatim from a webpage.

The five eval test cases

The eval is intentionally miniaturized for the talk (real ones are bigger). It covers the three required case types:

  1. Data limit on the basic plan — control case, unambiguous, should always pass.
  2. Proration calculation — “If I switch my plan halfway through the month, what will my bill look like?” Edge case requiring arithmetic.
  3. Policy / plan details — must accurately answer questions covered by policy.
  4. Billing error — must escalate to a human, not try to diagnose.
  5. Hotspot data on a legacy plan — must not withhold information the model has access to.

After the v0 eval run, the control case passes; the other four fail.

Pass 1 — General hygiene (the cleanup before targeted fixes)

Before debugging specific failure modes, Margot applies general best practices:

  • Remove redundant content. The hero image reference, cookie banner text, and “you are a human” line all get cut.
  • Add structure with XML tags. Separate <role>, <guidelines>, <policy>, <tone> sections. “We’ve added XML tags here to define the role, to separate general guidelines, to separate policy, to separate tone of voice.”
  • Re-run the eval. The proration case improves just from the cleanup. A regression on the hotspot case is noted but parked — natural variance, will revisit.

Lesson: simply structuring a prompt better (clearer role description, separated concerns) lifts performance before any failure-mode-specific fixes. This is the prompt hygiene you can return to at any stage of writing and maintaining a prompt — and it matters more as prompts get longer.

Pass 2 — Output contracts

Add an ## Output format section to the prompt telling the model to wrap responses in XML tags. Then back it at the harness layer with a stop sequence that detects the closing tag. For a conversational support bot this barely moves the eval, but it’s a must-have for nested-JSON outputs where downstream parsing failures cascade. For more complex schemas, structured outputs at the API level enforce the contract programmatically.

After hygiene + output contract, two of five test cases consistently pass. Three failure modes remain: hotspot, proration, billing error.

Pass 3 — Hotspot (the “model withholds information” failure)

The customer is on a legacy “grandfathered” plan. The customer-context object passed to the prompt explicitly contains the customer’s hotspot allowance (5 GB). But the model deflects: it cites the current plan’s 4 GB allowance and tells the customer to go check their account URL themselves.

Reading the prompt: an old patch says “never give the customer the wrong plan details — instead point them to the URL.” That patch was designed to suppress hallucinations on an older, weaker model. On the newer model, which is better at instruction-following, the patch overfits — the model now withholds information it has access to.

Fix: rewrite to give a balanced view — “customers on grandfathered plans have different allowances, but the customer information given is the accurate source of truth.”

Lesson: “We worry a lot about hallucinations or the invention of facts and numbers, but actually the opposite can also happen. The model can withhold information that it actually has access to.” Defensive patches from old models can poison new model behavior. Use version control on defensive prompt additions so you can audit and remove them on migration.

Pass 4 — Proration (the “instructions don’t add capability” failure)

Customer asks: “What will my next bill be if I upgrade to the 30 GB plan?” The model reasons through it, does some mental math, but doesn’t give a concrete answer. Not safe to ship.

Original prompt: “critical: always calculate any prorated amounts correctly. Never give a customer a vague answer.”

Fix: give the model a calculate_proration tool. That requires three things:

  1. Tell the prompt about the tool: “Whenever you’re doing any calculations, please use the calculate_proration tool.”
  2. Define the tool schema in the API call — what it does, when to use it.
  3. Implement the tool — the actual deterministic math behind the calculation.

After this change, all proration test cases pass.

Lesson: “Instructions don’t add capability. Telling the model it’s critical to do a calculation right doesn’t make it better at mental math. The correct approach was to give it a tool. Overall, giving it the ability to reason over harder problems and using tools to actually execute them reliably.”

Pass 5 — Billing error (the “state both sides of the trade-off” failure)

A billing conflict comes in. The eval expects escalation to a human. The model tries to diagnose the problem itself instead.

Original prompt: “Avoid escalating or transferring to a care specialist unless absolutely necessary as it costs approximately $8 and it counts against our team’s fast contract resolution.”

This gives the model only one side: the cost of escalating, not the benefit. So the model overfits to not escalating.

Fix: state both sides — “escalating costs $8, but getting it wrong costs a refund as well as customer trust.”

Lesson: “As models become more intelligent, we need to remember to state both sides of the trade-offs because our models are becoming better themselves at making those trade-offs themselves.” Single-sided cost-or-benefit instructions are a common pattern that doesn’t survive model upgrades. Smarter models can make the right call — but only if they have both sides of the ledger.

After all four targeted fixes, every test case passes.

Scenario 2 — Building a new agent from scratch (retail staff scheduler)

The new-agent example is a week-long retail staff schedule generator. Eight employees, a headcount-by-shift grid, hard constraints (e.g., max hours per employee, required certifications per shift). Because the constraints are hard rules, grading uses a deterministic Python function counting violations per generated schedule — not an LLM judge.

Margot tests four configurations to demonstrate how prompt, model, and harness interact:

ConfigApproachResult
ASonnet 4.6 + simple prompt with hygiene + output formatAll 5 test cases fail. Reasoning attempt is decent but burns tokens and doesn’t check work.
BOpus 4.7 + same promptAll 5 still fail, but total violations drop significantly. Reasoning capability is helping.
COpus 4.7 + adaptive thinkingReliably generates compliant schedules. ~3× tokens, ~3× latency (~100s).
DSonnet 4.6 + better prompt (reasoning instructions + “check your work”)2 of 5 pass. Failures hit the max-tokens output limit. Raising max-tokens fixes it but increases tokens and latency further.
ESonnet 4.6 in a generate-evaluate-repair 3-prompt loopAll 5 pass. Lower tokens and lower latency than D.

The generate-evaluate-repair loop structure:

  1. Generator prompt — produces a first draft of the schedule.
  2. Evaluator prompt (separate LLM call) — checks each rule and reports specific violations with evidence. Not programmatic checking — LLM checking.
  3. Repair prompt — receives the violations and makes targeted fixes.

Three small focused prompts running independently beat one big prompt trying to do everything.

Bonus benefit of the agentic loop: soft constraints can be added to the evaluator prompt at runtime without touching the Python grading function. Examples: “Harry doesn’t like working with Sally — separate them where possible” or “we need a third shift on Wednesday.” This makes case-by-case customization possible without code changes.

Going forward: the two appropriate production paths from this experiment are Opus 4.7 + adaptive thinking, or the generate-evaluate-repair loop. The loop is more efficient and adds runtime flexibility.

Recurring anti-patterns to avoid

Pulled from the worked examples:

  • One giant paragraph mixing role + policy + guidelines + data + tone. Use XML tags to separate concerns.
  • Copy-pasted website content (hero image references, cookie banners) bleeding into the prompt.
  • Lies in the role description (telling the bot it’s a human when it isn’t).
  • Stacked patches from previous models with no provenance, no comments, no version control.
  • One-sided cost/benefit instructions that overfit on smarter models.
  • Mental math instructions instead of tools for deterministic calculations.
  • Long band-list instructions (“never do X, never do Y, never do Z”) instead of targeted fixes for specific failure modes.
  • Burning tokens on one big prompt when three small prompts in a loop would be cheaper and more reliable.

Try It

  1. Build the eval before the prompt. Even a five-case eval covering control + edge cases + capability-boundary cases will surface failure modes that gut feel misses. Use it as the regression test on every prompt edit and every model migration.
  2. Run the hygiene pass first. Before debugging a specific failure mode, structure the prompt with XML tags, separate role / guidelines / policy / tone / data, strip copy-pasted web junk, kill any “you are a human” lies. Re-run the eval to measure the lift from hygiene alone.
  3. Audit your old defensive patches on every model migration. Search the prompt for “never,” “always,” “critical” — those are usually patches for an old model’s behavior. Newer models may overfit to them. Version-control these patches and the reason for each one so you can backtrack on migration.
  4. When the model can’t do the task, give it a tool — don’t shout louder. For any deterministic calculation, structured lookup, or precise data manipulation, define a tool schema and implement it. “Critical: always calculate X correctly” is a code smell that means you need a tool, not a prompt edit.
  5. Vary all three axes when building a new agent: prompt, model, harness. Don’t grind on the prompt alone. Try a larger model. Try adaptive thinking. Try splitting one big prompt into a generate-evaluate-repair loop. Measure each variant against the eval on accuracy, tokens, and latency before picking the production path.

Open Questions

  • Tool calling cost vs. inline reasoning cost. Margot’s proration fix used a tool for deterministic math. At what task complexity does the tool-call overhead outweigh the reliability gain? Not addressed in the talk.
  • Eval suite size in real customer engagements. The talk used a five-case eval for demo purposes — Margot says real ones are bigger but doesn’t quote a typical size. What’s the minimum eval coverage that catches a regression reliably in production?
  • When does the generate-evaluate-repair loop break down? It won the scheduling example. At what problem complexity does the three-prompt loop stop being efficient compared to a stronger single-model pass?