The Capability Curve — Jeremy (Anthropic Research PM)

Source: The capability curve (YouTube DNRddIEoH3c), Jeremy (Product Manager, Anthropic Research — coding behaviors and capabilities), Code with Claude London 2026 (uploaded 2026-05-22). Transcript via local Whisper fallback (no YouTube captions).

Jeremy’s “capability curve” is Anthropic’s internal name for the year-over-year coding-capability trajectory of the Claude model family — not any single model release. His thesis: the foundation is shifting beneath developers fast enough that the question is no longer “which model is best today” but “how do you architect an application that absorbs the next model’s gains automatically?” The talk uses 12-month deltas (Sonnet 3.7 → Opus 4.7 → Mythos preview) to make three concrete capability claims, then lays out four adoption patterns that let applications ride the curve instead of getting flattened by it.

Key Takeaways

SWE-bench Verified, 12-month delta: Sonnet 3.7 ~60% → Opus 4.7 87%. “Three times more issues solved” in a year. Mythos preview has saturated the benchmark, and Anthropic no longer uses SWE-bench Verified internally because frontier models leave no room to improve. New benchmarks now arrive more slowly than models saturate them.
Three capability axes are driving the gains: (1) planning and reasoning before acting, (2) error recovery and adapting to failure, (3) sustained attention over long agentic runs.
Doom looping is solved. The 12-month-ago failure mode of “Claude claims to have fixed it but actually re-applies the same broken solution” essentially no longer happens — models receive tool results, spend thinking tokens reasoning about the failure, and change approach.
Coherence holds to 1M+ tokens. A year ago, Claude forgot complex specs partway through runs of several hundred thousand tokens. Now it holds the spec across millions of tokens. “Be more ambitious: hand it the whole code base.”
Long-horizon agents now run for hours, not minutes. The capability axes stack into autonomy: plan → execute → verify against environment → iterate → re-validate against goal every few checkpoints.
Anchor example — Bun rewritten in Rust in one week. Jarred Sumner (Bun founder) had Claude rewrite the entire JavaScript engine from C++ to Rust in a single week, hitting ~100% pass rate on Bun’s near-100%-coverage test suite. Jarred does not know Rust. The PR merged; Bun is now Rust. Would have taken months for an individual. (Cross-references the Boris Cherny + Jarred Sumner talk on Robun.)
“Most software at Anthropic is now written by Claude” — CEO Dario, quoted by Jeremy. “Claude has written most of the code in Claude Code.” Roughly half the London 2026 conference room reported shipping a PR in the last week that was completely written by Claude.
Customer signals corroborate the three axes. Vercel: models do systems-code proofs during planning. Wenzer: market-leading coherence over multi-hour runs. Shopify: Opus 4.7 was a step up in code quality and self-verification.
Four adoption patterns to ride the curve: (1) evals, (2) shrink scaffolding, (3) give the model room to work, (4) close the agent loop.
Saturation is the enemy. Customers running pre-existing evals against a new model often see only ~1% improvement and conclude the model isn’t better — when the actual problem is their eval saturated months ago and no longer measures progress.
Shrink your scaffolding over time. Frankenstein prompts (3,000-line system messages built up to patch old failures) actively hurt new models. As models follow instructions more precisely, stale instructions create new bugs. Audit and cut prompts on every model upgrade.
Auto mode is now near-universal at Anthropic. “Almost every software engineer at Anthropic at this point is using auto mode” — a prompted classifier checks each tool call for safety and only loops humans in for critical/dangerous actions.
Close the agent loop. Plug Claude Code into your own agent system + evals, then ask it to improve its own prompts and tools against the eval score. Self-improvement without the developer hand-iterating every piece.

The capability curve framing

The curve is not about any individual model release — it’s the trajectory itself. Jeremy frames the developer’s job as architectural, not benchmark-chasing:

“Every couple months, the models are becoming significantly more intelligent and that really should change how we think about building applications.”

The foundation is shifting beneath your feet. Applications built to absorb model improvements pull ahead; applications that hard-code workarounds for last quarter’s failure modes get left behind. The talk’s structural argument: the three capability gains (planning, error recovery, sustained attention) stack into autonomy, and autonomy unlocks long-horizon agents that can run for many hours — but only if your scaffolding, evals, and harness are positioned to receive those gains.

SWE-bench saturation and the demo

Two empirical anchors carry the curve framing:

SWE-bench Verified, 12-month delta. Sonnet 3.7 sat at ~60% a year ago. Opus 4.7 is at 87%. Jeremy frames this not as a percentage improvement but as “solving three times more issues.” Mythos preview (Anthropic’s most advanced frontier model) has saturated the benchmark entirely. Internally, Anthropic no longer uses SWE-bench Verified as a progress signal because the headroom is gone. Models are now being released faster than new benchmarks appear.
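
A quick arithmetic check on the figures above. The ratio of the two pass rates is about 1.45x, so the “three times” framing is easiest to reconcile with the share of unsolved issues, which falls from roughly 40% to 13%. That reading is an assumption made here, not something the talk notes state. The variable names are illustrative.

```python
# Figures as quoted in these notes (the Sonnet 3.7 number is approximate).
old_solved = 0.60   # Sonnet 3.7 on SWE-bench Verified, ~60%
new_solved = 0.87   # Opus 4.7 on SWE-bench Verified, 87%

# Relative gain in solved issues.
solved_ratio = new_solved / old_solved

# Relative drop in unsolved issues (assumed reading of "three times").
unsolved_ratio = (1 - old_solved) / (1 - new_solved)

# Headroom: the share of the benchmark still left to measure progress with.
headroom = 1 - new_solved

print(f"solved ratio:   {solved_ratio:.2f}x")    # 1.45x
print(f"unsolved ratio: {unsolved_ratio:.2f}x")  # 3.08x
print(f"headroom left:  {headroom:.0%}")         # 13%
```

The shrinking headroom is the saturation problem in miniature: once only a small slice of the benchmark remains unsolved, further model gains barely move the score.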

The Claude.ai rebuild demo. Same prompt, same task, two models 12 months apart: “Rebuild the entirety of the Claude.ai website from scratch in one shot.”

Sonnet 4 (12 months ago): jumps in, doesn’t plan, doesn’t self-correct, writes ~2,000 lines, produces a basic UI where chat doesn’t actually work.
Opus 4.7 (today): uses tools, writes ~1,700 lines (fewer lines for better output), and produces a working application with chat input, completions, chat history, sidebar, formatted outputs, and mermaid diagrams in chat. It even added dark mode, like a true developer.

The demo is Jeremy’s anti-benchmark argument: if you tried any task 12 months ago and it failed, try it again — the delta on real-world coding tasks is dramatic, often more dramatic than the benchmark numbers suggest.

The three capability axes

1. Planning and reasoning before acting

Old failure mode (Sonnet 3.7): jump in, start building, look at the plan after failing. “Like me assembling IKEA furniture.”

Now: models read before acting, compose plans with a high likelihood of success, and investigate before executing. They catch their own mistakes while writing the plan: an “actually” or “never mind” appears mid-reasoning, followed by a revised approach. Iteration cost drops because the spec is already correct before execution starts.

Builder implication: give Claude time to think. Don’t manually scaffold “you must plan first” — select high reasoning effort and let Claude develop the plan on its own. Adaptive thinking is the default lever.

2. Error recovery and adapting to failure

Old failure mode (12 months ago): doom looping. Claude claims “aha, I’ve got it, fixed it” — and the diff is the same broken solution. Repeats with minor variations indefinitely.

Now: doom looping essentially doesn’t happen. Models receive tool results, spend test-time-compute thinking tokens reasoning about the failure, and change approach. The loop becomes try → fail → reason about why → try differently.

Builder implication: better task performance with fewer wasted tokens. Give the model (a) the ability to execute and observe failures, (b) feedback from the environment via tool results, (c) room to reason about that feedback. Those three together drop the iteration count from “many retries with the same wrong answer” to “a couple of focused retries that converge.”

3. Sustained attention over long agentic runs

Old failure mode (12 months ago): hundreds of thousands of tokens into a refactor, the model “loses the plot.” Forgets the spec, misses instructions from the opening prompt, drops fine points partway through.

Now: coherence holds to 1M tokens and beyond. Specs from the opening prompt survive through millions of tokens of execution. Not perfect coherence yet, but qualitatively closer.

Builder implication: be more ambitious. Stop pre-emptively chopping tasks into 200K-token bite-sized pieces. Hand Claude the whole code base. The harness and model together can run for millions of tokens — the assumption that “this is too long” is now usually wrong.

How these stack into long-horizon agents

The three axes compose:

plan → execute → verify against environment (run tests) →
iterate on failures → every few checkpoints, validate against goal

Agents now run for hours, not minutes. The Bun-in-Rust example is the anchor: Jarred Sumner (Bun founder, doesn’t know Rust) had Claude rewrite Bun’s entire JavaScript engine from C++ to Rust in a single week against Bun’s near-100%-coverage test suite. Hit ~100% pass rate. PR merged. Bun is now written in Rust. A solo engineer with Claude completed in a week what would have taken months in the old paradigm — and only because (a) the test suite was already comprehensive and (b) Jarred had the ambition to ask whether Claude could do it.

The structural lesson: long-horizon autonomy is unlocked by verification surfaces (tests, evals, type-checks, integration checks) that let agents close their own loops. No verification → no autonomy.
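
The plan → execute → verify → iterate flow can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not how Claude Code or any Anthropic harness is implemented: `run_long_horizon`, its callback parameters, and the toy environment are all hypothetical names invented for this example.

```python
from typing import Callable, List


def run_long_horizon(
    plan: Callable[[str], List[str]],
    execute: Callable[[str, List[str]], None],
    verify: Callable[[], List[str]],
    on_goal: Callable[[str], bool],
    goal: str,
    max_iterations: int = 20,
    checkpoint_every: int = 3,
) -> bool:
    """Hypothetical loop: plan, execute, verify, iterate, re-check the goal."""
    steps = plan(goal)
    failures: List[str] = []
    for iteration in range(1, max_iterations + 1):
        for step in steps:
            execute(step, failures)        # last round's failures act as feedback
        failures = verify()                # e.g. run the test suite
        at_checkpoint = iteration % checkpoint_every == 0
        # Re-validate against the goal at checkpoints and before declaring success.
        if (at_checkpoint or not failures) and not on_goal(goal):
            steps = plan(goal)             # drifted from the goal: re-plan
            continue
        if not failures:
            return True
        steps = [f"fix: {name}" for name in failures]  # iterate only on failures
    return False


# Toy environment (assumed): two failing tests that a "fix: <name>" step repairs.
failing = {"test_a", "test_b"}


def toy_execute(step: str, feedback: List[str]) -> None:
    if step.startswith("fix: "):
        failing.discard(step[len("fix: "):])


ok = run_long_horizon(
    plan=lambda goal: ["implement"],
    execute=toy_execute,
    verify=lambda: sorted(failing),
    on_goal=lambda goal: True,
    goal="port module",
)
print(ok, failing)  # True set()
```

The `verify` callback is the verification surface: without it the loop has no signal to iterate on, which is the “no verification → no autonomy” point in code form.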

Customer signals corroborating:

Vercel — models doing systems-code proofs during planning before they touch the implementation.
Wenzer — market-leading coherence over multi-hour runs.
Shopify — Opus 4.7 step-up in code quality and self-verification mid-task.

Riding the curve — four adoption patterns

Jeremy’s argument: the highest-leverage thing builders can do is architect applications that absorb each next model’s gains without rewrites. Four patterns:

1. Evals are critical

“Evaluations are just the unit tests and the regression tests of the AI era. Every software application that uses AI should have evaluations. If you don’t, it’s similar to not having unit tests for your traditional application.”

Concrete rules:

Just start. Anthropic has an engineering-blog guide; the first step is building any eval at all. Teams are afraid evals are an academic exercise requiring researchers — they’re not.
Measure what you actually care about. Don’t use SWE-bench Verified or BrowseComp or terminal-bench as your eval — those are academic benchmarks, not your application. Collect failure modes from your actual customers, build evals from real traffic.
Know when your eval is saturated. If Opus 4.7 already gets 90% on your eval and the remaining 10% is impossible or unfair, the eval is dead: it can no longer measure model progress. Customers running saturated evals against new models see ~1% improvement and conclude the model isn’t better, when the actual problem is that the eval stopped measuring progress some time ago.
Keep raising the bar. Refresh evals as models improve. Distinguish regression evals (where 100% pass is expected) from progress evals (which should stay unsaturated).
Benchmark every new model on your evals. Companies with evals adapt to new models fastest; companies without them read Twitter for vibes and lose weeks. Eval-equipped teams have a competitive advantage because the biggest application improvement is usually swapping in the newest frontier model.
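
The rules above can be sketched as a tiny harness that keeps regression and progress evals separate and flags saturation. It is a minimal sketch, not Anthropic's eval tooling: `EvalCase`, `pass_rate`, `report`, and the 0.9 saturation threshold (borrowed from the 90% example above) are assumptions for illustration, and the “model” is a stand-in function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Model = Callable[[str], str]


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # True when the model output is acceptable
    kind: str                     # "regression" or "progress"


def pass_rate(model: Model, cases: List[EvalCase]) -> float:
    """Fraction of cases whose check passes; 0.0 for an empty list."""
    if not cases:
        return 0.0
    return sum(case.check(model(case.prompt)) for case in cases) / len(cases)


def report(model: Model, cases: List[EvalCase],
           saturation_threshold: float = 0.9) -> Dict[str, object]:
    regression = [c for c in cases if c.kind == "regression"]
    progress = [c for c in cases if c.kind == "progress"]
    reg, prog = pass_rate(model, regression), pass_rate(model, progress)
    return {
        "regression_pass_rate": reg,
        "regression_ok": reg == 1.0,                  # regressions should all pass
        "progress_pass_rate": prog,
        "progress_saturated": prog >= saturation_threshold,  # time to raise the bar
    }


# Toy cases and a toy "model" that upper-cases its prompt (both assumed).
cases = [
    EvalCase("hello", lambda out: out == "HELLO", "regression"),
    EvalCase("abc", lambda out: out == "ABC", "progress"),
    EvalCase("reverse me", lambda out: out == "em esrever", "progress"),
]
toy_model = lambda prompt: prompt.upper()

result = report(toy_model, cases)
print(result)
```

Running the same `report` against each new model gives a like-for-like comparison; when `progress_saturated` turns true, the progress set needs harder cases drawn from real traffic.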

(See Lucas’s London 2026 talk on private evals for the deeper how-to.)

2. Shrink your scaffolding over time

“Scaffolding” = everything around the LLM: prompts, tools, execution environment, skills, harness. The Frankenstein prompt failure mode:

“Eventually you have 3,000 lines of mostly prompt instructions that were designed for previous models and for failures that might not even happen anymore.”

As models follow instructions more precisely, stale instructions become active bugs. Concrete Anthropic example from the Claude 4 launch: the claude.ai system prompt had a citation-format example referencing a format Anthropic no longer used. Older models had ignored it; Claude 4 followed it. Tweaking a few characters in that one example fixed an entire class of citation errors.

Builder implication: on every model upgrade, audit your system prompt. Cut what’s not relevant. Describe what you intend — not how to work around the quirks of the previous model. Use your evals to test whether cutting the prompt to bare minimum still passes.
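
One way to use evals for this audit is a greedy ablation: remove one prompt line at a time and keep the removal whenever the eval score holds. The sketch below assumes a hypothetical `score` function that runs your eval for a given prompt; `prune_prompt` and the toy scorer are illustrative, not a documented technique from the talk.

```python
from typing import Callable, List


def prune_prompt(lines: List[str],
                 score: Callable[[str], float],
                 tolerance: float = 0.0) -> List[str]:
    """Greedily drop prompt lines whose removal does not lower the eval score."""
    kept = list(lines)
    baseline = score("\n".join(kept))
    i = 0
    while i < len(kept):
        trial = kept[:i] + kept[i + 1:]
        if score("\n".join(trial)) >= baseline - tolerance:
            kept = trial      # the line was not pulling its weight; drop it
        else:
            i += 1            # the line matters; keep it and move on
    return kept


# Toy scorer (assumed): only the citation instruction affects the score.
def toy_score(prompt: str) -> float:
    return 1.0 if "Cite sources as [n]." in prompt else 0.5


prompt_lines = [
    "You are a helpful assistant.",
    "Never use the old citation format.",
    "Cite sources as [n].",
]
pruned = prune_prompt(prompt_lines, toy_score)
print(pruned)  # ['Cite sources as [n].']
```

In practice each `score` call is a full eval run, so the cost grows with prompt length; pruning by paragraph or section rather than by line is a reasonable trade-off.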

3. Give the model room to work

Three sub-patterns:

Allow thinking when appropriate. All frontier models are reasoning models — they benefit from test-time compute. Allow adaptive thinking so the model can choose to think when appropriate. For sensitive use cases (software engineering, enterprise agents), dial the effort parameter close to maximum.
Allow autonomous operation, safely. Auto mode in Claude Code uses a prompted classifier to check each tool call: is this safe to approve automatically, or does it need human approval? Anthropic has published a blog post on this. “Almost every software engineer at Anthropic at this point is using auto mode,” with humans looped in only for critical or dangerous actions. The pattern is portable to any application.
Don’t be scared of “model deletes my cluster” — solve it architecturally. The fix isn’t human-in-the-loop on every action; it’s a classifier that knows which actions need humans.
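
The gating pattern can be sketched in a few lines. This is not Claude Code's auto mode implementation: the notes describe a prompted classifier, and the keyword rule below is only a stand-in so the example runs offline. `ToolCall`, `keyword_classifier`, and `gate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolCall:
    tool: str
    args: Dict[str, str]


def keyword_classifier(call: ToolCall) -> str:
    """Stand-in for a prompted LLM classifier; returns 'safe' or 'needs_human'."""
    risky = ("rm -rf", "drop table", "kubectl delete", "git push --force")
    command = call.args.get("command", "").lower()
    return "needs_human" if any(r in command for r in risky) else "safe"


def gate(call: ToolCall,
         classify: Callable[[ToolCall], str],
         ask_human: Callable[[ToolCall], bool],
         run: Callable[[ToolCall], str]) -> str:
    """Run safe calls automatically; anything else needs explicit human approval."""
    if classify(call) == "safe" or ask_human(call):
        return run(call)
    return "blocked"


# Toy wiring (assumed): a human who denies every request, and a fake runner.
asked: List[str] = []


def deny(call: ToolCall) -> bool:
    asked.append(call.args["command"])
    return False


run = lambda call: f"ran {call.args['command']}"

r1 = gate(ToolCall("bash", {"command": "pytest -q"}), keyword_classifier, deny, run)
r2 = gate(ToolCall("bash", {"command": "kubectl delete namespace prod"}),
          keyword_classifier, deny, run)
print(r1, "|", r2, "|", asked)
```

Any verdict other than `"safe"` falls through to the human, so an unexpected classifier output fails closed rather than open.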

4. Close the agent loop

The most powerful pattern. Architect your system so Claude Code can inspect its own outputs and iterate on them to improve itself.

Jeremy’s concrete example: he plugs Claude Code into the agent he’s working on, points it at the evals, and asks “how can I improve the prompt? How can I improve the tools to get a higher score on this application?” Because Claude Code has access to the full loop — agent, environment, evaluation — it can run the agent itself, run the eval itself, and autonomously improve the system without the developer hand-iterating every change.

“If you can give Claude the ability to iterate on your own system, then you can sort of get to the point where you’re almost self-improving.”
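
The closed loop reduces to “propose a change, re-run the eval, keep it only if the score rises.” The sketch below is a hedged illustration: `improve`, `propose`, and `evaluate` are hypothetical names, with `propose` standing in for asking Claude Code for prompt or tool changes and `evaluate` standing in for a real eval run. The `rounds` limit plays the role of a budget cap.

```python
from typing import Callable, Iterable, Tuple


def improve(config: str,
            propose: Callable[[str, float], Iterable[str]],
            evaluate: Callable[[str], float],
            rounds: int = 5) -> Tuple[str, float]:
    """Hill-climb on the eval score; stop when a round yields no improvement."""
    best, best_score = config, evaluate(config)
    for _ in range(rounds):
        improved = False
        for candidate in propose(best, best_score):
            candidate_score = evaluate(candidate)
            if candidate_score > best_score:   # keep only strict improvements
                best, best_score, improved = candidate, candidate_score, True
        if not improved:
            break
    return best, best_score


# Toy setup (assumed): the eval rewards prompts that contain two instructions.
hints = ["run the tests", "cite the file path", "use emojis"]
wanted = ("run the tests", "cite the file path")


def toy_eval(prompt: str) -> float:
    return sum(w in prompt for w in wanted) / len(wanted)


def toy_propose(prompt: str, score: float) -> Iterable[str]:
    return [f"{prompt} {h}." for h in hints if h not in prompt]


best, best_score = improve("Fix the bug.", toy_propose, toy_eval)
print(best_score, "|", best)
```

Because candidates are accepted only on a strict score increase, the loop is only as trustworthy as the eval behind it; a saturated or gameable eval gives it nothing useful to climb.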

How Anthropic thinks about evaluation

Two structural shifts in how Anthropic itself measures progress:

Benchmarks are saturating faster than they ship. SWE-bench Verified was the standard; Mythos preview saturated it. The pipeline of new benchmarks is now slower than the pipeline of new models that beat them.
Demos > benchmarks for capability communication. The Claude.ai rebuild demo is the kind of artifact Jeremy reaches for, because it shows the qualitative delta (1,700 lines for a working app vs 2,000 lines for a broken one) that a saturated benchmark can’t.

The implication for builders: don’t trust public benchmark numbers as your North Star. Build your eval against your traffic, and trust those numbers more than any leaderboard.

Try It

Build one eval for your highest-stakes AI feature this week. Start with five real customer failure traces. Don’t worry about coverage; worry about whether the eval moves when models change. (Cross-reference picking-the-right-model-evals for the methodology.)
Audit your system prompt against Opus 4.7. Diff what’s actually needed against what’s there. If your prompt has lines added to patch a Sonnet 3.7 failure mode that no longer exists, cut them. Use your eval to confirm performance held.
Re-run a long-running task you gave up on 12 months ago. Hand an Opus-4.7-class model the whole repo. The capability curve says it can probably do it now.
Turn on auto mode and run a multi-hour task unattended. Pick a refactor or a “rewrite this in language X” task. Let the test suite be the verifier. (Cross-reference bun-robun-claude-code-issue-automation for the Bun template.)
Close the agent loop on one workflow. Plug Claude Code into an agent you’ve already built + its eval, then ask Claude Code to improve the prompt/tools to raise the eval score. Set a budget cap and observe.

Related

code-with-claude-london-2026-keynote — the same conference’s opening keynote, with the Mythos preview / Managed Agents / MercadoLibre framing the capability curve sits inside.
picking-the-right-model-evals — Lucas’s London 2026 talk on building private evals, the direct how-to for adoption pattern #1.
anthropic-vibe-coding-in-prod-erik-schluntz — Erik Schluntz’s 7-step loop is the practitioner counterpart to Jeremy’s four-pattern adoption framework.
spotify-coding-no-longer-constraint-honk-fleet-shift — Niklas Gustafsson’s “coding is no longer the constraint” thesis is the org-scale version of Jeremy’s “shift your application architecture” thesis.
managed-agents-self-hosted-sandboxes-mcp-tunnels — same-week capability launch; the long-horizon-agents thesis depends on the infrastructure primitives this talk announces.
bun-robun-claude-code-issue-automation — the Boris Cherny + Jarred Sumner London talk; Jarred’s Bun-in-Rust rewrite is Jeremy’s anchor example.
fiona-fung-ai-native-engineering-org — Fiona’s “pick your noisiest workflow” closing prompt operationalizes Jeremy’s “close the agent loop” pattern at org scale.
context-management-claude-code — the practical context-management discipline that makes Jeremy’s “be more ambitious, hand it the whole code base” advice tractable.
plan-mode — the Claude Code feature that operationalizes capability axis #1 (planning before acting).
The Verification Frontier — the synthesis built on this talk’s “no verification → no autonomy” claim, joined with the recursive-self-improvement and DeepMind-AI-for-science framings.

Jonathon's AI Wiki
