Source: Vibe_coding_in_prod_Code_w_Claude.md

Erik Schluntz is an Anthropic researcher on coding agents and co-author with Barry Zhang of Building Effective Agents. In this 16-minute Code w/ Claude talk, he draws a sharp line between panic prompting (back-and-forth bug-fix chat with the AI in a tight loop) and vibe coding done right (“forget the code exists, but not that the product exists”). His headline argument: as model task-length doubles every seven months, engineers who insist on reading every line they ship will become the bottleneck — the discipline that scales is being Claude’s product manager, not its reviewer.

Creator: Erik Schluntz (Anthropic, coding-agent research)
URL: https://www.youtube.com/watch?v=fHWFF_pnqDk
Duration: 16:19
Platform: YouTube (Code w/ Claude series)

Key Takeaways

  • Vibe coding ≠ heavy AI use. Tight feedback loops with Cursor or Copilot are not vibe coding by Karpathy’s original definition. Vibe coding is “fully give in to the vibes, embrace exponentials, and forget that the code even exists,” and “forget the code exists” is the load-bearing phrase.
  • Forget the code, not the product. Schluntz’s safe-deployment thesis: you can stop reading every line, but you can never stop owning the product.
  • The exponential is the reason to care. AI task-length is doubling every seven months — already at ~1 hour. When it’s a full day or a full week of work, lock-step review becomes impossible. Treat “what happens when the model is a million times faster?” as a product roadmap, not science fiction.
  • You are Claude’s PM, not its IDE-mate. “Ask not what Claude can do for you, but what you can do for Claude.” Spend 15-20 minutes assembling context — codebase tour, constraints, requirements, files-to-change, patterns-to-follow — into a single plan before you let Claude execute. Schluntz typically builds this plan in a separate conversation with Claude first.
  • Verifiability is the abstraction. Every other domain where humans manage work they don’t understand (CTOs, PMs, CEOs reviewing accountants) solves this with acceptance tests, product use, and spot-checks. Software needs the same: design systems with human-verifiable inputs/outputs so correctness can be checked without reading the implementation.
  • Tech debt is the one thing you still need to read code for. No verification proxy exists yet. So concentrate vibe coding on leaf nodes — the parts of the codebase nothing else depends on, where tech debt is contained.
  • Schluntz shipped a 22,000-line Claude-written PR into Anthropic’s production RL codebase by combining all of the above: days of human context-prep, leaf-node scope, heavy human review on the extensible parts, and stress tests on stability with human-verifiable I/O.
  • Don’t over-constrain the model. Schluntz: “our models do best when you don’t over constrain them.” Treat the prompt like onboarding a junior engineer, not like writing a PRD template.
  • The marginal-cost shift matters more than the time saved. When something that took two weeks now takes a day, you don’t just save time; you start tackling features you wouldn’t have attempted at all.
  • Vibe coding in prod is not for non-technical operators. You need enough domain knowledge to ask the right questions and recognize danger areas (security being the most common failure).

Panic Prompting vs Vibe Coding

The framing that opens the talk: most “AI coding” is not vibe coding. If you’re using Cursor or Copilot and reviewing diffs as the model writes them — still in a tight feedback loop with the model — that’s heavy AI-assisted coding, but you haven’t given anything up.

Vibe coding, by Karpathy’s original definition, is when you “fully give in to the vibes, embrace exponentials, and forget that the code even exists.” The load-bearing phrase is “forget that the code even exists.”

Schluntz’s safe-deployment refinement: forget the code exists, but not that the product exists. The analogy he reaches for is compilers. Early developers used compilers but still read the assembly output to sanity-check what the compiler emitted. That doesn’t scale: at a certain point, system size forces you to trust the abstraction. We all still know there’s assembly under the hood; most of us never look at it; we still build good software.

The exponential is what forces the issue. Right now AI can do roughly an hour of work per task, so you can still review everything. But task length doubles every seven months. Project that forward a year or two and the model is producing a day’s worth of work at a time, then a week’s. “There is no way that we’re going to be able to keep up with that if we still need to move in lock step.”
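The doubling claim can be turned into rough arithmetic. The sketch below takes the one-hour starting point and the seven-month doubling period from the talk; the 8-hour working day and 40-hour working week are assumptions of this example, and the function names are hypothetical.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling period quoted in the talk
START_HOURS = 1.0      # "roughly an hour of work per task" today


def task_hours_after(months: float) -> float:
    """Projected task length in hours after `months` of pure exponential growth."""
    return START_HOURS * 2 ** (months / DOUBLING_MONTHS)


def months_until(target_hours: float) -> float:
    """Months until the projected task length reaches `target_hours`."""
    return DOUBLING_MONTHS * math.log2(target_hours / START_HOURS)


if __name__ == "__main__":
    # Assumed milestones: an 8-hour working day and a 40-hour working week.
    for label, hours in [("8-hour day", 8.0), ("40-hour week", 40.0)]:
        print(f"{label}: about {months_until(hours):.0f} months")
    # prints:
    # 8-hour day: about 21 months
    # 40-hour week: about 37 months
```

On these assumptions a full working day is a bit under two years out and a full working week about three. The exact dates depend heavily on the starting point and on what counts as a day of work, but the horizon is years, not decades.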

The 7-Step Loop (Discipline, Not Magic)

The companion X bookmark from @ziwenxu_ distilled Schluntz’s talk into seven steps. Most map directly to a section of the transcript; steps 3 and 4 are inferred from it rather than stated. Together they form the operating discipline that makes “forget the code” safe.

1. Context before code

Schluntz: “When I’m working on features with Claude, I often spend 15 or 20 minutes collecting guidance into a single prompt and then let Claude cook after that.” And critically, that 15-20 minutes isn’t him writing the prompt by hand — it’s “a separate conversation where I’m talking back and forth with Claude. It’s exploring the codebase. It’s looking for files. We’re building a plan together that captures the essence of what I want, what files are going to need to be changed, what patterns in the codebase should it follow.”

In practice: a Claude session whose only output is a plan artifact. That plan then becomes the input to a different session (or a /compact checkpoint) that actually writes the feature. See Context Management in Claude Code for the pattern of using one session to build context, another to execute on it.

2. Plans before diffs

The natural follow-on to step 1: once you have the artifact — files-to-change, requirements, patterns, examples of similar features — then you let Claude execute. Schluntz reports “a very, very high success rate” once that effort is upfront.

On structure: don’t over-engineer the plan. “I would just think about it as like a junior engineer what you would give them in order to succeed.” Skip the rigorous PRD template. Skip over-constraining the implementation when you don’t care how the model gets there.
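One way to picture the plan artifact from steps 1 and 2 is as a small structured record rendered into a single prompt. This is a minimal sketch, not Schluntz's actual tooling: the `Plan` class, its field names, and the example file paths are all hypothetical, chosen only to mirror the ingredients named above (requirements, files to change, patterns to follow).

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Hypothetical plan artifact: the context a junior engineer would need."""

    goal: str
    requirements: list[str] = field(default_factory=list)
    files_to_change: list[str] = field(default_factory=list)
    patterns_to_follow: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the plan as one plain-text prompt for the executing session."""
        sections = [
            ("Requirements", self.requirements),
            ("Files to change", self.files_to_change),
            ("Patterns to follow", self.patterns_to_follow),
        ]
        lines = [f"Goal: {self.goal}"]
        for title, items in sections:
            if items:  # leave out empty sections instead of padding the prompt
                lines.append("")
                lines.append(f"{title}:")
                lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Illustrative values only; none of these paths come from the talk.
plan = Plan(
    goal="Add CSV export to the reports page",
    requirements=["Export respects the active filters"],
    files_to_change=["reports/views.py", "reports/export.py"],
    patterns_to_follow=["Mirror the existing PDF export in reports/pdf.py"],
)
print(plan.to_prompt())
```

The record deliberately has no field for implementation details: it says what must be true and where to look, and leaves the how to the model.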

3. CLAUDE.md before chaos

Schluntz didn’t say “CLAUDE.md” verbatim, but the discipline is the same one the CLAUDE.md primer codifies: persistent context lives somewhere stable, not in your last 10 chat messages. Schluntz’s equivalent is the front-loaded plan artifact plus the “think like a PM” framing. The wiki’s broader pattern (see gstack and GSD) generalizes this into project-level CLAUDE.md hierarchies.

4. Git before experiments

Implicit in Schluntz’s production RL example: the 22,000-line change went through a real PR, not an in-place rewrite. Branches, diffs, and revertibility are the same guard rails any senior engineer uses, applied to Claude-generated work. The point is not that the AI shouldn’t experiment; it’s that experiments should be cheap to throw away.

5. Tests before trust

On test-driven development with Claude, Schluntz is specific: Claude will go down rabbit holes writing tests that are too implementation-specific. His prescription: “Just write three end-to-end tests and, you know, do the happy path, an error case, and this other error case. I’m kind of like very prescriptive about that. I want the test to be like general and end to end.”

His verification habit: “A lot of times when I’m vibe coding the only part of the code or at least the first part of the code that I’ll read is the tests to make sure that you know if I agree with the tests and the tests pass then I feel pretty good about the code.” This is the verifiability abstraction in action — review the tests, not the implementation; let “tests pass” be the proxy for correctness.

This is also why he emphasizes designing systems with human-verifiable inputs and outputs. The 22k-line RL PR shipped because he could stress-test stability for long durations on inputs and outputs he understood, without reading the implementation in between.
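The "three end-to-end tests" prescription might look like the sketch below. The system under test, `load_settings`, is a toy invented for this example; the point is only the shape of the tests: each calls the public entry point and checks observable behavior (one happy path, two named error cases) without touching internals.

```python
def load_settings(text: str) -> dict[str, str]:
    """Toy system under test: parse 'key=value' lines into a dict."""
    settings: dict[str, str] = {}
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are allowed
        if "=" not in line:
            raise ValueError(f"line {number}: expected key=value")
        key, value = (part.strip() for part in line.split("=", 1))
        if key in settings:
            raise ValueError(f"line {number}: duplicate key {key!r}")
        settings[key] = value
    return settings


def test_happy_path() -> None:
    assert load_settings("host = example.org\nport=8080\n") == {
        "host": "example.org",
        "port": "8080",
    }


def test_malformed_line_is_rejected() -> None:
    try:
        load_settings("host example.org")
    except ValueError as err:
        assert "line 1" in str(err)
    else:
        raise AssertionError("malformed line was accepted")


def test_duplicate_key_is_rejected() -> None:
    try:
        load_settings("port=1\nport=2")
    except ValueError as err:
        assert "duplicate key" in str(err)
    else:
        raise AssertionError("duplicate key was accepted")


if __name__ == "__main__":
    for test in (test_happy_path, test_malformed_line_is_rejected,
                 test_duplicate_key_is_rejected):
        test()
    print("3 end-to-end tests passed")
```

Because the tests assert only on inputs and outputs, a reviewer can agree or disagree with them without reading `load_settings` at all, which is what makes "read the tests first" a workable habit.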

6. Small scopes before big asks

The leaf-node rule. Schluntz: focus vibe coding on “parts of the code and parts of our system that nothing depends on them. They are kind of the end feature.” Trunks and load-bearing branches — the core architecture other things will be built on — still need deep human understanding because tech debt there compounds.

He flags tech debt as the one verification gap that still requires reading code. “Most other systems in life, you have ways to verify the things you care about without knowing the implementation. Tech [debt] I think is one of those rare things where there really isn’t a good way to validate it other than being an expert in the implementation itself.”

The mitigation isn’t to avoid vibe coding — it’s to scope it. Leaf nodes contain the debt; trunks stay human-reviewed.
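The leaf-node rule can be approximated mechanically if you have a module dependency map. The following is a rough sketch under that assumption: `find_leaves` and the example graph are hypothetical, and "leaf" is used in the talk's sense of a module that nothing else depends on.

```python
def find_leaves(depends_on: dict[str, set[str]]) -> set[str]:
    """Return the modules that no other module depends on.

    `depends_on` maps each module to the set of modules it imports or calls.
    """
    depended_upon: set[str] = set()
    for dependencies in depends_on.values():
        depended_upon |= dependencies
    # Anything that appears as a dependency is part of the trunk or a branch.
    return set(depends_on) - depended_upon


# Hypothetical repo layout, for illustration only.
graph = {
    "billing_report": {"db", "auth"},
    "admin_dashboard": {"db", "auth"},
    "auth": {"db"},
    "db": set(),
}
print(sorted(find_leaves(graph)))  # ['admin_dashboard', 'billing_report']
```

A static check like this only answers "does anything depend on it today?". Whether a module is likely to become a dependency later is still a judgment call.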

7. Taste before autopilot

Schluntz’s explicit caveat: “I don’t think that vibe coding in prod is for everybody. I don’t think that people that are fully non-technical should go and try to build a business fully from scratch… they’re not able to ask the right questions. They’re not able to be an effective product manager for Claude.”

An audience question on security pushed on this, citing recent reports of vibe-coded apps leaking API keys. Schluntz’s answer was the same: “it all comes down to this first point here of like being Claude’s PM and understanding enough about the context to basically know what is dangerous, know what’s safe, and know where you should be careful.” For his own RL example, the system ran fully offline: no payment surface, no auth surface, no secrets to leak. He chose a domain where the safety questions were tractable.

Taste is what tells you which leaf nodes are safe to vibe-code and which still need eyes on every line.

The 22,000-Line PR — How Anthropic Vibe Coded in Prod

Schluntz showed the actual GitHub diff. Claude wrote most of a 22k-line change to Anthropic’s production RL codebase. The recipe:

  1. Days of human context work. Not one prompt. “There was still days of human work that went into this of coming up with the requirements, guiding Claude and figuring out what the system should be.”
  2. Concentrated in leaf nodes. The team chose parts of the codebase they didn’t expect to need to change in the near future.
  3. Heavy human review on the extensible parts. The trunks-and-branches review wasn’t skipped — it was concentrated on the small fraction that mattered for future extensibility.
  4. Designed for stress testing. “We carefully designed stress tests for stability… we designed the whole system so that it would have very easily human verifiable inputs and outputs.” Long-duration stress tests verified stability; I/O design verified correctness.
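The talk does not show the actual stress tests, so the following is only an illustration of the idea in point 4: run a component for many steps and compare its output against something a human can verify without reading the implementation. `RecentItems` and `stress` are hypothetical names for a toy component and its harness.

```python
import random
from collections import deque


class RecentItems:
    """Toy component under test: remembers the most recent `capacity` items."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self._items: deque[int] = deque(maxlen=capacity)

    def push(self, item: int) -> None:
        self._items.append(item)

    def snapshot(self) -> list[int]:
        return list(self._items)


def stress(iterations: int, capacity: int, seed: int = 0) -> int:
    """Push random items and compare each output with a slice of the inputs."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    component = RecentItems(capacity)
    history: list[int] = []
    for step in range(iterations):
        item = rng.randrange(1_000_000)
        component.push(item)
        history.append(item)
        expected = history[-capacity:]  # "the last N inputs", checkable by eye
        assert component.snapshot() == expected, f"diverged at step {step}"
    return iterations


if __name__ == "__main__":
    print(f"{stress(iterations=10_000, capacity=8)} steps, no divergence")
```

The expected value here is deliberately trivial to state ("the last N things I pushed"). That is the design property being illustrated: correctness is judged from inputs and outputs, and run length exercises stability.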

Result: a week’s worth of work delivered in a fraction of the time, with the same confidence as a hand-written change. The downstream realization: the marginal cost of software dropped, which changed what work was worth doing at all.

On Learning, Compaction, and Workflow

A few Q&A points worth surfacing:

  • Learning without the grind. “I have found that I’m able to learn about things so much more quickly by using these AI tools.” When Claude uses a library Schluntz hasn’t seen, he asks Claude to explain it. The pair-programmer effect cuts learning curves. But “people that are lazy are not going to learn. They’re just going to glide by.”
  • Compaction discipline. Schluntz compacts or starts a new session “kind of whenever I get Claude to a good stopping point where it feels like, okay, as a human programmer, like when would I kind of stop and take a break and maybe like go get lunch.” His specific pattern: have Claude find all the relevant files and make a plan → write the plan into a document → compact, “and that gets rid of 100k tokens that it took to create that plan and find all these files and boils it down to a few thousand tokens.” See Context Management for the deeper write-up.
  • Tool mix. He uses Claude Code (terminal) and Cursor in VS Code together: Claude Code to start things, and Cursor for surgical line-level changes when he knows exactly what needs to change.
  • Onboarding to a new codebase. Before writing the feature, “I use Claude Code to help me explore the codebase. So I might say like tell me where in this codebase auth happens… tell me similar features to this and like have it tell me the file names. Have it tell me the classes that I should look at.” Build the mental model first; vibe code afterward.

Personal Backstory

Schluntz broke his hand biking to work and was in a cast for two months. Claude wrote all his code during that period. The constraint forced him to figure out how to make this work effectively — which then informed how he ships the discipline back into Anthropic’s products and models.

Try It

  • Add the 7-step loop to your CLAUDE.md. Open the project’s CLAUDE.md and add the seven discipline points as an “Engineering Loop” section, each paired with a one-line example of what the step looks like in your codebase.
  • Use a separate session to build the plan. Next feature: open a Claude Code session whose only job is to explore the codebase, surface the right files, and produce a plan markdown. Then /compact (or start a fresh session) so the roughly 100k tokens spent building the plan shrink to a few thousand, and execute from the plan.
  • Pick your leaf nodes. Walk your repo and label each module as trunk (other things depend on it, tech debt compounds) or leaf (end feature, nothing builds on it). Vibe code the leaves first; reserve trunks for hand review.
  • Design your next system for human-verifiable I/O. Before writing the implementation, define: what are the inputs a non-implementer can construct, and what are the outputs a non-implementer can verify? If you can’t answer that, the system isn’t yet safe to vibe-code.
  • Replace “make tests” with “make three end-to-end tests.” Be prescriptive: happy path, named error case, other named error case. Read the tests; let the tests gate the implementation review.
  • Run the security check. For any vibe-coded surface, ask: does this touch auth, payments, secrets, or user data? If yes, the leaf-node rule doesn’t apply — that’s a trunk regardless of where it sits in the dependency graph.
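The last two checks in this list combine into a simple decision rule. A minimal sketch, assuming you tag each change with the surfaces it touches; the tag names, the `review_level` function, and its two return values are hypothetical labels for this example.

```python
# Surfaces named in the security check above; the tag spellings are assumptions.
SENSITIVE_SURFACES = frozenset({"auth", "payments", "secrets", "user_data"})


def review_level(is_leaf: bool, touches: set[str]) -> str:
    """Return 'line-by-line' or 'tests-and-io' for a proposed change."""
    if touches & SENSITIVE_SURFACES:
        # Sensitive surfaces count as trunk wherever they sit in the graph.
        return "line-by-line"
    return "tests-and-io" if is_leaf else "line-by-line"


print(review_level(is_leaf=True, touches={"charts"}))   # tests-and-io
print(review_level(is_leaf=True, touches={"secrets"}))  # line-by-line
print(review_level(is_leaf=False, touches=set()))       # line-by-line
```

The rule is intentionally conservative: being a leaf is necessary for lighter review but never sufficient on its own.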