Source: Sam Witteveen — How Claude’s Design Agents Work · YouTube V-djAkt0t-M · uploaded 2026-05-01 · 14:45
Speaker: Sam Witteveen
Subject: Anthropic’s Claude Design product — not how to use it, but the agentic architecture behind it
Audience framing: “Six patterns you can use in your own vertical agent apps — legal agent, sales agent, medical agent, self-education agent, etc.”
A reverse-engineering of Claude Design as a reference architecture for vertical-agent builders. Sam’s thesis: Claude Design is “a really well-done agentic architecture” whose six patterns generalize directly to any vertical agent (legal, sales, medical, education). The qualitative difference isn’t the model — it’s the stack of agentic patterns wrapped around it. He notes Claude Design is built on Opus 4.7 specifically, and his guess is the app was “finished for a while, just literally waiting on this particular version of Claude” — primarily because patterns 4 (self-QA via vision) and 3 (multimodal input) need the upgraded vision model.
The Six Patterns
1. Agentic Context Grounding
Before the agent generates anything, it reads a source of truth. In Claude Design’s case, that’s a detailed “design system” the user builds first — generalized brand context plus specific colors, fonts, button HTML, card components, etc. The agent then decides what to read and what to bring into its context window per task — agentic RAG with progressive disclosure, not a system-prompt dump.
Generalization: legal agent → templated contracts before drafting; sales agent → progressive-disclosure RAG across CRM before outreach.
Sam’s rule: “Never start generating stuff blindly until your agent has grounded itself on the user’s actual data.”
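The grounding-then-select flow can be sketched in a few lines. This is a minimal illustration, not Claude Design's actual implementation: the `GroundingDoc` structure and the keyword-overlap `relevant` check are stand-ins (in a real agent, the relevance decision would itself be an LLM call), but the shape — expose summaries first, pull full bodies into context only per task — is the progressive-disclosure pattern described above.

```python
from dataclasses import dataclass

@dataclass
class GroundingDoc:
    name: str
    summary: str   # short description the agent sees up front
    body: str      # full content, loaded into context only on demand

# Illustrative stand-in for a user-built design system
DESIGN_SYSTEM = [
    GroundingDoc("colors", "brand color palette", "primary: #0B5FFF ..."),
    GroundingDoc("typography", "fonts and sizes", "heading: Inter 32px ..."),
    GroundingDoc("components", "button card HTML", "<button class='btn'>...</button>"),
]

def relevant(task: str, summary: str) -> bool:
    # Stand-in for an LLM relevance call; token overlap keeps the sketch runnable.
    return bool(set(task.lower().split()) & set(summary.lower().split()))

def select_grounding(task: str, docs: list[GroundingDoc]) -> list[str]:
    """Progressive disclosure: the agent sees only summaries, then decides
    which full bodies to bring into the context window for this task."""
    return [d.body for d in docs if relevant(task, d.summary)]
```

A task mentioning the brand color pulls in only the color doc, leaving typography and components out of the context window.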
2. Structured Memory
The grounding pass produces a structured memory artifact — reused multiple times in the current project and persisted for future projects. Critical observation: the format is plain markdown / HTML / CSS (or JSON for some agents), not a proprietary schema. Modern models are good enough at these formats that “this memory is portable to any downstream agent.”
Implication for builders: “Your vertical agent’s first output shouldn’t be a user-facing deliverable. It should be restructuring the user’s context grounding into a memory artifact — style guide, sales qualification script, etc. — and then every subsequent generation gets faster and better.”
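As a concrete sketch of that first output, here is what "restructure grounding into a portable memory artifact" might look like. The `STYLE.md` filename and field layout are assumptions for illustration; the point is that the artifact is plain markdown, readable by any downstream model.

```python
from pathlib import Path

def write_memory(project_dir: str, brand: dict) -> Path:
    """Persist the grounding pass as plain markdown, not a proprietary
    schema, so the memory is portable to any downstream agent."""
    lines = ["# Style Memory", ""]
    for key, value in brand.items():
        lines.append(f"- **{key}**: {value}")
    path = Path(project_dir) / "STYLE.md"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

Every subsequent generation then reads this file instead of re-deriving the brand context from scratch.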
This is precisely the DESIGN.md / Karpathy LLM-wiki / Refero Styles thesis at the platform level.
3. Iterative Refinement Loop (Multimodal)
Most agent UX falls apart at the chatbot. Claude Design has at least five simultaneous input modes:
- Chat (text)
- Voice
- Hover-on-DOM-element — pointing at a specific UI component while describing it
- Draw-on-screen — scribbling instructions / edits directly on the rendered output
- Screenshot-of-own-output — agent captures its own rendered UI as a vision input
Sam highlights a particularly interesting subpoint: “the model is generating [its own follow-up controls] as tokens” — the sliders, buttons, and questions that pop up aren’t pre-built React components. The model emits them, and the agent’s UI wrapper renders them.
Generalization: “For your own vertical agent, don’t force everything through a chatbot. Let the model generate its own follow-up controls based on what it just produced. A sales agent could generate an aggressiveness slider after drafting an email — much more natural UX than forcing the user to type changes.”
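The "controls as tokens" subpoint can be made concrete with a sketch: the model emits a spec (shown here as JSON, an assumed format), and a thin UI wrapper renders whatever controls the spec describes. The aggressiveness-slider example follows Sam's sales-agent illustration; none of this is Claude Design's actual wire format.

```python
import json

# What a model might emit after drafting a sales email (assumed format).
model_output = json.dumps({
    "controls": [
        {"type": "slider", "label": "Aggressiveness", "min": 0, "max": 10, "value": 4},
        {"type": "button", "label": "Regenerate"},
    ]
})

def render_controls(spec_json: str) -> list[str]:
    """The wrapper renders model-emitted controls; widgets are stubbed
    as strings here, but would be real UI components in an app."""
    spec = json.loads(spec_json)
    widgets = []
    for c in spec["controls"]:
        if c["type"] == "slider":
            widgets.append(f"[slider {c['label']}: {c['min']}..{c['max']} @ {c['value']}]")
        elif c["type"] == "button":
            widgets.append(f"[button {c['label']}]")
    return widgets
```

The key design choice is that the control set is not pre-built: the model decides which controls make sense given what it just produced.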
4. Self-QA / Reflection Loop
After generation, the agent renders the output, screenshots its own render, and feeds the screenshot back to the vision model to critique itself — long before the human sees it. It then iterates until the screenshot matches the intent.
This is the pattern most likely waiting for Opus 4.7 — Anthropic’s vision capability had to be strong enough to grade its own work.
Generalization: “If your vertical agent generates anything renderable — contract, email, UI, report, PDF, website — get your agent to render it and critique it before a human even sees it. You’re going to burn more tokens, but the quality is probably worth the tradeoff.”
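The generate→render→screenshot→critique loop reduces to a small control structure. All three step functions are assumed stubs (the real ones would call a renderer and a vision model); the loop shape and the iteration budget are the pattern.

```python
def self_qa_loop(intent, generate, render_screenshot, vision_critique, max_iters=3):
    """Iterate until the vision critique passes or the token budget
    (max_iters cycles) runs out — the human only sees the final draft."""
    draft = generate(intent, feedback=None)
    for _ in range(max_iters):
        image = render_screenshot(draft)          # agent screenshots its own render
        verdict = vision_critique(intent, image)  # vision model grades the work
        if verdict["pass"]:
            break
        draft = generate(intent, feedback=verdict["notes"])
    return draft
```

`max_iters` is the knob behind the token-economics question raised later: each extra cycle is one more generate-plus-critique round.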
5. Multi-Variation Generation
Instead of one answer with the user forced to ask for alternatives, Claude Design proactively generates multiple versions — different layouts, structures, color palettes — and the agent has learned the hierarchy of design decisions (layout matters more than typography, which matters more than accent color). It surfaces the big decisions first, gets buy-in, then fleshes out details.
Generalization: “Figure out the axis of variation for your domain. For a sales agent, it might be tone — warm vs direct. Generate options up front, let the user remove uncertainty by picking. This usually beats clarifying questions.”
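One way to encode "surface the big decisions first" is an ordered hierarchy of axes: generate variants along the highest-leverage undecided axis, get a pick, then move down. The axes and options below are illustrative, not Claude Design's actual learned hierarchy.

```python
# Ordered by decision weight: layout > typography > accent color.
HIERARCHY = [
    ("layout", ["single-column", "two-column", "grid"]),
    ("typography", ["serif", "sans"]),
    ("accent_color", ["blue", "green"]),
]

def propose_variants(choices_so_far: dict) -> tuple[str, list[dict]]:
    """Return the next undecided axis and concrete variants along it,
    so the user picks rather than answering clarifying questions."""
    for axis, options in HIERARCHY:
        if axis not in choices_so_far:
            return axis, [{**choices_so_far, axis: o} for o in options]
    return "done", [choices_so_far]
```

For a sales agent the hierarchy might start with tone; for a legal agent, risk posture — only the axis list changes.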
Sam notes the bottleneck for Claude Code doing this is just throughput — at 50K tokens/sec you could code multiple full apps and let the user pick; we’re not there economically yet.
6. Handoff Pattern
The output is designed to be handed off — to other agents (Claude Design → Claude Code) or to other tools the user already lives in (PowerPoint, Figma, Canva, PDF). Internal storage is mostly HTML/CSS, but exports to all the standard formats.
Generalization: “Don’t trap your output in a proprietary format. Store in markdown / JSON / HTML — every model can read those. Make sure your agent can pass things to the actual tools users currently use.”
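The storage-vs-export split is easy to sketch: one portable internal representation, many exporters. The document shape and the two converters below are illustrative stubs, but the design choice — failing loudly on a missing exporter rather than ever changing the storage format — is the pattern.

```python
# Portable internal representation (a plain dict; HTML/CSS in Claude Design's case).
INTERNAL = {"title": "Q3 Pitch", "body": "Revenue up 40%."}

EXPORTERS = {
    "markdown": lambda d: f"# {d['title']}\n\n{d['body']}\n",
    "html": lambda d: f"<h1>{d['title']}</h1><p>{d['body']}</p>",
}

def export(doc: dict, fmt: str) -> str:
    """Hand off to the user's actual tool by adding exporters,
    never by making the storage format proprietary."""
    if fmt not in EXPORTERS:
        raise ValueError(f"no exporter for {fmt}; add one rather than changing storage")
    return EXPORTERS[fmt](doc)
```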
Sam adds a caveat on agent-to-agent: A2A was proposed a year ago and “hasn’t really panned out like people thought it would” — but the handoff-to-tools half of the pattern is shipping today.
The Closing Takeaway
“The reason Claude Design feels so qualitatively different — it isn’t any one single pattern alone. It’s the combination of these, and especially the combination of the first two — the memory system with the grounding system, being able to then inform all of the other patterns.”
Sam’s diagnosis of why most enterprise AI agent deployments don’t feel like Claude Design: “Most teams are still writing these massive system prompts trying to describe the context, rather than making the context extremely dynamic via the harness, via the memory system, via the grounding. The models are probably good enough now that you just don’t need to do that anymore. Make the agent build its own memory first and then generate from that.”
Why This Matters for the Wiki
This teardown gives a name and a stack-rank to the patterns that several other wiki articles describe in pieces:
| Sam’s pattern | Wiki coverage |
|---|---|
| Agentic context grounding | DESIGN.md format, Refero Styles (design-token DESIGN.md payload), Karpathy’s “what is the thing I should copy paste to my agent?” pattern |
| Structured memory | Simon Scrapes’ connected-skills memory model (soul.md/user.md/memory.md), Nate Herk’s AIS-OS context layer, the Karpathy LLM-wiki pattern this vault implements |
| Iterative refinement (multimodal) | Claude Design entry, Computer Use for vision-driven UI |
| Self-QA loop | 2026 Claude Code AIOS Pattern (convergent practitioner self-improvement loop) |
| Multi-variation | Agent Teams parallel-evaluator workflow pattern |
| Handoff | Cowork Dispatch, HeyGen Studio Automation, Higgsfield MCP |
Combined with Karpathy’s AutoResearch loop, this gives wiki readers a complete reference architecture for vertical agents in 2026 — patterns named, components mapped, concrete examples per pattern.
Try It
Sam’s six-pattern checklist applied to a vertical agent you’re building:
- Grounding: What’s your domain’s “design system” equivalent? Build it first as the user’s first interaction. Then have the agent decide what to load per task — don’t dump.
- Memory: What’s the structured artifact that survives between sessions? Markdown / HTML / CSS / JSON — never proprietary. The agent’s first output is this artifact, then the user-facing thing.
- Multimodal input: Beyond chat — what other input modes fit your domain? Voice + hover + draw work for visual outputs. For sales: voice + select-on-CRM-record + slider-controls-the-model-generates. Don’t trap users in a chatbox.
- Self-QA: What does your output render to? Render it, screenshot it, critique it via the vision model, iterate, then show the user. Burn the tokens.
- Multi-variation: What’s the hierarchy of decisions in your domain? Surface the top-1 or top-2 axes proactively as multiple variants. Layout-vs-typography-vs-color is the design analog; for sales it’s tone, for legal it’s risk posture, for medical it’s evidence threshold.
- Handoff: What tool does the user actually finish their work in? Export to that — PowerPoint, Figma, Canva, PDF, Slack message, Salesforce field. Storage stays in portable formats.
Test the bundle: build a vertical agent with all six, then compare to one with only patterns 3+5 (chat + variations). Sam’s claim is the gap will be qualitative, not incremental.
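The whole checklist chains into one pipeline, which is one way to run the bundle-vs-subset comparison: swap step stubs in and out and diff the results. Every step function here is an assumed placeholder for the reader's own vertical.

```python
def run_vertical_agent(task, steps):
    """Chain the six patterns; each value feeds the next step."""
    ctx = steps["ground"](task)             # 1. grounding on user data
    memory = steps["remember"](ctx)         # 2. structured memory artifact
    intent = steps["refine"](task, memory)  # 3. multimodal refinement
    draft = steps["self_qa"](intent)        # 4. render + critique loop
    options = steps["vary"](draft)          # 5. multi-variation
    return steps["handoff"](options[0])     # 6. export to the user's tool
```

A patterns-3+5-only baseline would stub `ground`, `remember`, and `self_qa` as identity functions and compare output quality against the full chain.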
Related
- Claude Design — the product under teardown here
- Claude Design Walkthrough (Paul Couvert) — usage-side perspective; pairs with this architecture-side teardown
- Claude Design Prototypes & UX — Anthropic’s own framing of the product
- Claude Design Presentations — handoff-pattern example (PowerPoint export)
- Claude Design Use Cases — leopardracer 10-use-cases playbook
- DESIGN.md format — the grounding-pattern format applied to design
- Refero Styles — DESIGN.md payload built for grounding-pattern consumption
- Karpathy: From Vibe Coding to Agentic Engineering — the Software 3.0 frame Sam’s patterns operate within
- Karpathy’s AutoResearch loop — sibling self-improving-loop pattern
- 2026 Claude Code AIOS Pattern — convergent practitioner evidence on patterns 1+2 specifically
- Agent Skills overview — skill format as another expression of the grounding+memory pattern
- Opus 4.7 best practices — vision-model upgrade enables pattern 4
Open Questions
- Confirmation that Claude Design uses these exact patterns. Sam’s teardown is observational — based on watching the product behave. Anthropic has not (as of fetch date) published the architecture publicly. Worth tracking whether they confirm or correct the model.
- Multi-variation hierarchy formalization. Sam claims the agent has “learned the hierarchy of decisions” — is this prompt-engineered, RL’d, or emergent? Open.
- Token-economics of the self-QA loop. Pattern 4 is “burn more tokens for quality.” Whether that’s a constant 2-3× multiplier or something steeper depends on how many critique-and-iterate cycles run per generation.
- A2A status. Sam mentions A2A “hasn’t panned out like people thought” — worth checking against the Agent Teams reality and the broader MCP-vs-A2A landscape.
- Vertical agent reference implementations. Strong candidate for a follow-on connection-style article: pick a non-design vertical (legal, medical, or sales) and walk through all six patterns concretely.