Source: wiki synthesis: Claude Agent Hierarchy, Agent Workflow Patterns, Subagents, Agent Teams, Managed Agents, ultrareview, Opus 4.7 Best Practices
Time: Read 15 min | Watch 30 min | Practice 60 min — total ~105 min
Watch First
One reference and one demo before you read the rest of the module.
- Anthropic’s “Common Workflow Patterns for AI Agents” (https://claude.com/blog/common-workflow-patterns-for-ai-agents-and-when-to-use-them) — the official taxonomy of Sequential, Parallel, and Evaluator-Optimizer. ~10 minutes. The decision framework in Section 2 of this module is built directly on this post.
- Anthropic’s “Building Effective AI Agents” white paper (https://resources.anthropic.com/ty-building-effective-ai-agents) — the long-form version. Skim if you want the deeper rationale; skip if you’re tight on time.
If you want to see a multi-agent system actually run, the /ultrareview walkthrough in Section 3 is the live demo. Operator-track folks can watch a Builder run it; Builder-track folks run it themselves.
Why It Matters at WEO
Modules 1, 2, and 3 made you better at single-Claude work — sharper prompts, packaged Skills, the right tool integrations. Most of what you do in a week still belongs in one chat. That’s not changing.
What changes in Module 4 is the ceiling. There are tasks where one Claude — no matter how well prompted — produces uneven output because the same model is being asked to do incompatible jobs at the same time. Research and writing want different reasoning styles. Voice-checking and drafting want different rule sets. Long pipelines drift the further you get from the brief.
Multi-agent patterns fix this by giving each phase its own Claude with its own role, prompt, and model tier. The orchestrator hands work between them. The result: voice consistency holds across long workflows, costs drop because cheaper models do the simpler stages, and the “system of Claudes” becomes an artifact you can save, share, and run on a schedule.
Three things to internalize before you read on:
- Single-Claude is still the default. About 75 percent of the time someone says “I should use multi-agent,” they should actually use a better prompt. This module starts with the decision tree so you don’t over-build.
- Workflow shape is a separate question from agent primitive. Sequential / Parallel / Evaluator-Optimizer is the shape. Subagents / Agent Teams / Managed Agents is the primitive. You pick both, in that order.
- Model tiering inside multi-agent systems is where the cost discipline lives. Running everything on Opus is the most common waste. Running everything on Haiku is the most common quality cliff. The hierarchy matters.
Section 1 — When One Claude Isn’t Enough
Before you reach for multi-agent anything, run the task through the single-agent test. From Agent Workflow Patterns: “First try your pipeline as a single agent, where the steps are just part of the prompt.” If quality holds, you’re done. No orchestration. No primitive choice. No cost overhead from inter-agent communication.
The signs that single-Claude is still the right answer:
- The whole job fits in one v3 artifact (see Module 1).
- The output is short enough that voice drift isn’t a risk.
- You’re not asking the same model to wear two incompatible hats simultaneously (e.g., creative drafter AND strict rules-checker — the rules pass works better as a separate phase).
- You don’t need different permission scopes or model tiers across stages.
If you check all four, ship the single-agent version. Move on.
The signs that you’ve outgrown single-Claude:
- The artifact is over 200 lines and the rules contradict the examples.
- Voice quality drops noticeably toward the end of long outputs (drift).
- One stage needs a different model tier than another (research wants Sonnet’s analytical bite, drafting wants Opus’s creative range).
- You want to validate output against rules in a separate pass rather than trusting the same model that just produced the draft.
- The work spans multiple sessions or runs on a schedule.
When you’re past single-Claude, the next question is which primitive. Three options. The full breakdown lives in Claude Agent Hierarchy; here’s the working version.
Subagents — isolated parallel workers
A subagent is a sub-task that runs in its own context window with its own scoped permissions. It returns a result to the parent and exits. No persistence between calls. No inter-agent communication. Built into Claude Code; available in Cowork.
Reach for subagents when:
- Stages are cleanly separable — research, then write, then voice-check — each handing off to the next without mid-task coordination.
- You want the main context window to stay clean (offload heavy reading to a subagent that returns a summary).
- Different stages need different permission levels (a read-only research agent vs. a writing agent that can edit files).
- You’ll save the agent definition in .claude/agents/ and reuse it across projects.
Cost: free beyond standard token usage. Stable. The default choice for parallelism. This is the primitive Section 5’s worked example uses.
Agent Teams — collaborative peers
Agent Teams are multiple Claude instances that can talk to each other during a task — peer-to-peer coordination, not hub-and-spoke. Experimental in early 2026.
Reach for Agent Teams when:
- Multiple agents need to share discoveries in real time during the task (one agent finds a constraint mid-task, another needs to know).
- Tasks require agents to challenge each other (a reviewer raising flags while a builder is still building).
- One agent’s decisions affect another’s work and you want them to coordinate, not just sequence.
Cost: no fee beyond standard token usage, though inter-agent chatter pushes token consumption somewhat higher. Use when coordination matters more than independence.
Managed Agents — hosted long-running infrastructure
Managed Agents run on Anthropic’s infrastructure with built-in OAuth, checkpointing, and persistence. You define them in YAML or natural language, point them at a job, and let them run.
Reach for Managed Agents when:
- The agent runs continuously or on a schedule (monitor a Slack channel, poll a queue, run nightly).
- You need OAuth integrations without building token-management plumbing.
- The job takes hours or days and survives session boundaries.
- You don’t want to run any infrastructure yourself.
Cost: $58/month baseline for 24/7 operation. Worth it when the alternative is your team running a cron job, handling auth refreshes, and writing checkpoint logic itself.
Decision tree
The single sentence to remember:
Single-Claude first. If that breaks, subagents for independent parallel work. Agent Teams when peers need to talk. Managed Agents when the workflow runs on a schedule or needs hosted infrastructure.
These tiers can be combined. A Managed Agent can spawn subagents when an event fires. An Agent Team working on a refactor can dispatch subagents for independent research subtasks. The right mental model: pick the simplest primitive that handles your hardest requirement, then compose upward only when needed.
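If it helps to see that rule as code, here is a minimal Python sketch of the same decision tree. The function and its boolean flags are hypothetical; this is the module's heuristic expressed as an if-chain, not an API.

def choose_primitive(single_claude_holds_quality: bool,
                     runs_on_schedule_or_needs_hosting: bool,
                     peers_must_coordinate: bool) -> str:
    """Section 1's decision tree as an if-chain (illustrative only)."""
    if single_claude_holds_quality:
        return "single Claude: one well-prompted v3 artifact"
    if runs_on_schedule_or_needs_hosting:
        return "Managed Agent: hosted, OAuth, checkpointing"
    if peers_must_coordinate:
        return "Agent Team: peer-to-peer coordination"
    return "subagents: independent parallel work, the default"

# e.g. a nightly monitoring job that has outgrown a single chat:
print(choose_primitive(False, True, False))   # -> Managed Agent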
A note on Cowork’s Dispatch
Cowork’s Dispatch is the Operator-track equivalent of “send Claude off to do work and come back later.” Under the hood, Dispatch tasks compose nicely with subagent decomposition — you describe the job, Claude breaks it into stages, runs them in its own desktop workspace, and surfaces the result. If you’re not on the Builder track but need async multi-stage work, Dispatch is your entry point. Module 5 covers it in depth alongside Routines.
Section 2 — The Three Workflow Patterns
Once you’ve decided multi-agent is the right answer, the next question is the shape of the work. Anthropic’s taxonomy from Agent Workflow Patterns names three patterns that cover almost every real-world case. Default to Sequential. Upgrade only when forced.
Sequential — “step B needs step A’s output”
The simplest pattern. Each stage’s output feeds the next stage’s input. Linear. Easy to reason about. Easy to debug — when something goes wrong, you walk back through the chain to find which stage broke.
When it’s the right shape:
- Natural dependencies between stages. Research happens before writing. Writing happens before voice-checking.
- Each stage adds specific value the next stage relies on.
- You can describe the flow as “first… then… then…” without contortions.
Cost relative to single-shot: roughly 1.2x to 2x the tokens of a comparable single-agent prompt, depending on how much context the orchestrator passes between stages. The savings come from running cheaper model tiers at stages that don’t need Opus.
How to recognize it in the wild: any pipeline that has obvious phases, each with its own deliverable. The Smile Springs blog assistant in Section 5 is sequential. So is content-translation work where you draft in English then localize. So are most "draft then edit" content workflows.
A concrete one-paragraph example from the dental marketing context: Mel needs a campaign brief for Smile Springs’ summer kids’ check-up promotion. Stage 1 — a research agent pulls competitor messaging, current pediatric-dentistry trends, and Smile Springs’ own historical campaign data. Stage 2 — a strategy agent reads the research and produces a positioning statement plus three concept directions. Stage 3 — a copywriting agent takes the chosen direction and drafts three variants of headline, subhead, and body. The orchestrator passes each stage’s output forward as the next stage’s context. Each stage runs the right model tier for its job — analytical work on Sonnet, creative work on Opus, none on Haiku because nothing in this pipeline is high-volume derivative work.
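A minimal Python sketch of that hand-off, to make the shape concrete. The call_claude helper, model strings, and prompts are placeholders rather than a real SDK call; the only point is that each stage's full output becomes part of the next stage's context.

def call_claude(model: str, prompt: str) -> str:
    # Placeholder for whatever client actually runs the call (SDK, Claude Code, Cowork).
    return f"[{model} output for: {prompt[:40]}...]"

def campaign_brief_pipeline(topic: str) -> str:
    # Stage 1 (Sonnet): research on competitor messaging, trends, historical campaign data.
    research = call_claude("sonnet-4.6", f"Research the market context for: {topic}")
    # Stage 2 (Sonnet): positioning statement plus three concept directions,
    # grounded in the full research output, not a summary of it.
    strategy = call_claude("sonnet-4.6", f"Research findings:\n{research}\n\nWrite a positioning statement and three concept directions.")
    # Stage 3 (Opus): three variants of headline, subhead, and body for the chosen direction.
    return call_claude("opus-4.7", f"Chosen direction:\n{strategy}\n\nDraft three variants of headline, subhead, and body.")

print(campaign_brief_pipeline("summer kids' check-up promotion"))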
Parallel — “tasks are independent but serial is slow”
When the subtasks don’t depend on each other, run them simultaneously. Three sub-patterns:
- Sectioning — different agents handle different aspects of the same problem. (One reviews the SEO of a page, another reviews accessibility, another reviews voice. Same input, different lenses, parallel execution.)
- Evaluation — each agent assesses a different quality dimension. (One agent checks for banned phrases, another checks reading level, another checks brand voice fidelity. All run on the same draft, all return scores, the orchestrator combines.)
- Voting — multiple agents independently produce candidates and the orchestrator picks the strongest, or aggregates them. (Three agents each draft a headline; the orchestrator selects.)
When it’s the right shape:
- Subtasks are genuinely independent — neither needs the other’s output.
- Latency matters and serial would be too slow.
- You want multiple perspectives on the same input rather than a single verdict.
Cost relative to single-shot: parallel doesn’t save tokens — it spends roughly the sum of all parallel agents’ tokens. What it saves is wall-clock time. If three reviews each take 30 seconds and you run them in parallel, the user waits 30 seconds total instead of 90.
The trap with parallel: aggregation. “Design your aggregation strategy before implementing parallel agents.” If you can’t write the combiner before the fan-out, parallel isn’t ready. A common failure mode: parallel reviewers each flag different issues, there’s no shared scoring rubric, and the orchestrator has no way to resolve disagreements. Define how outputs combine first — voting threshold, score weighting, conflict resolution — then build the parallel agents.
A concrete example: the orchestrator dispatches three parallel reviewers to evaluate a Smile Springs landing page draft. Reviewer A scores SEO (title tags, header structure, internal-link density). Reviewer B scores accessibility (alt text, contrast notes, ARIA hints). Reviewer C scores brand voice (banned phrase pass, tone consistency, audience fit for Columbus families). All three return structured JSON with category scores and specific flags. The orchestrator combines the three reports into a single prioritized punch list, weighted by Mel’s priorities (voice issues are P0; accessibility is P1; SEO nits are P2 unless the score drops below 70).
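Here is what "write the combiner first" can look like for that example: a minimal Python sketch of the weighting and escalation rule described above. The category names, priorities, and the 70-point threshold come from the paragraph; the report schema and the function itself are illustrative assumptions.

# Each reviewer returns structured JSON along the lines of:
#   {"category": "voice", "score": 82, "flags": ["banned phrase 'leverage' in the hero copy"]}
PRIORITY = {"voice": "P0", "accessibility": "P1", "seo": "P2"}

def combine(reports: list[dict]) -> list[dict]:
    punch_list = []
    for report in reports:
        category = report["category"]
        priority = PRIORITY[category]
        if category == "seo" and report["score"] < 70:
            priority = "P1"   # escalation rule, decided before the fan-out was built
        for flag in report["flags"]:
            punch_list.append({"priority": priority, "category": category, "issue": flag})
    return sorted(punch_list, key=lambda item: item["priority"])   # P0 first, then P1, P2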
Evaluator-Optimizer — “first draft quality isn’t good enough”
A loop. One agent produces a draft, another evaluates it against criteria, the first agent revises based on the evaluation, repeat. Stops when the evaluator passes the draft or you hit the iteration cap.
When it’s the right shape:
- Clear, measurable quality criteria the evaluator can apply consistently.
- The first-attempt-to-final gap is meaningful enough to justify the loop overhead.
- The cost of a bad final output is high enough to pay for the iteration.
Cost relative to single-shot: the highest of the three patterns. A 3-iteration loop is roughly 4x the tokens of a single draft (three drafts plus three evaluations) — sometimes more if the evaluator is on a higher tier than the drafter. Worth it when the alternative is human editing time at agency rates.
The trap: stopping criteria. “Set clear stopping criteria before you start.” Without explicit stops — max iterations, quality threshold, time budget — evaluator-optimizer loops either run forever or oscillate between two close-but-not-quite versions. Pick at least one stop condition before you build the loop.
A concrete example: Smile Springs needs the year’s most important blog post — a comprehensive new-patient guide that will rank for “new dentist Columbus Ohio” and convert visitors at 5x the site’s average rate. Stage 1 — drafter writes the full 2,500-word guide on Opus. Stage 2 — evaluator on Sonnet scores it against a 12-point rubric (voice, scannability, internal-link strategy, CTA placement, factual accuracy on Smile Springs’ offerings, kid-friendliness of the family-section voice, etc.). If any rubric item fails, the evaluator returns specific revision notes. Stage 3 — drafter revises against the notes. Loop until the evaluator passes all 12 items OR three iterations complete, whichever comes first. The loop never runs more than three times in practice because rubric items are concrete enough that one revision usually clears most of them.
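A minimal Python sketch of that loop's control flow, with both stop conditions (the three-iteration cap and the all-items-pass threshold) made explicit. The drafter and evaluator below are stand-ins for the Opus and Sonnet agents described above, not real calls.

MAX_ITERATIONS = 3    # hard cap agreed before the loop was built
RUBRIC_ITEMS = 12     # the evaluator's 12-point rubric

def draft(revision_notes: str) -> str:
    # Stand-in drafter; a real run would call the Opus drafting agent.
    return "[opus draft]" + (f" (revised against: {revision_notes})" if revision_notes else "")

def evaluate(draft_text: str) -> tuple[int, str]:
    # Stand-in evaluator: returns (rubric items passed, revision notes for the drafter).
    return RUBRIC_ITEMS, ""

def run_loop() -> str:
    notes = ""
    current = ""
    for _ in range(MAX_ITERATIONS):
        current = draft(notes)
        passed, notes = evaluate(current)
        if passed == RUBRIC_ITEMS:   # quality threshold hit: stop early
            return current
    return current                   # iteration cap hit: hand best-so-far to a human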
Patterns nest
The patterns aren’t mutually exclusive — they compose. An evaluator-optimizer loop can use parallel evaluation. A sequential pipeline can parallelize at a bottleneck stage. Section 5’s Smile Springs blog assistant is sequential at the top level; if you wanted to add SEO and meta-description derivation after the draft, those run in parallel because neither depends on the other. The composed shape is “sequential-then-parallel.”
The decision rule: pick the simplest pattern that handles the hardest stage. Add complexity only at the stages that justify it.
Section 3 — Live Demo: /ultrareview
/ultrareview is Anthropic’s own multi-agent system, shipped as a slash command in Claude Code. It runs five parallel reviewers on a pull request — security, performance, design, tests, accessibility — and synthesizes a single prioritized review. It’s a real production multi-agent system you can watch run on your own code.
This is the live observable example for this module. Operator-track learners watch a Builder run it; Builder-track learners run it themselves. Either way, you see what spawns, what each parallel agent does, and what the synthesized output looks like.
What spawns
When you invoke /ultrareview against a PR, the orchestrator dispatches five subagents in parallel. Each one:
- Reads the PR diff with its own scoped context (no cross-contamination between reviewers).
- Runs against a specialized rubric — the security reviewer doesn’t know or care about CSS, the design reviewer doesn’t know or care about SQL injection.
- Returns a structured report with severity-ranked findings.
The five reviewers are not one model running five times. They are five distinct subagents with different system prompts and different rubrics. As the ultrareview article makes clear, the rubric specialization is the whole point — generic “review this PR” prompts underperform specialized parallel reviewers because no single prompt can hold all five lenses with equal rigor.
What each agent does
- Security reviewer. Scans for common vulnerability patterns — auth bypasses, injection vectors, secret leaks, unsafe deserialization, missing CSRF protection. Cross-references the diff against the broader codebase to flag changes that introduce regression risk.
- Performance reviewer. Looks for N+1 queries, unnecessary re-renders, blocking operations on the request path, missing indexes, memory leaks, bundle-size regressions.
- Design reviewer. Reviews UI changes against the design system — token usage, spacing rhythm, accessibility (WCAG), responsive behavior, consistency with established patterns.
- Tests reviewer. Checks coverage of changed lines, missing edge cases, fragile test patterns (timing-dependent, network-dependent), tests that assert implementation rather than behavior.
- Accessibility reviewer. Targets WCAG compliance specifically: keyboard navigation, screen-reader semantics, focus management, contrast ratios.
Each runs in its own context window. Each returns a structured report. The orchestrator collects all five and synthesizes.
What the synthesized report looks like
The orchestrator’s job after the parallel fan-out is to produce a single prioritized punch list — not five disconnected reports. The synthesis groups findings by severity (P0 / P1 / P2 / P3), deduplicates issues that multiple reviewers caught, and surfaces the cross-cutting concerns. A line that’s both a security risk AND breaks accessibility ranks higher than either issue alone because two specialists independently flagged it.
The output you’d see in a Smile Springs context — say, you’re running /ultrareview against a PR that updates the Smile Springs new-patient form:
Ultrareview Report — PR #142 (new-patient-form-redesign)
P0 (3 issues)
- [security + accessibility] Form submits PII over GET. The query
string exposes patient names + DOBs in browser history and server
logs. (Security flagged, Accessibility flagged the resulting
screen-reader leak in the URL bar.)
- [security] No CSRF token on submit endpoint.
- [accessibility] Form fields lack <label> association with
inputs — fails WCAG 2.1 1.3.1.
P1 (5 issues)
...
P2 (4 issues)
...
The synthesis step is itself a Claude call — the orchestrator runs Opus with the five reviewer reports as context and produces the unified punch list. From your seat, you submit one command and read one report. Underneath, six Claude calls happened.
Identifying the workflow pattern
/ultrareview is parallel at the review stage and sequential between the parallel fan-out and the synthesis. The composition: sequential(parallel-fan-out, synthesis). It’s not evaluator-optimizer — there’s no loop, no revision, no iteration cap. The reviewers don’t talk to each other (so it’s not Agent Teams; it’s subagents). It runs to completion in one pass.
If you’re running through Section 3 as the Try It exercise, the diagnostic question is: which workflow pattern does /ultrareview implement, and which agent primitive does it use? Answer: parallel-then-sequential composition, implemented with subagents. If you can name those, you’ve internalized the architecture.
Why this is the live example for this module
/ultrareview is shipped, observable, and re-runnable. You can run it on a sandbox PR right now and watch the output. The Smile Springs example in Section 5 is the one you build yourself; /ultrareview is the one Anthropic built and the one you watch run. Together they cover both ends of the learn-by-watching / learn-by-doing axis.
Section 4 — Model Tiering Inside Multi-Agent Systems
The single biggest cost lever in multi-agent design is matching the model tier to the agent’s job. Get this right and a 4-stage pipeline costs less than a 1-stage Opus pass. Get it wrong and you’ve built an expensive way to produce mediocre output.
The hierarchy as of late April 2026, with rough pricing (confirm against Opus 4.7 Best Practices if you’re sizing budgets):
| Tier | Pricing (per Mtok) | Best for |
|---|---|---|
| Opus 4.7 | $15 input / $75 output | Orchestration, creative writing, hard reasoning, multi-criteria judgment |
| Sonnet 4.6 | $3 input / $15 output | Analysis, summarization, structured editing, voice-checking with rubrics, the workhorse of most pipelines |
| Haiku 4.5 | $1 input / $5 output | High-volume derivative work — extracting fields, classifying, simple transforms, voice-pass against an explicit checklist |
The 5x to 15x cost spread between tiers is the budget you’re spending — or saving — on every agent decision.
How to assign tiers
The decision rule: pick the lowest tier that holds quality. Opus is over-spec for most stages. Haiku is under-spec for most stages. Sonnet does most of the work in most pipelines.
Concrete heuristic:
- Orchestration logic — Opus. The orchestrator makes high-stakes decisions about what to do next, and orchestrator errors cascade into every downstream stage. Don’t cheap out here.
- Generative creative writing — Opus. The drafting stage of any content workflow. The cost is real but the quality gap to Sonnet on creative work is also real, and you can’t recover voice quality at later stages if the draft is generic.
- Structured analysis, summarization, editing, validation — Sonnet. This is most pipelines’ middle stages. Sonnet is fast, accurate, cheaper than Opus, and rarely the bottleneck on quality.
- High-volume rule-checking, classification, field extraction — Haiku. Anything where the agent’s job is “scan for X, return Y” against an explicit rubric. Haiku 4.5 is more capable than Haiku 3 was at running structured rubrics — it’s now genuinely usable for voice-pass and lint-style work.
What goes wrong when tiers are mismatched
Haiku doing orchestration. Saves money, costs reliability. The orchestrator decides which subagent runs next, which model tier to dispatch, how to combine outputs. Haiku misroutes work, hands the wrong context forward, or stalls when the situation isn’t clean. The usual fix for pipelines built this way is upgrading the orchestrator to Sonnet or Opus and re-running the same job — that alone typically clears about 80% of the intermittent failures.
Opus doing classification. The opposite problem. You’ve paid roughly 15x the Haiku rate for output of the same accuracy. On any pipeline that runs frequently, that spread is the difference between a bill many times larger than it needs to be and a roughly $20/month one. The most common cause: someone built the pipeline on Opus end-to-end and never went back to right-size individual stages.
Sonnet doing creative writing where the brand demands voice fidelity. Sonnet writes well but Opus writes more distinctively. For Smile Springs’ new-patient guide — the kind of post that ranks AND converts — the voice gap between Sonnet and Opus is worth the price difference. For an internal status update or a sales-enablement bullet, it’s not.
The discipline: every time you add a stage to a pipeline, ask “what’s the lowest tier that holds quality for this specific job?” Don’t default to whatever model the surrounding stages use. Multi-agent systems compose tiers; that’s part of why they save money over single-shot.
Section 5 — Worked Example: Smile Springs Blog Assistant
This is the centerpiece. You’re going to build a sequential 3-subagent pipeline that produces production-ready Smile Springs blog posts. The pipeline has three stages: research, write, voice-check. Each runs the right model tier for its job. The orchestrator passes data forward and validates between stages.
Audience: Mel needs a steady cadence of Smile Springs blog posts. Topics rotate — Invisalign, kids’ first dental visit, sleep apnea screening, Saturday emergency care, etc. Each post is 800 words, in Smile Springs’ warm-and-trustworthy voice, cited against current dental industry sources, and free of banned phrases.
A single-Claude version of this works. It’s also the failure case: voice drifts in long posts, the same topic two weeks apart reads slightly different, and running everything on Opus costs more than it should. Multi-agent fixes all three.
The orchestrator
The orchestrator is a slash command (Builder track) or a Project instruction (Operator track running through Cowork’s Dispatch). Its job is to dispatch the three subagents in order, hand each one the context it needs, and validate output between stages.
<orchestrator>
You are the orchestrator for the Smile Springs blog pipeline.
Topic: {input from user — e.g., "Invisalign for adults 35-55"}
Workflow:
1. Dispatch the research subagent. Pass the topic and the
Smile Springs brand context. Receive structured findings.
2. Validate the research output: does it have at least 3 cited
sources with URLs? If not, request a re-run before proceeding.
3. Dispatch the write subagent. Pass the topic, brand context,
research findings, and the v3 FAQ-generator-style prompt
pattern (banned phrases, voice rules, examples).
4. Validate the draft output: word count between 700-900,
no banned phrases on a quick scan, has H2 structure.
5. Dispatch the voice-check subagent. Pass the draft.
Receive pass/fail with specific flags.
6. If voice-check fails, return findings to the user.
Do not loop — Mel edits and re-runs.
7. If voice-check passes, return the final draft.
Models:
- You (orchestrator): Opus 4.7
- Research subagent: Sonnet 4.6
- Write subagent: Opus 4.7
- Voice-check subagent: Haiku 4.5
</orchestrator>
Note the absence of a loop. This is sequential, not evaluator-optimizer. Mel gets a draft + a voice-check report; she decides whether to ship, re-run with edits, or escalate. Adding the loop is a Module-5 evolution once Mel has run this enough times to trust the rubric.
Subagent 1 — Research (Sonnet 4.6)
<role>
You are a dental industry researcher with 8 years of family-practice
content experience. You produce structured findings, not opinions.
You cite sources. You flag when claims are speculative.
</role>
<context>
Practice: Smile Springs Family Dental
Location: Columbus, Ohio
Audience: families with kids, adults 35-55
Topic to research: {topic from orchestrator}
</context>
<task>
Find 3 to 5 current sources (2025-2026) on this topic relevant to
new-patient dental marketing. Prioritize: ADA publications, current
clinical guidelines, recent patient-survey data, current Invisalign
or relevant-treatment statistics, Columbus or Midwest market specifics.
For each source, extract:
- The single most useful claim for a Smile Springs blog post
- The source URL and publication date
- Whether the claim is patient-facing (good for the post) or
clinical-facing (cite for credibility, don't quote directly)
</task>
<output_format>
{
"topic": "{topic}",
"sources": [
{
"claim": "string — most useful claim, in plain English",
"url": "string",
"published": "YYYY-MM",
"use": "patient-facing | clinical-citation",
"speculation_flag": "boolean"
}
],
"research_summary": "string — 2-3 sentences synthesizing what
the patient-facing post should know"
}
</output_format>
Why Sonnet: research and structured extraction are exactly Sonnet’s strength. Opus would over-think and produce richer prose; Haiku would miss the nuance on which claims are patient-facing vs. clinical-citation.
Example output for “Invisalign for adults 35-55”:
{
"topic": "Invisalign for adults 35-55",
"sources": [
{
"claim": "Adult Invisalign cases now account for over 30% of all clear-aligner starts, up from under 20% in 2020.",
"url": "https://example-aligner-data-source-2026.com/adult-trends",
"published": "2026-02",
"use": "patient-facing",
"speculation_flag": false
},
...
],
"research_summary": "Adult Invisalign demand is concentrated in the 35-55 segment, driven by both cosmetic motivation and bite-correction needs from years of untreated minor crowding. The post should lead with the practical question — does this fit my life — over the cosmetic angle alone, because adults in this segment are price-conscious and time-conscious."
}
Subagent 2 — Write (Opus 4.7)
<role>
Senior dental marketing copywriter, 10 years of family-practice
experience. Direct-response background. Allergic to corporate jargon.
You write in the voice of a real human talking to other real humans.
</role>
<context>
Practice: Smile Springs Family Dental
Location: Columbus, Ohio
Audience: families with kids, adults 35-55
Voice: warm, plainspoken, trustworthy — not clinical
Differentiator: Saturday appointments, no-wait booking
</context>
<research_findings>
{full JSON output from Subagent 1}
</research_findings>
<task>
Write an 800-word blog post for the Smile Springs blog on the topic
in <research_findings>. Use the patient-facing claims from the
research as the structural backbone. Cite the clinical sources
parenthetically — never as wikilinks or named citations in the body
prose. The post should rank for the topic AND convert visitors to
booking the no-wait appointment.
</task>
<rules>
- Word count: 750-850
- 4-6 H2 sections
- Lead with the patient's question, not the practice's offering
- Mention Saturday appointments at most once, organically — never
as the headline value
- No banned phrases (see <banned_phrases>)
- No clinical jargon unless the topic requires it
- No rhetorical-question openers
- Active voice
- Close with one specific next step (book online, call, etc.)
</rules>
<banned_phrases>
streamline, leverage, world-class, game-changer, state-of-the-art,
revolutionize, dazzle, "in today's fast-paced world",
"are you tired of", "more than ever"
</banned_phrases>
<examples>
{2-3 paragraph snippets from prior on-brand Smile Springs blog posts}
</examples>
<output_format>
Markdown. Title (H1). 4-6 H2 sections. Final paragraph with one CTA.
After the post, include a <metadata> block:
- Word count
- Sources cited
- Banned-phrase check (self-report)
</output_format>
Why Opus: the drafting stage is where voice quality lives. Sonnet writes capably; Opus writes distinctively. The voice gap between them shows up most clearly in mid-paragraph sentence rhythm and in how organically the differentiator (Saturday appointments) gets woven in vs. dropped in as a callout. For Smile Springs’ lead conversion content, that gap is worth the price.
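Between the write and voice-check stages, the orchestrator runs step 4's gate: word count in range, a quick banned-phrase scan, and H2 structure. A minimal Python sketch of what that check could look like; the helper is illustrative rather than part of the actual command, and the banned list and 700-900 range come straight from the prompts above.

import re

BANNED_PHRASES = [
    "streamline", "leverage", "world-class", "game-changer", "state-of-the-art",
    "revolutionize", "dazzle", "in today's fast-paced world",
    "are you tired of", "more than ever",
]

def validate_draft(markdown: str) -> list[str]:
    problems = []
    word_count = len(markdown.split())
    if not 700 <= word_count <= 900:
        problems.append(f"word count {word_count} is outside 700-900")
    lowered = markdown.lower()
    hits = [phrase for phrase in BANNED_PHRASES if phrase in lowered]
    if hits:
        problems.append(f"banned phrases found: {hits}")
    if not re.search(r"^## ", markdown, flags=re.MULTILINE):
        problems.append("no H2 sections found")
    return problems   # empty list = pass step 4; otherwise re-dispatch the write subagent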
Subagent 3 — Voice-check (Haiku 4.5)
<role>
You are the Smile Springs voice-check skill. Your only job is to scan
a draft against an explicit rubric and report findings. You do not
rewrite. You do not improve. You report.
</role>
<rubric>
1. Banned-phrase scan. List every match from the banned-phrase list:
streamline, leverage, world-class, game-changer, state-of-the-art,
revolutionize, dazzle, "in today's fast-paced world",
"are you tired of", "more than ever"
For each match, return the line number and the offending phrase.
2. Voice-deviation scan. Flag sentences that read clinical, corporate,
or generic. Specifically:
- Sentences starting with "At Smile Springs Family Dental, we..."
- Sentences using "our commitment to..."
- Sentences with passive constructions where active would work
- Sentences over 30 words
3. Clinical-tone-creep. Flag any clinical jargon used without
patient-facing context. Examples to flag: gingivitis, prophylaxis,
periodontitis, occlusion, malocclusion — UNLESS the surrounding
sentence translates the term in plain English.
4. CTA presence. Confirm the post closes with a specific next step
(book online, call, etc.). If absent, flag.
</rubric>
<task>
Run the rubric against the draft in <draft>. Return a JSON report.
</task>
<output_format>
{
"banned_phrases": [
{"line": N, "phrase": "string"}
],
"voice_deviations": [
{"line": N, "sentence": "string", "issue": "string"}
],
"clinical_creep": [
{"line": N, "term": "string", "context": "string"}
],
"cta_present": "boolean",
"overall": "pass | fail",
"notes_for_human": "string — one sentence summary"
}
</output_format>
Why Haiku: this is the textbook job for Haiku 4.5. Explicit rubric, structured output, scan-and-report work. Running this on Opus would cost ~15x more for output identical in accuracy. Running it on Sonnet would cost ~3x more for output 90% as good. Haiku 4.5 is the right tier.
Cost comparison
The single-shot Opus version of this job — one prompt, one Claude call, ~3,000 input tokens (brand context + rules + examples) and ~1,500 output tokens (the 800-word post) — costs roughly $0.16 per run.
The 3-subagent pipeline runs:
- Research subagent (Sonnet): ~2,000 input + ~600 output. About $0.015.
- Write subagent (Opus): ~3,500 input (brand + research handoff) + ~1,500 output. About $0.16.
- Voice-check subagent (Haiku): ~2,000 input (rubric + draft) + ~400 output. About $0.004.
- Orchestrator overhead (Opus, light): ~500 tokens total. About $0.01.
Pipeline total: roughly $0.19 per run. Slightly more than single-shot Opus.
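The arithmetic behind those per-run figures, using the per-Mtok prices from Section 4's table. Treat both the prices and the token counts as the module's illustrative numbers, not quotes; this is a quick sketch to sanity-check the totals.

# (input $/Mtok, output $/Mtok) per tier, from the Section 4 table
PRICE = {"opus": (15, 75), "sonnet": (3, 15), "haiku": (1, 5)}

def stage_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    input_price, output_price = PRICE[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

pipeline = (
    stage_cost("sonnet", 2_000, 600)      # research      ~ $0.015
    + stage_cost("opus", 3_500, 1_500)    # write         ~ $0.16
    + stage_cost("haiku", 2_000, 400)     # voice-check   ~ $0.004
    + stage_cost("opus", 500, 0)          # orchestrator  ~ $0.01
)
single_shot = stage_cost("opus", 3_000, 1_500)
print(f"pipeline ~ ${pipeline:.2f}/run, single-shot Opus ~ ${single_shot:.2f}/run")
# -> pipeline ~ $0.19/run, single-shot Opus ~ $0.16/run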
But: the pipeline produces a draft AND a voice-check report. The single-shot version produces a draft only. To match, single-shot Opus needs a second pass (re-running the voice check), which brings its total to roughly $0.21 per run. Pipeline beats single-shot when you require the voice-check artifact, which Smile Springs does.
The bigger savings come at scale. A 50-post-per-month cadence on the pipeline:
- Pipeline: $9.50/month
- Single-shot Opus + voice pass: $10.50/month
Modest. But abandon the tier routing — say, run every stage on Opus for “consistency” — and the same pipeline balloons to roughly $16/month at the same cadence. The discipline of right-sizing each stage is what makes the pipeline cheaper, not the pipeline itself.
The other gain — and this matters more than the dollars — is consistency. Voice variance across runs collapses because each stage has a single specialized job, the same prompt every time, the same tier every time. Mel ships posts that read like they came from one writer because they did: the same v3 prompt scaffold drove every draft.
Composing upward
This is a sequential pattern. If you also wanted parallel SEO and meta-description derivation, you’d add Subagent 4 (SEO check on Sonnet) and Subagent 5 (meta-description writer on Sonnet) running in parallel after Subagent 2 — that’s a sequential-then-parallel composition. Module 5 covers how to schedule this kind of pipeline as a recurring Routine so the whole thing runs on a Friday-morning cron and Mel finds drafts in her review queue Monday morning.
Common Pitfalls
The four mistakes that show up most often when teams start building multi-agent systems.
Subagent prompts that don’t pass enough context forward. The research subagent finds 5 great sources. The write subagent receives “topic: Invisalign” and a generic brand prompt — none of the research findings made it into the handoff. Result: the writer hallucinates statistics, ignores the research, or produces a generic post that could have been written without the research stage. Fix: the orchestrator’s job is to pass full structured outputs between stages, not summaries. If Subagent 1 returns 800 tokens of structured JSON, Subagent 2 gets all 800 tokens of it as context. Token cost is real but tiny relative to the value of not throwing away the previous stage’s work.
Orchestrator that doesn’t validate subagent output before passing it on. Subagent 1 returns broken JSON, or returns 1 source instead of 3. Subagent 2 receives the bad output and produces broken downstream work. By the time you spot the error, three stages have run. Fix: the orchestrator validates every handoff. Concrete validation, not “looks fine” — “does the JSON parse, does it have the required fields, are there at least 3 sources, are the URLs well-formed.” If validation fails, the orchestrator either retries the previous stage with a fix-up prompt or surfaces the error to the human. This is what turns a brittle pipeline into a reliable one.
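For the research-to-write handoff in Section 5, "concrete validation, not vibes" can be as small as this Python sketch. The field names follow the research subagent's output_format above; the helper itself is illustrative, since in practice the orchestrator prompt describes these checks rather than running literal code.

import json
from urllib.parse import urlparse

def validate_research(raw: str) -> list[str]:
    problems = []
    try:
        findings = json.loads(raw)                      # does the JSON parse?
    except json.JSONDecodeError as exc:
        return [f"research output is not valid JSON: {exc}"]
    sources = findings.get("sources", [])
    if len(sources) < 3:                                # at least 3 cited sources?
        problems.append(f"only {len(sources)} sources; need at least 3")
    for source in sources:                              # are the URLs well-formed?
        parsed = urlparse(source.get("url", ""))
        if not (parsed.scheme and parsed.netloc):
            problems.append(f"malformed or missing URL: {source.get('url')!r}")
    if not findings.get("research_summary"):
        problems.append("missing research_summary")
    return problems   # empty = pass the handoff; otherwise retry the research stage or escalate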
Mismatched model tiers. The two failure modes from Section 4: using Haiku for orchestration (saves money, costs reliability) and using Opus for derivative work (the most common waste). Audit your tiering after the pipeline has run for a week. The orchestrator should be Opus or Sonnet. The drafting stage should usually be Opus for brand-voice work, Sonnet for analytical work, never Haiku. The rubric-and-scan stages should usually be Haiku 4.5. If your pipeline doesn’t follow this shape, you have either a quality leak or a budget leak — find which.
Treating multi-agent as the answer when single-Claude was fine. The most expensive mistake in this module. Roughly 3 out of 4 times someone says “I should use multi-agent for this”, the right answer is a better v3 prompt artifact (Module 1). Multi-agent has real overhead — orchestrator complexity, debugging across agents, token cost from inter-stage handoffs — and the overhead isn’t worth it for jobs single-Claude handles well. The single-agent test from Section 1 is the gate: if single-Claude with a v3 prompt produces acceptable output, multi-agent is over-engineering. Build single-agent first. Upgrade only when single-agent breaks.
Key Takeaways
- Single-Claude first. Try the whole pipeline as one well-prompted v3 artifact. Most jobs end here. Multi-agent is the upgrade when single-Claude breaks, not the default.
- Workflow shape and agent primitive are separate decisions. Sequential / Parallel / Evaluator-Optimizer is the shape. Subagents / Agent Teams / Managed Agents is the primitive. Pick shape first, then primitive.
- Default to sequential. It’s the simplest pattern, easiest to debug, and covers most real workflows. Add parallel when subtasks are independent and latency matters. Add evaluator-optimizer when first-draft quality reliably falls short and you have rubric-grade quality criteria.
- Default to subagents. Free, stable, the right primitive for most multi-agent work. Move to Agent Teams when peers need to coordinate. Move to Managed Agents when the workflow runs continuously or on a schedule.
- Tier the models. Opus for orchestration and creative work. Sonnet for analysis, structured editing, voice-checking with rubrics. Haiku for high-volume derivative work. Mismatched tiering is the most common waste in multi-agent design.
- Validate every handoff. The orchestrator’s real job is gating: don’t pass broken or thin output forward. Concrete validation rules, not vibes.
- Pass full context forward. Subagent 2 should receive the full structured output of Subagent 1, not a summary. Token cost is real but tiny relative to the cost of throwing away previous work.
- Aggregation strategy precedes parallelism. Stopping criteria precede iteration. Don’t build the fan-out before you’ve defined the combiner. Don’t build the loop before you’ve defined the stop conditions.
- Patterns nest. Sequential pipelines can parallelize at bottlenecks. Evaluator-optimizer loops can use parallel evaluation. Pick the simplest shape that handles your hardest stage, then compose upward.
- /ultrareview is the live observable example. Run it (Builder) or watch it run (Operator). Identify the workflow pattern (parallel-then-sequential) and the primitive (subagents). That’s the architecture you’re now equipped to build yourself.
Related
- Course index
- Module 1 — Prompts as Reusable Artifacts
- Module 2 — Skills at Depth
- Module 3 — Connecting Claude to Your Tools
- Managed Agents
- Evaluator-Optimizer
- Claude Code Subagents
- Claude Code Agent Teams
- Claude Managed Agents
- ultrareview — Anthropic’s multi-agent code review
- Opus 4.7 Best Practices — model tiering reference
Try It
[Operator] Watch /ultrareview run on a sandbox PR (~30 min)
- Find or ask a Builder teammate to run
/ultrareviewagainst a sandbox PR — any draft pull request from your team’s repos. The exercise works as well on a PR you don’t own. - Read the synthesized report. Note the severity grouping (P0 / P1 / P2 / P3) and how findings are deduplicated across reviewers.
- Identify the workflow pattern. Is it sequential, parallel, evaluator-optimizer, or a composition? Write down your answer before reading on. (Answer: parallel-then-sequential — five reviewers fan out in parallel, then a synthesis stage runs sequentially after.)
- Predict which model tier is doing each agent’s job. Write down a guess for each of the five reviewers and the orchestrator. Then check against the report’s runtime metadata if visible, or ask the Builder running it. (Most likely: orchestrator and synthesis on Opus, reviewers on Sonnet.)
- Pick one finding from the report — ideally a P0 — and walk through which reviewer caught it and why a single-Claude review prompt would likely have missed it. The point: specialization is what makes parallel review work.
You don’t need to run the command yourself for this exercise; you need to internalize the architecture by watching the artifact.
[Builder] Build the Smile Springs blog assistant (~45 min)
- Create the orchestrator command. Save the orchestrator prompt from Section 5 as ~/.claude/commands/smile-springs-blog.md. The command body is the orchestrator definition; it dispatches the three subagents.
- Create the three subagents. Save each subagent definition (research, write, voice-check) as a file in .claude/agents/ with the exact role prompts from Section 5. Set the model tier on each subagent to match: Sonnet for research, Opus for write, Haiku for voice-check.
- Run on three different topics. Pick: Invisalign for adults 35-55, kids’ first dental visit, sleep apnea screening. Run the command on each. Time each run and note the actual cost from your usage dashboard.
- Run the same brief through Opus single-shot. Take the same three topics and run them through a single Opus prompt with the equivalent v3 artifact (research + draft + voice-check inline). Time and cost.
- Compare. Pipeline vs. single-shot, on three dimensions:
- Cost per post
- Voice consistency across the three topics (re-read all three drafts back-to-back; do they sound like one writer?)
- Voice-check quality (does the pipeline catch issues the single-shot missed, or vice versa?)
- Document what you found in a short note saved next to the command file. If the pipeline wins, keep it as a reusable artifact. If single-shot wins on this specific job, document why — you’ve learned something about where multi-agent doesn’t pay.
You’ll know the exercise worked when running the command becomes a one-line ask: “Generate the post for Invisalign for adults 35-55.” The whole pipeline runs without you typing the orchestrator logic, the rubric, or the brand context every time. That’s the artifact compounding — same as Module 1’s v3 prompt, one tier up.
[Both] Redesign one existing workflow as multi-agent (~30 min)
- Pick one workflow you currently do as a single chat that takes 4+ messages. Examples: “research X, draft Y, then I edit Z.” “Brainstorm angles, pick one, write the brief, draft three variants.” “Review the spec, write the implementation notes, run them past the rubric.” Anything serial that you walk Claude through one phase at a time.
- Decide which workflow pattern it actually is. Sequential (most likely)? Parallel at any stage? Evaluator-optimizer if you keep iterating against quality criteria?
- Decompose into 2-3 explicit subagents. For each: what’s the role, what context does it need, what does it return, what model tier is right for it.
- Write down where the orchestrator validates between stages. Concrete rules, not vibes.
- Write down what model tier each subagent runs. Justify each one in one sentence.
- Decide whether to build it. If the workflow runs more than twice a month, build it as actual subagents (Builder track) or as a Cowork Dispatch task with a clear multi-stage prompt (Operator track). If it runs less than that, the design exercise is the value — file it for future use.
The goal isn’t to make every workflow multi-agent. The goal is to build the muscle of recognizing when a workflow has outgrown single-Claude, and designing the upgrade with the right shape and the right tiering. The more you do this, the smaller your output variance gets and the lower your token bill goes.
Done? Move on to Module 5 — Automation Primitives — where the multi-agent pipelines you’ve just built get scheduled to run on their own.