The Marketing Prompt Stack — Voice Patterns, Skill Packaging, and Evals

Source: wiki synthesis of Marketing-Specific Prompt Patterns, the Siu repo (Single Grain), and Prompt Evaluation Tools

Production marketing prompting is three problems wearing one trench coat: sounding like the brand, being reusable and versioned instead of retyped, and proving output quality before anything ships. The wiki solves each in a different topic — voice patterns in prompt-engineering, skill packaging in claude-ai, eval tooling back in prompt-engineering — and no article connects them. Stacked, they are the pipeline that turns an ad-hoc prompt into a governed asset.^[inferred]

Key Takeaways

Three layers, one pipeline. Voice — extraction from real writing samples, checkable ban lists, named-reviewer self-critique (voice patterns). Packaging — SKILL.md plus scripts, “these aren’t prompts, they’re complete workflows” (Siu repo). Evals — Console Evaluate, Promptfoo, Braintrust (evaluation tools).^[inferred]
The voice patterns and the Siu repo converge on the same techniques at different levels of maturity. The repo’s X Long-Form Humanizer ships a 24-pattern AI-slop detector, a packaged, reusable instance of the checkable-ban-list pattern (Claude can verify against a list, not against a vibe). Content Ops recursively scores drafts against domain-expert personas until quality hits 90+, which is the named-reviewer self-critique pattern productized into a scoring loop.^[inferred identification; both halves sourced]
A ban list is a latent test suite. Every banned phrase or structural pattern (“No X. No Y. Just Z.” fragments, rhetorical-question openers) is a deterministic assertion. Promptfoo encodes exactly this class of check (contains, regex assertions in YAML, run locally or in CI); the voice article’s validation-with-retry <validation> block is the same check run inside the prompt instead of outside it.^[inferred bridge]
Judgment-call checks map to LLM-graded scoring. The named-reviewer critique (“read this as a reviewer who hates brochure-speak and rhetorical-question openers”) is the prompt-side version of what evals call llm-rubric (Promptfoo) or LLM-as-judge with human review (Braintrust); the Anthropic Console’s SME 1–5 grading is where the human reviewer plugs in. What changes when the check moves eval-side: it runs independently of the generation, on versioned test cases, with scores tracked over time (“0.72 on faithfulness, down from 0.85 last week”).^[inferred mapping]
Versioning plus gating is what makes it “governed.” Skills live in git (the Siu repo is MIT-licensed; installing a skill means copying its SKILL.md into .claude/skills/), and Braintrust’s GitHub Action posts eval pass/fail as PR comments and can block merges below a quality threshold. A prompt edit that regresses brand voice can therefore be stopped the same way a failing unit test stops a code change.^[inferred]
Order matters on both axes. Voice patterns compose in a fixed order (extract → draft → filter → critique; critique before extraction yields confident genericness with nothing real to check against). Eval tools compose along the lifecycle: Console Evaluate at drafting, Promptfoo in CI, Braintrust in production — where one click turns a real bad response into a permanent regression test.
Real statistics belong inside the stack. The Siu repo’s Growth Engine runs bootstrap confidence intervals and Mann-Whitney U tests for A/B significance inside a skill — the same rigor evals bring to output quality, applied to campaign outcomes.
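
To make the statistics concrete, here is a minimal sketch of the two tests named above, written with the Python standard library only. It is not the Growth Engine's code; the function names, the sample data, and the choice of a percentile bootstrap and a normal-approximation p-value are all assumptions made for illustration.

```python
import math
import random
from statistics import mean


def bootstrap_diff_ci(control, variant, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(variant) - mean(control)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each arm with replacement, at its original size.
        c = rng.choices(control, k=len(control))
        v = rng.choices(variant, k=len(variant))
        diffs.append(mean(v) - mean(c))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def mann_whitney_u(control, variant):
    """Two-sided Mann-Whitney U test via the normal approximation.

    No tie or continuity correction, so treat the p-value as rough
    for small or heavily tied samples.
    """
    n1, n2 = len(control), len(variant)
    # U for the variant arm: pairs where variant beats control (ties count 0.5).
    u = sum(1.0 if v > c else 0.5 if v == c else 0.0
            for v in variant for c in control)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u, p


# Hypothetical per-post engagement rates (percent) for two hook styles.
control = [1.8, 2.1, 2.4, 1.9, 2.2, 2.0, 2.3, 1.7]
variant = [2.6, 2.9, 2.4, 3.1, 2.8, 2.7, 3.0, 2.5]

ci_low, ci_high = bootstrap_diff_ci(control, variant)
u_stat, p_value = mann_whitney_u(control, variant)
print(f"95% CI for lift: [{ci_low:.2f}, {ci_high:.2f}]")
print(f"U = {u_stat}, p ~ {p_value:.4f}")
```

A confidence interval that excludes zero and a small p-value point the same way. For real campaign data a maintained statistics library is likely a better choice than this hand-rolled version.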

The Pipeline: Ad-Hoc Prompt to Governed Asset

Harden the voice. Extract signature phrases, sentence patterns, and real stories from 2–3 genuine writing samples into <examples>; build the ban list (words and structural patterns), starting at ~5 rules and growing one per observed failure; add the named-reviewer critique step. Keep one brand identity per session — voice bleed is a session-level failure no prompt constraint catches.
Package as a skill. Follow the Siu repo’s shape: a SKILL.md the agent reads plus supporting scripts, each category self-contained with its own README and .env.example. The prompt stops being a pasted blob and becomes an installable, forkable unit invoked in natural language.
Attach evals. Start in the Anthropic Console (auto-generate 5–10 test cases, compare prompt versions side by side); encode the ban list as deterministic Promptfoo assertions for CI; once the asset serves production traffic, promote real failures into a Braintrust regression suite.
Gate changes. Re-run the suite on every prompt or skill edit; block the merge on regression. At this point the marketing prompt has the change-control properties of production code.^[inferred pipeline framing; each step’s tooling is sourced]
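
The "attach evals" and "gate changes" steps can be sketched in a few lines: the ban list becomes a set of deterministic checks, and the gate fails if any draft trips one. This is an illustrative stand-in, not Promptfoo or Braintrust; the banned phrases, the regexes for the two structural patterns, and the function names are hypothetical.

```python
import re

# Hypothetical ban list: phrases are matched case-insensitively as substrings.
BANNED_PHRASES = ["game-changer", "in today's fast-paced world", "unlock the power"]

# Structural tells named in the voice patterns, expressed as regexes.
STRUCTURAL_PATTERNS = {
    "no-x-no-y-just-z": re.compile(r"\bNo [^.!?]+\. No [^.!?]+\. Just [^.!?]+[.!?]"),
    # The first sentence of the draft ends in a question mark.
    "rhetorical-question-opener": re.compile(r"^\s*[^.!?]*\?"),
}


def ban_list_violations(text):
    """Return a list of labels, one per banned phrase or pattern found."""
    lowered = text.lower()
    hits = [f"phrase:{p}" for p in BANNED_PHRASES if p in lowered]
    hits += [f"pattern:{name}"
             for name, rx in STRUCTURAL_PATTERNS.items() if rx.search(text)]
    return hits


def gate(drafts):
    """Return (passed, failures) for a dict of draft_name -> text."""
    failures = {}
    for name, text in drafts.items():
        hits = ban_list_violations(text)
        if hits:
            failures[name] = hits
    return not failures, failures


drafts = {
    "post_a": "Tired of slow reports? No dashboards. No spreadsheets. Just answers.",
    "post_b": "We cut reporting time from two days to two hours for a 40-person finance team.",
}
passed, failures = gate(drafts)
print("PASS" if passed else f"FAIL: {failures}")
```

In a CI job, the boolean from gate() would typically drive the process exit code so that a failing draft blocks the merge. Growing the ban list by one rule per observed failure then means adding one entry to a list or dict.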

What Each Layer Catches That the Others Miss

The ban list catches enumerable tells — binary, checkable, growable. It misses judgment calls (“technically allowed but reads off-voice”).
The named-reviewer critique catches judgment calls — but it is self-graded inside the same generation, and the voice article notes a generic “check for quality” produces a rubber-stamp pass. Concrete pet peeves make it real; an external eval makes it independent.
The eval layer catches drift over time and across versions. This is the testing-vs-evaluation distinction: the question is not “did this prompt break” but “is this prompt scoring worse than last week.” It cannot invent the brand’s voice rules; it can only enforce the ones the voice layer earned from observed failures.^[inferred contrast]
One vendor caveat crosses the stack: Promptfoo agreed to be acquired by OpenAI in March 2026. It stays open-source and multi-provider today, but a marketing stack that evaluates Claude output against competitor models should treat that neutrality as time-limited.
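
The "scoring worse than last week" question is simple arithmetic once scores are stored per layer. The sketch below assumes scores in the range 0 to 1 collected over the same versioned test cases; the layer names, the 0.05 tolerance, and the numbers are hypothetical, chosen so the judgment layer reproduces the 0.85 to 0.72 drop quoted earlier.

```python
from statistics import mean


def layer_drift(baseline, current, tolerance=0.05):
    """Compare mean scores per layer; flag layers that dropped by more than tolerance."""
    report = {}
    for layer, base_scores in baseline.items():
        base = mean(base_scores)
        cur = mean(current[layer])
        report[layer] = {
            "baseline": round(base, 3),
            "current": round(cur, 3),
            "regressed": (base - cur) > tolerance,
        }
    return report


# Hypothetical scores over the same four test cases, one week apart.
last_week = {
    "ban_list_pass_rate": [1.0, 1.0, 1.0, 1.0],
    "reviewer_rubric": [0.9, 0.8, 0.85, 0.85],
}
this_week = {
    "ban_list_pass_rate": [1.0, 1.0, 1.0, 1.0],
    "reviewer_rubric": [0.7, 0.75, 0.7, 0.73],
}

report = layer_drift(last_week, this_week)
regressed = [layer for layer, row in report.items() if row["regressed"]]
print(regressed)
```

Here the deterministic layer is unchanged while the judgment layer regressed, which is exactly the case a ban list alone would miss.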

Try It

Build the voice layer for one recurring asset (e.g. a client’s social posts): pull 3 real writing samples, extract phrases/patterns/one real story into <examples>, and write a 5-rule ban list with at least one structural pattern.
Turn the ban list into assertions. Each banned phrase becomes a not-contains check — as Console Evaluate test cases if you want zero setup, or a promptfooconfig.yaml if you want it in git and CI.
Package it as a SKILL.md. Clone ericosiu/ai-marketing-skills and copy the shape of one category — Content Ops (expert panel → 90+ quality gate) is the closest reference for voice-governed content work.
Add one LLM-graded scorer whose rubric is the named reviewer’s pet peeves, verbatim. Keep the deterministic assertions separate from the judgment scorer so failures tell you which layer broke.
Gate it: re-run the suite before any prompt edit ships; if you adopt Braintrust, wire the GitHub Action so a regression blocks the merge.
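
Steps 2 and 4 can be combined in a small generator that turns a ban list and a reviewer rubric into a Promptfoo-style config. This is a sketch under assumptions: the prompt text, provider id, banned phrases, and test variables are placeholders, and the not-regex type relies on Promptfoo's convention of negating an assertion type with a not- prefix, which is worth confirming against the version you install. The output is JSON, which is also valid YAML.

```python
import json


def build_promptfoo_config(prompt, provider_id, banned_phrases, banned_regexes,
                           reviewer_rubric, test_vars):
    """Build a config dict with deterministic and judgment checks kept in separate tests."""
    deterministic = [{"type": "not-contains", "value": p} for p in banned_phrases]
    deterministic += [{"type": "not-regex", "value": rx} for rx in banned_regexes]
    judgment = [{"type": "llm-rubric", "value": reviewer_rubric}]

    tests = []
    for variables in test_vars:
        # Two tests per input, so a failure names the layer that broke.
        tests.append({"description": "ban list", "vars": variables,
                      "assert": deterministic})
        tests.append({"description": "named reviewer", "vars": variables,
                      "assert": judgment})
    return {
        "description": "Brand voice checks",
        "prompts": [prompt],
        "providers": [provider_id],
        "tests": tests,
    }


config = build_promptfoo_config(
    prompt="Write a LinkedIn post about {{topic}} in the client's voice.",
    provider_id="anthropic:messages:YOUR_MODEL_ID",  # placeholder, not a real model id
    banned_phrases=["game-changer", "unlock the power"],
    banned_regexes=[r"No [^.!?]+\. No [^.!?]+\. Just [^.!?]+"],
    reviewer_rubric=("Read this as a reviewer who hates brochure-speak and "
                     "rhetorical-question openers. Pass only if that reviewer "
                     "would approve."),
    test_vars=[{"topic": "quarterly reporting"}],
)
print(json.dumps(config, indent=2))
```

Saving the printed output as promptfooconfig.yaml puts the ban list in git next to the SKILL.md, so a change to either is reviewed in the same pull request.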

Related

Marketing-Specific Prompt Patterns: the voice layer (extraction, ban lists, named-reviewer critique)
The Siu repo (Single Grain): the packaging layer and its shipped quality gates
Prompt Evaluation Tools: the eval layer (Console vs Promptfoo vs Braintrust)
Module 1, Prompts as Reusable Artifacts: the constraint-stacking and validation-with-retry techniques underneath the voice layer
Prompt Engineering Essentials: the foundational few-shot and self-correction techniques
Marketing Skills Bundle (Corey Haines): sibling packaging-layer collection, copy/social-leaning
Claude Code as AI Operating System: the “run the skill, review, improve the workflow” operating frame the Siu repo instantiates

Open Questions

Can per-brand eval suites detect voice bleed? The voice article says bleed is a session-level failure with a procedural fix (one identity per session); whether a per-brand regression suite would catch bleed after the fact is untested in any source.^[inferred]
No source documents a marketing team running all three layers end-to-end. Each layer is production-tested on its own; the composed pipeline is a projection, not a case study.

Jonathon's AI Wiki
