Source: wiki synthesis: Marketing-Specific Prompt Patterns, Single Grain), Prompt Evaluation Tools

Production marketing prompting is three problems wearing one trench coat: sounding like the brand, being reusable and versioned instead of retyped, and proving output quality before anything ships. The wiki solves each in a different topic — voice patterns in prompt-engineering, skill packaging in claude-ai, eval tooling back in prompt-engineering — and no article connects them. Stacked, they are the pipeline that turns an ad-hoc prompt into a governed asset.^[inferred]

Key Takeaways

  • Three layers, one pipeline. Voice: extraction from real writing samples, checkable ban lists, named-reviewer self-critique (voice patterns). Packaging: SKILL.md plus scripts, “these aren’t prompts, they’re complete workflows” (Siu repo). Evals: Console Evaluate, Promptfoo, Braintrust (evaluation tools).^[inferred]
  • The voice patterns and the Siu repo converge on the same techniques at different levels of maturity. The repo’s X Long-Form Humanizer ships a 24-pattern AI-slop detector, a packaged, reusable instance of the checkable-ban-list pattern (Claude can verify against a list, not against a vibe). Content Ops recursively scores drafts against domain-expert personas until quality hits 90+, which is the named-reviewer self-critique pattern productized into a scoring loop.^[inferred identification; both halves sourced]
  • A ban list is a latent test suite. Every banned phrase or structural pattern (“No X. No Y. Just Z.” fragments, rhetorical-question openers) is a deterministic assertion. Promptfoo encodes exactly this class of check (contains, regex assertions in YAML, run locally or in CI); the voice article’s validation-with-retry <validation> block is the same check run inside the prompt instead of outside it.^[inferred bridge]
  • Judgment-call checks map to LLM-graded scoring. The named-reviewer critique (“read this as a reviewer who hates brochure-speak and rhetorical-question openers”) is the prompt-side version of what evals call llm-rubric (Promptfoo) or LLM-as-judge with human review (Braintrust); the Anthropic Console’s SME 1–5 grading is where the human reviewer plugs in. What changes when the check moves eval-side: it runs independently of the generation, on versioned test cases, with scores tracked over time (“0.72 on faithfulness, down from 0.85 last week”).^[inferred mapping]
  • Versioning plus gating is what makes it “governed.” Skills live in git (the Siu repo is MIT; install is copying a SKILL.md into .claude/skills/), and Braintrust’s GitHub Action posts eval pass/fail as PR comments and can block merges below a quality threshold — so a prompt edit that regresses brand voice can be stopped the same way a failing unit test stops a code change.^[inferred]
  • Order matters on both axes. Voice patterns compose in a fixed order (extract → draft → filter → critique; critique before extraction yields confident genericness with nothing real to check against). Eval tools compose along the lifecycle: Console Evaluate at drafting, Promptfoo in CI, Braintrust in production — where one click turns a real bad response into a permanent regression test.
  • Real statistics belong inside the stack. The Siu repo’s Growth Engine runs bootstrap confidence intervals and Mann-Whitney U tests for A/B significance inside a skill — the same rigor evals bring to output quality, applied to campaign outcomes.
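The two tests named in the last takeaway can be sketched with the Python standard library alone. This is an illustration of the techniques, not the Growth Engine's code (no source here shows that implementation). The function names `bootstrap_ci` and `mann_whitney_u` and the conversion figures are hypothetical, and the U test uses the normal approximation without a tie correction, which is a simplification.

```python
import random
from statistics import NormalDist, mean


def bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(b) - mean(a)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the group sizes.
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(mean(resampled_b) - mean(resampled_a))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def mann_whitney_u(a, b):
    """Return (U for group a, two-sided p-value via normal approximation)."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Tied values share the average of the ranks they span.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(a), len(b)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u_a - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_a, p


# Hypothetical daily conversion rates (%) for two ad variants.
variant_a = [2.1, 2.4, 1.9, 2.2, 2.0, 2.3, 2.1]
variant_b = [2.6, 2.9, 2.5, 2.8, 2.4, 3.0, 2.7]

low, high = bootstrap_ci(variant_a, variant_b)
u_stat, p_value = mann_whitney_u(variant_a, variant_b)
print(f"lift CI: [{low:.2f}, {high:.2f}]  U={u_stat}  p={p_value:.4f}")
```

If SciPy is available, `scipy.stats.mannwhitneyu` and `scipy.stats.bootstrap` are the more usual choices; the hand-rolled version is only meant to show what the tests compute.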

The Pipeline: Ad-Hoc Prompt to Governed Asset

  1. Harden the voice. Extract signature phrases, sentence patterns, and real stories from 2–3 genuine writing samples into <examples>; build the ban list (words and structural patterns), starting at ~5 rules and growing one per observed failure; add the named-reviewer critique step. Keep one brand identity per session — voice bleed is a session-level failure no prompt constraint catches.
  2. Package as a skill. Follow the Siu repo’s shape: a SKILL.md the agent reads plus supporting scripts, each category self-contained with its own README and .env.example. The prompt stops being a pasted blob and becomes an installable, forkable unit invoked in natural language.
  3. Attach evals. Start in the Anthropic Console (auto-generate 5–10 test cases, compare prompt versions side by side); encode the ban list as deterministic Promptfoo assertions for CI; once the asset serves production traffic, promote real failures into a Braintrust regression suite.
  4. Gate changes. Re-run the suite on every prompt or skill edit; block the merge on regression. At this point the marketing prompt has the change-control properties of production code.^[inferred pipeline framing; each step’s tooling is sourced]
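Step 4 comes down to comparing the current eval run against a stored baseline. A minimal sketch follows, assuming scores are already available as metric-to-float dictionaries; how they are exported from Promptfoo or Braintrust is not covered here. The names `find_regressions` and `gate`, the metric names, and the 0.02 tolerance are all hypothetical.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """List (metric, old, new) for every metric that dropped beyond tolerance.

    A metric present in the baseline but missing from the current run
    is treated as a regression rather than silently ignored.
    """
    regressions = []
    for metric, old in baseline.items():
        new = current.get(metric)
        if new is None or new < old - tolerance:
            regressions.append((metric, old, new))
    return regressions


def gate(baseline, current, tolerance=0.02):
    """Return a process exit code: 0 lets the change through, 1 blocks it."""
    regressions = find_regressions(baseline, current, tolerance)
    for metric, old, new in regressions:
        print(f"REGRESSION {metric}: {old} -> {new}")
    return 1 if regressions else 0


# Hypothetical scores for one prompt edit.
baseline_scores = {"faithfulness": 0.85, "ban_list_pass_rate": 1.0}
current_scores = {"faithfulness": 0.72, "ban_list_pass_rate": 1.0}

exit_code = gate(baseline_scores, current_scores)
# In a CI job the script would end with sys.exit(exit_code),
# so a nonzero code fails the check and blocks the merge.
```

The tolerance exists because LLM-graded scores vary between runs; setting it to zero would make the gate fail on noise.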

What Each Layer Catches That the Others Miss

  • The ban list catches enumerable tells — binary, checkable, growable. It misses judgment calls (“technically allowed but reads off-voice”).
  • The named-reviewer critique catches judgment calls, but it is self-graded inside the same generation, and the voice article notes that a generic “check for quality” produces a rubber-stamp pass. Concrete pet peeves make it real; an external eval makes it independent.
  • The eval layer catches drift over time and across versions — the testing-vs-evaluation distinction: not “did this prompt break” but “is this prompt scoring worse than last week.” It cannot invent the brand’s voice rules; it can only enforce the ones the voice layer earned from observed failures.^[inferred contrast]
  • One vendor caveat crosses the stack: Promptfoo agreed to be acquired by OpenAI in March 2026. It stays open-source and multi-provider today, but a marketing stack that evaluates Claude output against competitor models should treat that neutrality as time-limited.

Try It

  1. Build the voice layer for one recurring asset (e.g. a client’s social posts): pull 3 real writing samples, extract phrases/patterns/one real story into <examples>, and write a 5-rule ban list with at least one structural pattern.
  2. Turn the ban list into assertions. Each banned phrase becomes a not-contains check — as Console Evaluate test cases if you want zero setup, or a promptfooconfig.yaml if you want it in git and CI.
  3. Package it as a SKILL.md. Clone ericosiu/ai-marketing-skills and copy the shape of one category — Content Ops (expert panel → 90+ quality gate) is the closest reference for voice-governed content work.
  4. Add one LLM-graded scorer whose rubric is the named reviewer’s pet peeves, verbatim. Keep the deterministic assertions separate from the judgment scorer so failures tell you which layer broke.
  5. Gate it: re-run the suite before any prompt edit ships; if you adopt Braintrust, wire the GitHub Action so a regression blocks the merge.
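Steps 2 and 4 can be sketched together: the ban list as deterministic checks, and the judgment scorer kept in a separate slot so a failure names its layer. This is a plain-Python illustration under stated assumptions, not Promptfoo or Console syntax. The banned phrases, the regexes, the function names, and the minimum score of 4 are hypothetical, and `judge` stands in for whatever LLM-graded scorer returns a 1–5 grade.

```python
import re

# Hypothetical ban list: phrases plus structural patterns.
BANNED_PHRASES = ["game-changer", "unlock the power"]
STRUCTURAL_PATTERNS = {
    # "No X. No Y. Just Z." fragments
    "no-no-just fragments": re.compile(
        r"\bNo [^.!?]+\. No [^.!?]+\. Just [^.!?]+\."
    ),
    # First sentence of the draft ends in a question mark.
    "rhetorical-question opener": re.compile(r"^\s*[^.!?]*\?"),
}


def check_ban_list(text):
    """Deterministic layer: return one failure string per violated rule."""
    failures = []
    lowered = text.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            failures.append(f"banned phrase: {phrase}")
    for name, pattern in STRUCTURAL_PATTERNS.items():
        if pattern.search(text):
            failures.append(f"structural pattern: {name}")
    return failures


def evaluate(text, judge, min_score=4):
    """Run both layers and report failures per layer.

    judge: callable taking the draft and returning an int from 1 to 5.
    It is a stand-in for an LLM-graded scorer; nothing here calls a model.
    """
    score = judge(text)
    judgment = [] if score >= min_score else [f"reviewer score {score} < {min_score}"]
    return {"deterministic": check_ban_list(text), "judgment": judgment}


draft = "Tired of slow reports? No setup. No fees. Just results."
report = evaluate(draft, judge=lambda text: 2)  # stub judge for illustration
for layer, failures in report.items():
    print(layer, failures)
```

Keeping the two result lists separate is the point of step 4: an empty deterministic list with a judgment failure means the ban list needs a new rule or the draft is off-voice in a way no rule covers yet.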

Open Questions

  • Can per-brand eval suites detect voice bleed? The voice article says bleed is a session-level failure with a procedural fix (one identity per session); whether a per-brand regression suite would catch bleed after the fact is untested in any source.^[inferred]
  • No source documents a marketing team running all three layers end-to-end. Each layer is production-tested on its own; the composed pipeline is a projection, not a case study.