Source: ai-research/anthropic-console-prompting-tools-2026-07-02.md, ai-research/braintrust-vs-promptfoo-2026-07-02.md, ai-research/augmentcode-ai-agent-evaluation-tools-2026-07-02.md

“How do teams actually measure whether a prompt got better?” has three real answers as of mid-2026: build and iterate inside the Anthropic Console’s own Evaluate tool, run Promptfoo locally or in CI, or run Braintrust as a full evaluation-and-observability platform. They are not interchangeable: each targets a different problem (blank-page prompt drafting, security red-teaming, or production regression control). One of the three, Promptfoo, also changed hands in a way that matters for anyone picking a vendor-neutral tool.

Key Takeaways

  • “Anthropic Evals” in practice means the Console’s built-in Evaluate feature — test-case generation (manual, CSV import, or Claude-generated), one-click run-all, side-by-side prompt-version comparison, and subject-matter-expert 1–5 grading, sitting right next to the Console’s Prompt Generator and Prompt Improver. It’s zero-setup for teams already on the Claude Console, but it’s single-provider (Claude only) and not built for production monitoring or CI/CD gating.
  • Promptfoo is CLI-first, open-source, YAML-based, and red-teaming-heavy: 142 security plugins spanning OWASP LLM Top 10, OWASP Agentic, NIST AI RMF, and MITRE ATLAS. It runs fully local or in CI and supports 60+ model providers (including Anthropic). In March 2026, Promptfoo agreed to be acquired by OpenAI. It will reportedly stay open-source under its current license, but ownership by a rival lab is a real neutrality concern for teams evaluating non-OpenAI models long-term.
  • Braintrust is a full evaluation-and-observability platform, not just a test runner: production traces, eval datasets, 25+ built-in scorers, LLM-as-judge, human review, and native CI/CD merge-blocking all share one data layer, so a bad production response can become a permanent regression test in one click. It’s commercial SaaS ($249/mo Pro tier after a free Starter tier), with a self-hosted Enterprise option.
  • The three tools sit at different points on the “testing vs. evaluation” spectrum. Testing is binary pass/fail against a fixed input (“this prompt broke”). Evaluation scores output quality across dimensions with numeric metrics over time (“this prompt scores 0.72 on faithfulness, down from 0.85 last week”). The Console’s Evaluate tool and Promptfoo sit closer to testing; Braintrust sits closer to continuous evaluation.
  • They compose rather than compete in practice. A common real-world pattern (noted directly in Braintrust’s own comparison) is running Promptfoo for red-teaming/security tests locally or in CI while running Braintrust for production tracing, evaluation, and release gating — using each tool for the slice it’s actually built for.
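The testing-versus-evaluation distinction in the takeaways above can be made concrete with a minimal Python sketch. Nothing here comes from any of the three tools: the function names, the 0.02 tolerance, and the per-case faithfulness scores are all hypothetical, chosen only so the averages match the 0.85 and 0.72 figures quoted above.

```python
from statistics import mean


def binary_test(output: str, expected_substring: str) -> bool:
    """Testing: one fixed input, one pass/fail verdict."""
    return expected_substring.lower() in output.lower()


def evaluation_score(case_scores: list[float]) -> float:
    """Evaluation: aggregate a numeric quality metric across many cases."""
    return round(mean(case_scores), 2)


def regressed(previous: float, current: float, tolerance: float = 0.02) -> bool:
    """Flag a drop larger than `tolerance` between two evaluation runs."""
    return (previous - current) > tolerance


# Hypothetical per-case faithfulness scores for two weekly runs.
last_week = [0.90, 0.80, 0.85]
this_week = [0.70, 0.74, 0.72]

passed = binary_test("Refunds are issued within 30 days.", "30 days")
prev_score = evaluation_score(last_week)
curr_score = evaluation_score(this_week)
dropped = regressed(prev_score, curr_score)

print(f"test passed: {passed}")
print(f"faithfulness: {curr_score} (was {prev_score}), regression: {dropped}")
```

The binary test says nothing about quality trends, while the score comparison says nothing about any single input; that gap is why the tools end up at different points on the spectrum.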

The Three Tools

Anthropic Console — Prompt Generator + Prompt Improver + Evaluate

Built directly into console.anthropic.com / platform.claude.com. The workflow runs in four steps:

  1. Describe a task in plain language.
  2. The Prompt Generator produces a first-draft prompt template using Anthropic’s own best practices (XML structuring, chain-of-thought scaffolding, role framing).
  3. The Prompt Improver iterates on an existing prompt for higher accuracy on complex tasks, at the cost of longer, slower responses; Anthropic’s own docs recommend simpler prompts for latency- or cost-sensitive use cases.
  4. The Evaluate tab generates or imports test cases (CSV, manual, or Claude-auto-generated), runs the full suite in one click, and lets you compare two or more prompt versions side by side with human 1–5 grading.

This is the lowest-friction option if you’re only ever testing against Claude and want to go from zero to a working test suite fast, without adopting a third-party platform. It is not designed for production monitoring, CI/CD gating, or cross-provider comparison — those are exactly where Promptfoo and Braintrust specialize.
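To illustrate what a side-by-side comparison with 1–5 grading boils down to, here is a rough Python sketch that ranks prompt versions by mean reviewer grade. It is not the Console's implementation and does not call any Anthropic API; the version labels and grades are made up, and the Console itself may present or aggregate grades differently.

```python
def compare_versions(grades: dict[str, list[int]]) -> list[tuple[str, float]]:
    """Rank prompt versions by mean 1-5 grade, best first."""
    for version, values in grades.items():
        if not values or any(g < 1 or g > 5 for g in values):
            raise ValueError(f"{version}: grades must be non-empty and within 1-5")
    ranked = [
        (version, round(sum(values) / len(values), 2))
        for version, values in grades.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical grades from one reviewer across the same five test cases.
grades = {
    "v1_generated": [3, 4, 2, 3, 3],
    "v2_improved": [4, 5, 4, 3, 4],
}

ranking = compare_versions(grades)
for version, average in ranking:
    print(f"{version}: {average:.2f}")
```

Grading both versions on the same test cases is what makes the averages comparable; a mean over different cases would mix prompt quality with case difficulty.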

Promptfoo — open-source, CLI-first, security-focused

Config lives in promptfooconfig.yaml next to application code; assertions cover both deterministic checks (equals, contains, regex, is-json) and model-graded evaluations (llm-rubric, factuality, context-faithfulness). Runs locally by default — data only leaves the machine for opt-in hosted features (remote grading, sharing, Cloud sync). Its strongest differentiator is red-teaming depth: 142 plugins mapped to OWASP LLM Top 10, OWASP Agentic, NIST AI RMF, MITRE ATLAS, and EU AI Act presets, plus a code-scans command that scans code diffs directly for prompt-injection and PII-exposure risk.
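For readers new to assertion-based checks, the following Python sketch approximates what the four deterministic assertion types named above verify. It is an illustration only, not Promptfoo's code: the `check` helper is hypothetical, and Promptfoo's real semantics (case handling, JSON schema options, and so on) may differ. The model-graded types are left out because they need a grader model call.

```python
import json
import re


def check(assertion_type: str, output: str, value: str = "") -> bool:
    """Approximate a deterministic assertion against one model output."""
    if assertion_type == "equals":
        return output == value
    if assertion_type == "contains":
        return value in output
    if assertion_type == "regex":
        return re.search(value, output) is not None
    if assertion_type == "is-json":
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False
    raise ValueError(f"unsupported assertion type: {assertion_type}")


# Hypothetical model output for a refund-status prompt.
output = '{"status": "refunded", "order_id": "A-1042"}'

results = {
    "is-json": check("is-json", output),
    "contains": check("contains", output, "refunded"),
    "regex": check("regex", output, r'"order_id": "A-\d+"'),
    "equals": check("equals", output, "refunded"),
}
print(results)
```

Deterministic checks like these are cheap and reproducible, which is why they usually run first, with model-graded rubrics reserved for qualities that string matching cannot capture.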

The free/Community tier includes 10,000 red-team probes/month; production self-hosting is explicitly not recommended by Promptfoo’s own docs (local SQLite, no horizontal scaling). Enterprise pricing is custom, sales-gated.

The OpenAI acquisition matters here specifically. Two independent 2026 sources (Augment Code’s tool comparison and a dedicated Promptfoo review site) both flag the same fact: Promptfoo agreed to be acquired by OpenAI in March 2026. The codebase stays open-source and multi-provider support (Anthropic, Google, Meta, Ollama) is unchanged today, but any team choosing Promptfoo specifically because it’s vendor-neutral for evaluating Claude against competitors should treat that neutrality as time-limited, not guaranteed.

Braintrust — evaluation-and-observability platform

Structured around a five-stage workflow (Instrument → Observe → Annotate → Evaluate → Deploy) where production traces and offline evaluation datasets share one underlying data store (Brainstore, purpose-built for AI trace workloads at scale). The signature feature: when a user reports a bad response, an engineer opens the trace and promotes it into the evaluation dataset with one click — production failures become permanent regression tests without manual re-creation. The braintrustdata/eval-action GitHub Action posts pass/fail scorer results directly as PR comments and can block merges below a quality threshold. An AI assistant (“Loop”) can generate scorers from a natural-language description and suggest prompt fixes from failure patterns.
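The two ideas in this paragraph, promoting a bad production trace into a regression case and blocking a merge below a quality threshold, can be sketched in plain Python. This does not use the Braintrust SDK or the GitHub Action; `Trace`, `RegressionSuite`, `gate`, and the threshold values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """A logged production request/response pair (hypothetical shape)."""
    input: str
    output: str
    note: str = ""


@dataclass
class RegressionSuite:
    cases: list[dict] = field(default_factory=list)

    def promote(self, trace: Trace, expected: str) -> None:
        """Turn a reported production trace into a permanent test case."""
        self.cases.append(
            {"input": trace.input, "expected": expected, "origin": trace.note}
        )


def gate(scores: dict[str, float], thresholds: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (ok, failures); a CI job would block the merge when ok is False."""
    failures = [
        f"{name}: {scores.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]
    return (not failures, failures)


suite = RegressionSuite()
bad_trace = Trace("Cancel my order", "I cannot help with that.", note="user report")
suite.promote(bad_trace, expected="Cancellation steps")

ok, failures = gate(
    scores={"faithfulness": 0.72, "relevance": 0.91},
    thresholds={"faithfulness": 0.80, "relevance": 0.85},
)
print("merge allowed" if ok else f"merge blocked: {failures}")
```

In a real pipeline the script would exit non-zero when `ok` is False so the CI system marks the check as failed.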

Free Starter tier: 10K scores, 1 GB processed data, 14-day retention, unlimited users. Paid Pro: $249/month flat, 5 GB data, 50K scores, 30-day retention. Cited production users include Airtable, Vercel, Stripe, Zapier, and Instacart; Notion’s AI team reports going from triaging 3 issues/day to 30/day after adopting Braintrust’s workflow.
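As a quick sizing aid, the tier limits quoted above can be turned into a small lookup. This sketch assumes the published numbers act as hard monthly caps, which the sources do not confirm (overage billing is not covered), and the `fits_tier` helper is hypothetical.

```python
# Limits as quoted above: (name, max scores, max GB processed, retention days).
PUBLISHED_TIERS = [
    ("Starter", 10_000, 1.0, 14),
    ("Pro", 50_000, 5.0, 30),
]


def fits_tier(scores_per_month: int, gb_per_month: float, retention_days: int) -> str:
    """Return the first published tier whose quoted limits cover the workload."""
    for name, max_scores, max_gb, max_retention in PUBLISHED_TIERS:
        if (
            scores_per_month <= max_scores
            and gb_per_month <= max_gb
            and retention_days <= max_retention
        ):
            return name
    # Anything larger falls outside the publicly listed tiers.
    return "beyond published tiers"


print(fits_tier(8_000, 0.5, 14))
print(fits_tier(30_000, 2.0, 30))
print(fits_tier(200_000, 20.0, 90))
```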

Which One For What

Situation | Pick
--- | ---
Iterating on a single Claude prompt, no third-party tool wanted yet | Anthropic Console Evaluate
Security/red-team coverage (OWASP, NIST, MITRE ATLAS) before shipping | Promptfoo
Multi-provider comparison, developer-centric team, config lives in git | Promptfoo
Production traffic needs continuous scoring + drift detection | Braintrust
CI/CD should block a PR when eval scores regress | Braintrust (native) or Promptfoo (custom scripting required)
Turning real production failures into a permanent regression suite | Braintrust
Team includes non-engineers (PMs, domain reviewers) who need to grade outputs | Braintrust (unlimited users on every tier) or Anthropic Console (SME 1–5 grading); Promptfoo’s PM/reviewer collaboration is Enterprise-tier only
Long-term vendor neutrality for multi-model evaluation matters | Weigh Promptfoo’s OpenAI ownership; Braintrust and the Anthropic Console don’t have that specific conflict (though the Console is naturally Claude-only)

Try It

  1. If you’ve never built a prompt eval, start in the Anthropic Console. Use the Prompt Generator on a real task, then click into Evaluate and let Claude auto-generate 5–10 test cases. This is the lowest-friction way to see what “scoring a prompt” actually looks like before adopting a heavier tool.
  2. If you’re shipping anything with untrusted user input, run a Promptfoo red-team pass before launch — the OWASP/MITRE ATLAS presets cover prompt injection, jailbreaking, and PII exposure without you having to write the attack prompts yourself.
  3. If you already have a production Claude feature and no regression coverage, look at Braintrust’s free Starter tier before building custom eval infrastructure — the one-click “trace becomes a regression test” workflow is the highest-leverage habit for a small team.

Open Questions

  • Does Anthropic’s Console Evaluate tool support any form of CI/CD integration (e.g., a GitHub Action equivalent to Braintrust’s), or is it strictly an interactive Console workflow? Not addressed in the sources checked.
  • What does Braintrust’s or Promptfoo’s pricing look like at meaningfully higher volume (beyond the entry Pro/paid tiers cited here)? Both push to “contact sales” past the published tiers — no public data found.