Source: ai-research/anthropic-self-service-data-analytics-2026-06-04.md (official Claude Blog, claude.com/blog/how-anthropic-enables-self-service-data-analytics-with-claude, by Chen Chang, Clement Peng, Justin Leder, Johanne Jiao, Josh Cherry — Anthropic Data Science & Data Engineering); announced via @ClaudeDevs + @_catwu (raw/x-bookmarks-recent-digest-2026-06-04.md)
Anthropic’s data team explains how it automated 95% of internal business analytics queries with Claude at ~95% aggregate accuracy — and, more usefully, publishes the playbook: a four-layer agentic data stack (data foundations → sources of truth → skills → validation), the pairwise skill pattern that lifted eval accuracy from 21% to 95%+, and the maintenance discipline that stopped accuracy from rotting. It is the most concrete first-party guide yet for anyone building a “ask the data warehouse in plain English” agent.
Key Takeaways
- Headline numbers: 95% of business analytics queries automated, ~95% accuracy in aggregate (regularly ~99% in some domains). The data science team’s freed-up time goes to causal modeling, forecasting, and ML.
- “Data is not software.” Coding agents work in an open-ended solution space where docs and tests are natural hallucination guardrails. Analytics has one correct answer from one correct source with no deterministic way to prove correctness — the failure surface is the data’s ambiguity, not the model’s coding ability.
- The central problem: map a user’s question to specific, up-to-date entities in the data model and know the correct way to work with them. Get that right and “the resulting execution and SQL becomes trivial.” Three failure modes account for most wrong answers: ambiguity (forty plausible “revenue” datasets instead of one governed one), discoverability (the right answer exists but isn’t found), and staleness (docs describe yesterday’s schema).
- Skills are the leverage point: 21% → 95%+ eval accuracy. Without skills, Claude couldn’t exceed 21% on Anthropic’s analytics evals. With them: consistently above 95%. No other single layer moves the number like that.
- The pairwise skill pattern: a knowledge skill (thin top-level router — “try the semantic layer first; if no coverage, here are ~30 curated reference files for this domain”) + an unbook skill (the senior-analyst procedure: clarify → find sources → query → loop through adversarial review sub-agents, plus ~a dozen reusable analysis patterns like retention curves, rate decomposition, funnel analysis).
- Skill maintenance is an engineering problem, not a writing problem. Accuracy drifted ~95% → ~65% in one month when skill docs went unmaintained. Fix: colocate skill markdown in the same repo as the transformation models — the PR that changes a model updates the doc; a code-review hook flags reporting-model changes that don’t touch a skill file. ~90% of data-model PRs now include a skill change.
- One canonical skill, every surface. On merge, skills auto-sync to a plugin marketplace (IDE), cloud-storage blobs (hosted apps), and MCP resources — same answer in Slack, IDE, dashboards, and agent sessions.
- Evals before elaborateness. Two offline-eval families: dashboard-based (auto-generated by Claude, human-validated, covering common stakeholder questions) and long-tail (Claude generates plausible questions from roadmaps/table docs). Every stakeholder correction in a thread is harvested as a candidate eval.
- Ablations replace arguments. Every structural decision (expose this source? does a sub-agent earn its latency? merge two skills?) is made by holding the eval set fixed, varying one component, and comparing pass rates — each run takes about an hour.
- The unsolved failure mode is the silent one — a wrong-but-plausible answer used without objection. Mitigations: a provenance footer on every answer, human sign-off on anything leadership-bound, and a daily standing eval of each domain’s top KPIs against the blessed dashboard. Anthropic says outright it has no robust solution yet.
The four-layer stack
| Layer | What it is | Failure mode it attacks |
|---|---|---|
| Data foundations | Models, transforms, tests, tables + metadata; dimensional modeling, shift-left testing, freshness checks | Ambiguity (“revenue” resolves to one governed dataset) + first staleness defense |
| Sources of truth | Reference surfaces the agent consults to navigate the warehouse (in descending trust order) | Concept↔entity ambiguity (“weekly active users” → a governed entity) |
| Skills | Procedural knowledge: which sources, what order, what done looks like | Discoverability + process quality |
| Validation | Offline evals, ablations, online checks | Finds which failure mode still leaks |
A key reframe for data teams: the end user of the data model is no longer a data expert — it’s agents acting for users who can’t validate correctness themselves. Documentation quality stops being a nice-to-have and becomes the accuracy ceiling; Claude helps close the gap (drafting column descriptions, proposing metric docs, flagging undocumented models in CI) but curation stays human-owned.
The skill skeletons (the reusable artifact)
The appendix publishes two production skeletons worth copying:
- Warehouse skill skeleton — the real structure of Anthropic’s main analytics skill: trigger-scoped
description(IF…THEN…DO NOT), a mandatory semantic-layer-first workflow with pre-rebutted excuses agents use to bail to raw SQL, date-window/timezone conventions decided before querying, entity-disambiguation rules, mandatory adversarial SQL review by asql-reviewersub-agent (“do not self-certify”), PII protection (return the SQL for the user to run, not the results), and a provenance footer on every answer (source · confidence · reviewer · freshness · owner). - Reference-doc skeleton — per-domain table docs written for LLM retrieval: grain, scope/exclusions, when-NOT-to-use, gotchas (“the wrong-answer modes a senior analyst would warn you about”), and explicit routing triggers — without prescriptive query recipes that go stale.
Why it matters
- It’s the analytics counterpart to the coding-agent playbooks. The same primitives the wiki tracks for engineering (skills, sub-agents, adversarial verification, evals) applied to business questions — with first-party numbers showing where the leverage is (skills) and where it isn’t (raw warehouse access).
- The maintenance finding generalizes. “Docs describing a daily-changing system are wrong within weeks” + “colocate the doc with the thing it describes + hook-enforce the pairing” applies to any skill library, including this wiki’s own pattern and the Anthropic engineers’ skill rules.
- For agencies: the getting-started floor is low — “a handful of canonical datasets, a few dozen offline evals, and a thin knowledge skill will capture most of the upside.” That is a realistic scope for a marketing-agency reporting stack (GA4/GSC/CRM exports) without a data-engineering team.
Try It
- Pick one reporting domain (e.g., client-campaign performance) and write the thin knowledge skill: where the governed data lives, ~10-30 reference notes on tables/exports, and what “done” looks like.
- Write a few dozen offline evals as question/answer pairs from real stakeholder asks; add every correction you make to the agent’s answers as a new eval.
- Steal the appendix skeletons: semantic-layer-first (or “blessed export first”), adversarial review before final answers, and the provenance footer (source · confidence · freshness · owner) on every result.
- When you change the data model or export schema, change the skill doc in the same commit — and consider a hook that flags when you don’t.
Related
- Anthropic Engineers’ Four Skill Rules
- Agent Skills Overview
- Picking the Right Model with Evals
- Dynamic Workflows in Claude Code — the same adversarial-verification primitive, generalized
- Zero Trust for AI Agents — sibling Claude Blog operational guide
- How We Contain Claude — the Anthropic-internal-practices article family
- ant — the Claude Platform CLI
Open Questions
- Several in-page lists shipped as images and didn’t survive extraction: the three failure-mode definitions (named in prose elsewhere), the four sources-of-truth surfaces in descending trust order, the data-foundations practice list, and the online-validation step list. Re-extract with image understanding or revisit the page if those specifics are needed.
- The blog doesn’t name the warehouse/semantic-layer vendor stack (deliberately genericized in the skeleton).
- “Dashboard-based evals auto-generated by Claude” — cadence and human-validation workload aren’t quantified.