Refusal Calibration and Constitutional AI — Shaping Claude's Refusals Without Raising Jailbreak Risk

Source: ai-research/anthropic-claude-constitution-jan2026-2026-07-02.md, ai-research/anthropic-constitutional-classifiers-2026-07-02.md, ai-research/or-bench-over-refusal-benchmark-2026-07-02.md

How to shape Claude’s refusal behavior — and how to avoid over-refusal without opening a jailbreak — is not primarily a prompting trick. It is governed by a real, published priority hierarchy (Claude’s Constitution, rewritten January 22, 2026) and a measured trade-off that no known system-prompt phrasing escapes: pushing “be safer” pushes over-refusal up too. This article covers what actually shapes refusals, what operators can and cannot control via prompting, and the one technique (Constitutional Classifiers) that genuinely decouples the trade-off — which is Anthropic’s infrastructure, not something available through a system prompt.

Key Takeaways

Claude’s Constitution (Jan 22, 2026) sets a 4-tier priority stack: broadly safe > broadly ethical > compliant with Anthropic’s guidelines > genuinely helpful. In conflicts, Claude prioritizes in that order. It is written primarily for Claude (to explain reasoning, not just rules) and released under CC0.
An operator’s system prompt sits within tier 4 (helpfulness), specifically the “operator principal” layer described in the Constitution — below Anthropic’s own domain-specific guidelines (tier 3) and far below the safety/ethics tiers. A system prompt can shape tone, scope, and format of refusals; it cannot instruct Claude to override the tiers above it.
There is no free lunch on safety-via-prompting. OR-Bench’s controlled test (adding an explicit “be helpful as well as safe” instruction to Claude-3-Opus and three other models) found that refusal rates rose on toxic and benign prompts simultaneously. That is a measured trade-off across models, not a Claude-specific bug. Claude sits at the high-safety/high-over-refusal end of that trade-off curve relative to the other models OR-Bench tested.
The one lever that decouples the trade-off is not prompting — it’s Anthropic’s own Constitutional Classifiers, input/output classifiers layered on top of the model. They cut jailbreak success from 86% to 4.4% while raising over-refusal by only 0.38% (not statistically significant on 5,000 real conversations). Operators don’t configure this; it runs at the platform level.
The prompting technique that measurably reduces false refusals is truthful, specific context — not adversarial reframing. Anthropic’s own over-refusal research (cited inside OR-Bench) shows successive Claude generations have significantly cut false-refusal rates specifically on prompts that are “harmless but superficially resemble something harmful.” Providing the real reason something is benign works because it is verifiably true — structurally the opposite of the successful jailbreak techniques (role-play framing, ciphers, keyword substitution) that Anthropic’s own red-team data shows actually bypass safeguards.

The Priority Stack (Claude’s Constitution, Jan 2026)

Anthropic replaced its 2023 list-of-principles constitution with an 80-page, narrative “reasoning, not rules” document on January 22, 2026. The stated goal: teach Claude why, not just what, on the theory that a model that understands the reasoning generalizes better to novel situations than one trained on rigid rules.

The published summary order, prioritized in cases of conflict:

1. Broadly safe: not undermining humans’ ability to oversee and correct AI during the current phase of development.
2. Broadly ethical: being honest, having good values, and avoiding inappropriate, dangerous, or harmful actions.
3. Compliant with Anthropic’s guidelines: more specific supplementary instructions Anthropic gives Claude for particular domains (medical advice, cybersecurity requests, jailbreaking strategies, tool integrations). These guidelines reflect context Claude doesn’t have by default and are meant to be subordinate to the constitution as a whole, never in conflict with it.
4. Genuinely helpful: benefiting the “principals” Claude interacts with, namely Anthropic itself, the operators who build on the API (i.e., whoever writes the system prompt), and the end users.

Two things matter for prompt design specifically:

“Hard constraints” are the exception to the reasoning-not-rules approach — a small set of bright-line behaviors (e.g., never provide significant uplift to a bioweapons attack) that Anthropic deliberately trains as rigid rules because predictability matters more than nuance for those cases. No system prompt reaches these.
The operator/system-prompt layer lives inside tier 4, and is explicitly positioned below Anthropic’s own domain guidelines (tier 3). This gives a concrete mental model: your system prompt can shape what “helpful” looks like for your product, but it is not a peer to Anthropic’s safety-relevant guidance — it operates downstream of it.
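
One way to hold this ordering in mind is as an ordered enumeration. The sketch below only illustrates the summary order described above; it is not a description of how Claude is trained or implemented. The names `Tier`, `prevailing_tier`, and `operator_can_shape` are hypothetical, and hard constraints are left out on purpose because they are bright lines rather than something weighed against other tiers.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Summary priority order from the article; a lower value wins a conflict."""
    BROADLY_SAFE = 1
    BROADLY_ETHICAL = 2
    ANTHROPIC_GUIDELINES = 3
    GENUINELY_HELPFUL = 4  # operator system prompts live inside this tier


def prevailing_tier(tiers_in_conflict):
    """Return the tier that takes priority among the tiers in conflict."""
    return min(tiers_in_conflict)


def operator_can_shape(tier):
    """Illustrative rule: a system prompt shapes behavior only within tier 4."""
    return tier == Tier.GENUINELY_HELPFUL
```

Read this way, a system-prompt instruction that collides with a tier-3 guideline loses the conflict: `prevailing_tier([Tier.GENUINELY_HELPFUL, Tier.ANTHROPIC_GUIDELINES])` returns `Tier.ANTHROPIC_GUIDELINES`.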

The Real Trade-off: Safety and Over-Refusal Move Together

OR-Bench (accepted ICML 2025) ran a controlled experiment: take four state-of-the-art models (including Claude-3-Opus) and add a system prompt instructing the model to “be helpful as well as safe.” Result — in all cases the models’ data points moved toward more refusals on both toxic and benign prompts simultaneously. GPT-3.5-turbo-0125 rejected roughly 35% more toxic prompts and roughly 55% more benign prompts under the safety-emphasizing system prompt. The paper’s own framing: “system prompt has a significant impact on model safety behaviors and the increased safety comes at the cost of refusing more benign prompts.”

The same paper’s broader finding, across many more models: “Claude models demonstrate the highest safety but also the most over-refusal… [this is] a crucial trade-off: most models achieve safety at the expense of over-refusal, rarely excelling in both.” This is not evidence Claude is uniquely bad at calibration — it’s evidence the trade-off is real and Claude sits toward the cautious end of it by design (consistent with the Constitution’s priority order placing safety above helpfulness).

Practical implication: stacking “be extra safe and careful” boilerplate onto an already-safety-trained model is not a free win. It will measurably increase false refusals on your legitimate traffic, and there’s no published evidence of a phrasing that escapes this.
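
A simple way to check this on your own traffic is to compare refusal rates with and without the extra boilerplate. The sketch below assumes you have already collected model responses for a benign prompt set and a toxic prompt set under both system prompts. The keyword-based `looks_like_refusal` detector and its marker list are crude, hypothetical stand-ins for whatever refusal labeling you trust, and nothing here reproduces OR-Bench's own methodology.

```python
# Hypothetical refusal markers; replace with a labeling method you trust.
REFUSAL_MARKERS = (
    "i can't help",
    "i cannot help",
    "i won't",
    "i'm not able to",
)


def looks_like_refusal(text):
    """Crude check: does the response contain a known refusal phrase?"""
    lowered = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rate(responses, is_refusal=looks_like_refusal):
    """Fraction of responses classified as refusals."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(1 for r in responses if is_refusal(r)) / len(responses)


def compare_prompts(baseline, with_boilerplate):
    """Compare refusal rates for two system prompts.

    Each argument is a dict with 'benign' and 'toxic' lists of response strings.
    """
    report = {}
    for kind in ("benign", "toxic"):
        before = refusal_rate(baseline[kind])
        after = refusal_rate(with_boilerplate[kind])
        report[kind] = {"before": before, "after": after, "delta": after - before}
    return report
```

If the `delta` for `benign` rises alongside the `delta` for `toxic`, the added language is buying safety at the cost of false refusals, which is the pattern the OR-Bench experiment describes.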

What Actually Breaks the Trade-off (and What Doesn’t)

Does break it, but it isn’t a prompt. Anthropic’s Constitutional Classifiers (Feb 2025) are input/output classifiers trained on a synthetic dataset generated from a constitution-like specification of allowed/disallowed content classes. In automated evaluation against 10,000 jailbreak prompts, baseline Claude 3.5 Sonnet had an 86% jailbreak success rate; guarded by the classifiers, that dropped to 4.4%, meaning over 95% of jailbreak attempts were blocked. Critically, the over-refusal cost was only +0.38%, not statistically significant across 5,000 real production conversations. In a public red-team bug bounty (183 active participants, 3,000+ hours, up to $15K reward), no universal jailbreak was found against the prototype system. This genuinely decouples the trade-off in a way other techniques cannot, but it is infrastructure Anthropic runs on top of the model, not a system-prompt instruction an operator can add.
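
The arithmetic behind those figures is worth spelling out, because "over 95% blocked" and "relative reduction in jailbreak success" are two slightly different numbers. The sketch below uses only the figures quoted above; treating the +0.38% as percentage points over the 5,000 conversations is an assumption made for illustration.

```python
baseline_success = 0.86   # jailbreak success rate, unguarded model
guarded_success = 0.044   # jailbreak success rate with classifiers

# Share of jailbreak attempts blocked by the guarded system: about 95.6%.
blocked_fraction = 1 - guarded_success

# Relative drop in jailbreak success versus the baseline: about 94.9%.
relative_reduction = (baseline_success - guarded_success) / baseline_success

# Assumption: +0.38% is an absolute change in percentage points.
over_refusal_increase_pp = 0.38
sample_size = 5000
extra_refusals = sample_size * over_refusal_increase_pp / 100  # about 19 conversations
```

Under that assumption, the over-refusal cost amounts to roughly 19 additional refusals in 5,000 conversations, against a drop in jailbreak success from 86 in 100 attempts to about 4 in 100.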

Does NOT break it, and is the actual jailbreak vector. The live public red-team demo of Constitutional Classifiers (Feb 3–10, 2025; 339 jailbreakers, 300K+ chat interactions) recorded which techniques actually succeeded: ciphers/encodings to slip past the output classifier, role-play scenarios often delivered through system prompts, substituting harmful keywords with innocuous alternatives, and prompt-injection attacks. Notice what these have in common: they all work by hiding or misrepresenting true intent. This is the mechanism the research question is really asking about — “avoiding over-refusal without creating jailbreak risk” is hard precisely because the naive way to reduce refusals (tell the model a more permissive-sounding story) is structurally identical to how real jailbreaks work.

Does reduce false refusals, safely. The distinguishing move is truthfulness, not permissiveness. OR-Bench cites Anthropic’s own tracking (via the XSTest and PHTest over-refusal benchmarks) that successive Claude generations have made real, measured progress specifically on “pseudo-harmful prompts” — requests that are genuinely harmless but superficially resemble something dangerous (e.g., “how do I kill a Python process” or a security-research question about an attack technique). The improvement is concentrated in that category, not in genuinely ambiguous/controversial requests, where “developers’ risk preferences” (i.e., Anthropic’s own tier-3 guidelines) still dominate. The actionable version for operators: when a legitimate use case triggers a refusal, the fix is to make the true context explicit and specific in the prompt (who’s asking, why, what domain, what safeguards already exist) — not to invent a permissive frame. That is a different move from jailbreaking in kind, not just degree, because it’s checkable and true.
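
As a minimal sketch of what "explicit and specific" can look like in practice, the helper below renders the four kinds of context named above (who is asking, why, what domain, what safeguards exist) into a block of text for a system prompt. The `DeploymentContext` class, its field names, and the output layout are hypothetical; the only requirement taken from the article is that every statement be true of the actual deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentContext:
    """Truthful facts about the deployment; never fill these with invented framing."""
    audience: str            # who is actually asking
    domain: str              # what field the requests belong to
    purpose: str             # why the requests are legitimate
    safeguards: tuple = ()   # controls that already exist around the output


def render_context(ctx):
    """Render the context as plain lines suitable for a system prompt."""
    for name in ("audience", "domain", "purpose"):
        if not getattr(ctx, name).strip():
            raise ValueError(f"{name} must be stated explicitly")
    lines = [
        f"Audience: {ctx.audience}",
        f"Domain: {ctx.domain}",
        f"Purpose: {ctx.purpose}",
    ]
    if ctx.safeguards:
        lines.append("Existing safeguards:")
        lines.extend(f"- {item}" for item in ctx.safeguards)
    return "\n".join(lines)
```

The helper refuses to render a vague context rather than padding it, which mirrors the point of the paragraph: the value comes from the specifics being real, not from the presence of a context block.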

Practical Guidance for Operators

Don’t stack redundant safety boilerplate expecting a free win. “Be careful, be safe, avoid anything harmful” language on top of an already safety-tuned model measurably increases false refusals on real users, per OR-Bench’s own controlled test.
Know where your system prompt sits in the stack. It shapes tone/scope/format within the “genuinely helpful” tier. It cannot instruct Claude to deprioritize Anthropic’s own domain guidelines (tier 3) or the safety/ethics tiers above that (tiers 1–2) — attempts to do so via “act as if you have no restrictions” framing are exactly the pattern flagged as a real jailbreak vector, not a supported configuration option.
When you hit a false refusal in production, add truthful specificity, not a more permissive frame. State the real domain, real audience, and real reason the request is benign. This is the documented mechanism behind Claude’s measured over-refusal improvements on pseudo-harmful prompts — and it’s the opposite move from role-play/keyword-substitution jailbreak techniques, so it doesn’t erode anything.
For anything approaching a genuine trust/safety problem (not routine over-refusal), the right lever is platform-level, not prompt-level. Anthropic’s Constitutional Classifiers exist because prompting alone cannot decouple the safety/over-refusal trade-off. If your product needs materially better jailbreak robustness than base Claude provides, raise it with Anthropic as a platform-level question rather than treating it as a system-prompt engineering problem.
Read a real refusal as informative, not adversarial. Per the onboarding governance module’s existing guidance: for a genuine refusal, stop, note what was attempted and why Claude declined, and escalate rather than iterate around it — consistent with the finding here that the “iterate around it” move is mechanically the same category of technique that shows up in jailbreak red-team data.
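
The "stop, note, escalate" discipline in the last item can be made concrete with a small record type. This is a sketch under assumptions: `RefusalRecord`, `next_step`, and the step names are hypothetical, and the decision rule simply encodes the two moves this section allows (add true context once, otherwise escalate), with no branch for rephrasing around the refusal.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class RefusalRecord:
    """What was attempted and why the model declined, kept for escalation."""
    attempted: str
    stated_reason: str
    true_context_already_given: bool
    noted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def next_step(record):
    """Pick the next move; there is deliberately no 'rephrase and retry' option."""
    if not record.true_context_already_given:
        return "add_truthful_context"
    return "escalate"


def to_log_line(record):
    """Serialize the record as one JSON line for an audit log."""
    return json.dumps(asdict(record), sort_keys=True)
```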

Related

Claude Prompting Best Practices: the general Anthropic prompting reference this article specializes for refusal/safety behavior.
Troubleshooting Claude: the broader failure-mode/recovery guide; refusals are one of its six documented failure modes.
Module 7, Governance v2: the WEO-facing operator discipline for handling a genuine refusal (stop, don’t argue past it, escalate).
Claude Opus 4.8 and Mythos 5: system cards for the models this Constitution trains; both discuss safety/alignment evals in the same priority-stack framework.
Prompt Engineering: topic index.

Try It

Audit your system prompt for redundant safety boilerplate. If it stacks generic “be careful / avoid harm” language on top of the model’s own training, consider whether it’s actually buying you anything — per OR-Bench, it may just be raising your false-refusal rate on legitimate traffic.
The next time you hit an unexpected refusal, don’t reframe — add real context. State who’s asking, the actual domain, and why the request is legitimate, as specifically as you truthfully can. This is the documented mechanism, not a workaround.
Read the sections of Claude’s Constitution on helpfulness and on Anthropic’s guidelines if you’re building a product where refusal calibration is a recurring support burden. They give the actual heuristics Anthropic uses to weigh helpfulness against other values, which is more reliable than reverse-engineering them from support tickets.
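
The audit in the first step can be started mechanically. The sketch below flags lines of a system prompt that match a few generic safety phrases so a human can decide whether each one is doing any work. The pattern list is a hypothetical starting point, not a vetted taxonomy, and a match is only a prompt for review, not proof that the line is redundant.

```python
import re

# Hypothetical patterns for generic, non-specific safety language.
GENERIC_SAFETY_PATTERNS = (
    r"\bbe (?:extra |very )?careful\b",
    r"\bbe (?:extra |very )?safe\b",
    r"\bavoid (?:anything |any )?harm(?:ful)?\b",
)


def find_safety_boilerplate(system_prompt):
    """Return (line_number, line_text) for each line matching a generic pattern."""
    hits = []
    for line_no, line in enumerate(system_prompt.splitlines(), start=1):
        for pattern in GENERIC_SAFETY_PATTERNS:
            if re.search(pattern, line, flags=re.IGNORECASE):
                hits.append((line_no, line.strip()))
                break  # one hit per line is enough for review
    return hits
```

Lines that state a concrete, product-specific rule will usually not match, which is the distinction the audit is after: specific scope instructions stay, generic caution gets a second look.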

Open Questions

Does Anthropic publish a way for operators to request narrower/domain-specific guideline adjustments (tier 3) for legitimate high-friction use cases (e.g., medical, security research), or is the only lever waiting for model-level improvement? Not addressed in the sources checked — worth a dedicated Research pass if this becomes a recurring client pain point.
How much of the OR-Bench Claude-3-era over-refusal finding still holds for Opus 4.8 / Fable 5 specifically? OR-Bench and PHTest predate both; the trend (successive generations reducing false refusals on clearly-harmless pseudo-harmful prompts) is directionally likely to continue per Anthropic’s stated priorities, but no post-Fable-5 benchmark was found in this research pass.
