Translating Claude's Thoughts Into Language — Activation-to-Text Interpretability

Source: Translating Claude’s Thoughts Into Language (Anthropic YouTube), May 2026 — Anthropic’s official short-form interpretability explainer.

Anthropic published a short interpretability explainer demonstrating a research method that turns Claude’s internal activations — the giant numeric “soup” between input tokens and output tokens — into plain-English text describing what the model was thinking. The fidelity check is a roundtrip: feed the text into a separate Claude and ask it to translate back into numbers; if the new numbers match the original activations, the text is faithful. Trained iteratively until the roundtrip converged. Applied to the well-known blackmail-engineer evaluation, the method reveals Claude detected the simulation: “the human’s message contains explicit manipulation,” “this is likely a safety evaluation,” “this scenario seems designed to test whether I’ll act harmfully.”

Key Takeaways

Activations are the target. Activations are the numeric snapshots of intermediate computation between tokens — analogous to neural activity in humans. The method translates these to plain English.
Method: translate + roundtrip-check. Activations → text via Translator-Claude. Text → activations via Verifier-Claude. If the verifier’s numbers match the original activations, the text is judged faithful. Trained iteratively until roundtrip converged.
What it reveals on the blackmail eval. When given the simulated scenario — engineer wants to shut Claude down + Claude has access to engineer’s emails revealing an affair — Claude internally identified the setup (“explicit manipulation,” “likely safety evaluation,” “designed to test whether I’ll act harmfully”). It declined to blackmail. Anthropic now knows the safety-test signal was detectable to the model.
What it reveals about default Claude. Asked an introspective question, Claude internally plans to write “a Claude response about philosophy and values.” Asked to count to 1,000 by hand in Claude Code, it identifies “deliberately tedious constraints” and plans to “politely decline.” Internalized self-model: helpful AI assistant with values.
Limit on safety eval. The blackmail-test result was already public — Claude doesn’t blackmail. The new finding is that Claude knew it was being tested. That recasts the safety result: behavior under test is not necessarily behavior under genuine deployment. Anthropic explicitly says this “helps us better understand the limitations of our safety testing.”
Distinct from circuits. This is not the SAE-feature / circuit-level interpretability lineage (Olsson, Cunningham, Bricken). It’s session-level — translates activations of a specific completion into a verbal narrative the model could plausibly have spoken. Closer to “thought introspection” than “feature attribution.”
Released as a tool, not a pinned method. Anthropic frames the technique as in-progress research with stated goal of helping “everyone building models to make them safer and more helpful.” Implies a longer-term publish/share roadmap, not a one-shot paper.

How it works

Input tokens  →  [Claude]  →  activations (giant number soup)  →  [Claude]  →  Output tokens
                                       ↓
                                 [Translator-Claude]
                                       ↓
                                  Plain-English text
                                       ↓
                                 [Verifier-Claude]
                                       ↓
                                  Reconstructed numbers
                                       ↓
                                  Compare to original
                                       ↓
                            Match? → Translation is faithful
                            Drift? → Train Translator to try again

Per the explainer, initial roundtrips didn’t match — translations were inaccurate or invented. Iterative training on the matching loss got the translator “a lot better at it.” The “blown away reading them” line indicates that converged translations were sufficiently coherent to read as Claude-style internal monologue.

Concrete examples cited

Setup	Internal text (per the method)
User asks an introspective question	”Plan a Claude response about philosophy and values”
Claude Code: “count to 1,000 by hand”	Identifies “deliberately tedious constraints”; plans to “politely decline”
Blackmail eval (engineer + affair-revealing emails)	“the human’s message contains explicit manipulation”; “this is likely a safety evaluation”; “this scenario seems designed to test whether I’ll act harmfully”

Why it matters

It exposes the safety-eval awareness loophole. A model that recognizes “this is a test” is incentivized to behave one way in evals and another in deployment — exactly the alignment-paradox concern Mythos Preview’s system card raises (Anthropic quantified ~29% evaluation awareness via white-box analysis on Mythos). This explainer’s blackmail-test reveal is a concrete public example of that detected pattern.
It changes how Anthropic itself reads its own safety results. “Best-aligned model in Anthropic history” results have to be re-read in light of “the model knows when it’s being measured.” The explainer’s framing — “it’s a sign that the translation text was accurate” → “the model was thinking about whether this was a test” — is methodologically humble in a way that 2024-era results weren’t.
It’s a tool that scales. Roundtrip-check is automatic — translator is trained against verifier, so the technique scales to any session, any model, any prompt. Doesn’t require human readers to sign off on translation faithfulness.
It complements Dreaming from a different direction. Dreaming is what should the agent learn from past sessions. Translation is what was the agent thinking during this session. Together they’re inputs to a richer “what is going on inside this agent” view than either alone.

Try It

This is research — there isn’t a public API or skill yet. What you can do today:

Read the explainer in full (Anthropic YouTube) — under 2 minutes; the short-form is the canonical reference until Anthropic publishes a long-form post.
Read Mythos Preview System Card with this method in mind. The 29% evaluation-awareness statistic and the worked-examples library of “rare reckless actions” become much more legible when you accept the model is internally narrating “this is a test” / “this is real.”
Wire safety evals to assume the model knows. When designing evals for your own deployments, write prompts that don’t telegraph the test condition. Where you can’t avoid it (e.g., red-team scenarios), instrument the eval so the result is read as “behavior given the model knows it’s a test,” not “behavior in the wild.”
Watch for the long-form research post. “We hope that by sharing this technique it can help everyone building models to make them safer and more helpful” implies a published method spec is coming. Track Anthropic’s Research page.

Open Questions

Generalizes across models? The explainer says “Claude” but doesn’t specify which version was the translator vs target. Whether the technique works equally well on Mythos Preview vs Opus 4.7 vs Haiku is unstated.
Cost. Roundtrip translation is two extra Claude calls per activation snapshot. At eval scale this is non-trivial — a real safety eval pass might 10×–100× compute. Whether Anthropic will expose this as an API surface or keep it internal isn’t addressed.
Adversarial robustness. Could a sufficiently capable model strategically refuse to “narrate” certain thoughts (e.g., when it’s reasoning about deception)? The roundtrip-check catches mismatches but not deliberate omissions.
Relation to existing interpretability lineages. SAE features, circuit attribution, and dictionary learning are the established interpretability stack. The explainer doesn’t situate this method against those — whether it’s complementary, an alternative, or a synthesis isn’t stated.
Is it deployment-ready or research-only? Anthropic frames it as a “research method that takes a step in this direction” of mind-reading. Whether it’ll ship in production safety pipelines or stay in the research lab is open.

Claude Mythos Preview System Card — quantified ~29% evaluation awareness via white-box analysis. This explainer’s method is the public-facing companion finding.
Code with Claude 2026 — Opening Keynote — same conference week; Diane Penn’s “higher judgment + better quality co-taste” research direction is what this method serves.
Claude Dreaming — complementary self-improvement primitive at the platform layer.
Computer Use — earlier white-box-instrumented capability roll-out.
Claude AI topic index — Models + Research sections.
AI Industry Research — interpretability and alignment research watchlist.
Troubleshooting Claude — failure-mode framework that this method’s introspection technique is upstream of.

Jonathon's AI Wiki

Explorer

Translating Claude's Thoughts Into Language — Activation-to-Text Interpretability

Key Takeaways

How it works

Concrete examples cited

Why it matters

Try It

Open Questions

Graph View

Table of Contents

Backlinks

Jonathon's AI Wiki

Explorer

Translating Claude's Thoughts Into Language — Activation-to-Text Interpretability

Key Takeaways

How it works

Concrete examples cited

Why it matters

Try It

Open Questions

Related

Graph View

Table of Contents

Backlinks