Jonathon's AI Wiki
Tag: alignment
2 items with this tag.
May 08, 2026
Translating Claude's Thoughts Into Language — Activation-to-Text Interpretability
interpretability
mechanistic-interpretability
activations
alignment
safety-evals
anthropic-research
blackmail-test
self-introspection
claude-mythos-preview
May 05, 2026
Claude Mythos Preview — Anthropic System Card (April 7 2026)
claude-mythos-preview
system-card
anthropic
rsp-3-0
frontier-model
alignment
model-welfare
project-glasswing
cybersecurity
opus-4-6
defensive-cyber
not-generally-available