Jonathon's AI Wiki

Tag: alignment

2 items with this tag.

  • May 08, 2026

    Translating Claude's Thoughts Into Language — Activation-to-Text Interpretability

    • interpretability
    • mechanistic-interpretability
    • activations
    • alignment
    • safety-evals
    • anthropic-research
    • blackmail-test
    • self-introspection
    • claude-mythos-preview

  • May 05, 2026

    Claude Mythos Preview — Anthropic System Card (April 7 2026)

    • claude-mythos-preview
    • system-card
    • anthropic
    • rsp-3-0
    • frontier-model
    • alignment
    • model-welfare
    • project-glasswing
    • cybersecurity
    • opus-4-6
    • defensive-cyber
    • not-generally-available
