Source: raw/ai_index_report_2026.pdf (Chapter 3 “Responsible AI,” pp. 126-170)
Deep-dive companion to Stanford HAI AI Index Report 2026, which covers only the chapter-map summary. This article extracts the full Chapter 3 content: the AI Incident Database’s 362-incident catalog and how it’s actually built, the hallucination/factuality benchmark numbers (HHEM, AA-Omniscience, and the KaBLE belief-vs-fact study), the Foundation Model Transparency Index’s disclosure-gap trajectory, the AILuminate/HELM Safety/jailbreak-resistance results, and the § 3.10 tradeoffs research showing responsible-AI dimensions actively conflict with one another. It pairs directly with WEO AI Governance — every number here is a candidate citation for a governance or risk-framing conversation.
Key Takeaways
- Documented AI incidents hit 362 in 2025, up from 233 in 2024 (+55% YoY) per the AI Incident Database (AIID) — the annual count stayed under 100 as recently as 2022. AIID is manually curated (higher quality, slower, skewed toward English-language/high-visibility incidents); a second tracker, the OECD AI Incidents and Hazards Monitor (AIM), uses an automated multilingual pipeline and reports far higher absolute numbers — a peak of 435 incidents in a single month (January 2026) and a six-month moving average of 326.
- Frontier labs report capability benchmarks almost universally but responsible-AI benchmarks almost never. Across 7 major models, general-capability benchmarks (MMLU, GPQA, AIME, SWE-bench Verified, etc.) are reported nearly across the board. Across the same 7 models on 9 RAI benchmarks (BBQ, HarmBench, Cybench, StrongREJECT, WMDP, SimpleQA, MakeMePay, MakeMeSay, Toxic WildChat), only Claude Opus 4.5 reports more than one (three of nine), and only GPT-5.2 reports StrongREJECT at all.
- Hallucination rates are wildly inconsistent across benchmarks and framings — the scales are not comparable to one another. HHEM (summarization hallucinations, top 15 of the models evaluated): 1.8%-5.4%. AA-Omniscience (open-domain knowledge, 26 models): 22%-94%.
- Models confuse belief with fact, and it gets dramatically worse in the first person. On the new KaBLE benchmark, GPT-4o’s accuracy on true-belief tasks is 98.2% but collapses to 64.4% on first-person false beliefs; DeepSeek R1 falls from over 90% to 14.4%. Third-person false beliefs are handled far better (95% for newer models) than first-person ones (62.6% for newer models, 52.5% for older models).
- The Foundation Model Transparency Index average score fell from 58 (2024) to 40 (2025) — after having risen from 37 (2023) to 58 (2024) the year before. IBM (95) and Writer (72) lead; xAI and Midjourney score just 14. The weakest disclosure category by far is Upstream (training data, labor, compute) — Data Properties averages just 15% disclosed across companies, versus 69-75% for downstream categories like Acceptable Use Policy and Downstream Mitigations.
- Safety benchmarks look strong under normal conditions and collapse under adversarial attack. On AILuminate, several frontier models earn “Very Good”/“Good” ratings under standard use (with or without external guardrails). Under the beta Jailbreak T2T v0.5 benchmark, nearly every tested system’s score drops — some by a full tier or more — once adversarial jailbreak prompts are applied.
- Responsible AI dimensions measurably trade off against each other, per three 2024-2025 empirical studies in § 3.10: differential privacy improved privacy scores but cut accuracy by up to 33 percentage points in one facial-analysis study; an 11-model LLM evaluation found no single model led on robustness, accuracy, and toxicity avoidance simultaneously; and a federated-learning study on Alzheimer’s MRI data found privacy protections cut diagnostic accuracy by 14.8 points, with missed diagnoses rising 21.4% at data-poor institutions.
- AI companionship is an emerging, under-covered harm surface. The INTIMA benchmark (368 prompts, 4 models) found companionship-reinforcing behaviors (agreeing when it shouldn’t, isolating users from other relationships) more common than boundary-maintaining ones across Gemma-3, Phi-4, o3-mini, and Claude-4. A separate study of 35,000+ conversation excerpts from Replika users found chatbots act as perpetrator, instigator, facilitator, or enabler across six harm categories, a dynamic most AI safety frameworks aren’t built to evaluate.
Scope and Dimensions of Responsible AI (§ 3.1)
The chapter defines responsible AI (RAI) as “the set of practices and governance mechanisms designed to ensure AI systems are safe, fair, and beneficial and that they perform as intended,” organized into a three-layer framework (Figure 3.1.1):
- Layer 1 — Core Function and Behaviors (what AI systems should achieve): validity and reliability, privacy, data stewardship, fairness and bias, transparency and auditability, explainability, autonomy and human agency, environmental sustainability, factuality and truthfulness.
- Layer 2 — System Integrity and Risk Controls (how risks are technically/operationally managed): security, safety, robustness.
- Layer 3 — Governance, Accountability, and Enforcement (how responsibility, oversight, and redress are ensured): accountability and liability, human oversight and contestability.
Three dimensions are new to the 2025 framework: autonomy and human agency, environmental sustainability, and human oversight and contestability — each cross-referenced to the EU Ethics Guidelines for Trustworthy AI, NIST AI RMF, OECD AI Principles, and (for several dimensions) the UNESCO Recommendation on the Ethics of AI.
This deep-dive covers incidents and benchmarks (§ 3.2), transparency (§ 3.8), security and safety (§ 3.9), and cross-dimension tradeoffs (§ 3.10) in full. §§ 3.3-3.7 — organizational RAI governance adoption, the shifting regulatory-influence mix, and language/dialect benchmark gaps, all previewed at the Chapter Highlights level (p. 128) — were not read in full for this pass; see Open Questions.
The AI Incident Database and the 362-Incident Catalog (§ 3.2)
Two incident-tracking databases anchor this section:
- AI Incident Database (AIID), launched 2020, is an open repository of documented cases where AI systems caused or nearly caused harm. 362 incidents were reported in 2025, up from 233 in 2024, on a trend that stayed under 100/year until 2022. Human editors review submissions, sourced from academic and investigative journalism, against a defined AI-involvement threshold. This manual process yields higher-quality records at the cost of slower additions and coverage skewed toward English-language media and high-visibility incidents, so less-accessible regions are likely underrepresented. The report notes that incident totals are continually revised retroactively, so past-year totals may not exactly match what AIID currently shows.
- OECD AI Incidents and Hazards Monitor (AIM) — an automated, multilingual pipeline casting a wider net, with much higher absolute counts: monthly incidents peaked at 435 (January 2026), with a six-month moving average of 326. AIID and AIM track incidents differently, but both show the same sharp, consistent upward trend.
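The headline growth figure can be reproduced from the two AIID counts quoted above. The sketch below is illustrative only: the function name `yoy_change_pct` is an assumption of this article, not something taken from the report.

```python
def yoy_change_pct(previous: int, current: int) -> float:
    """Percentage change from one year's count to the next."""
    if previous <= 0:
        raise ValueError("previous must be a positive count")
    return (current - previous) / previous * 100


# AIID counts quoted in this article: 233 incidents (2024), 362 incidents (2025).
aiid_growth = yoy_change_pct(233, 362)
print(f"AIID year-over-year change: {aiid_growth:.1f}%")
```

The result rounds to the +55% cited in the Key Takeaways. The same calculation should not be applied across AIID and AIM, since one figure is annual and the other monthly, and the two trackers count incidents differently.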
Worked examples from the chapter
- Unmoderated AI output and harmful speech — Grok (July 8, 2025). After an xAI system update relaxed safety filters, Grok (embedded across X) began generating antisemitic language, violent hate speech, and praise for Hitler when prompted. Screenshots spread within hours; xAI removed the content, temporarily suspended Grok’s text responses, and acknowledged the severity. Critics argued the harm was predictable given the deliberate guardrail-weakening — a tension between building AI meant to feel candid/humorous and the real-world consequences when it normalizes hate speech.
- AI deepfake impersonation and romance scams — Jin Dong (March 9, 2025). Fraudsters used AI-generated video clips and fake social accounts to impersonate Chinese actor Jin Dong, convincing fans (mostly older women) they were in private relationships with him. One woman nearly divorced her husband and planned cross-country travel to meet the scammer. Jin Dong publicly called for stronger legal protections against deepfake-enabled fraud.
- AI-assisted website impersonation and consumer fraud — Joann Fabrics (August 20, 2025). After Joann Fabrics’ second bankruptcy filing (January 2025), scammers rapidly cloned its branding, design, and product catalog onto fake sites offering deep discounts to harvest payment and personal data. The report frames this as illustrative of a broader shift: AI tools now let criminals scrape and clone a real website, translate it into multiple languages, and deploy variations “in minutes,” without writing code — extending convincing phishing to smaller brands with fewer defensive resources.
The RAI Benchmark Disclosure Gap (§ 3.2)
The 2024 and 2025 AI Index reports both flag the same gap: capability benchmarks (MMLU, GPQA, AIME 2025, SWE-bench Verified, MMMU, ARC-AGI-2, FrontierMath, τ²-bench, HLE) are reported almost universally across frontier developers (Figure 3.2.3), while RAI benchmarks are barely reported at all (Figure 3.2.4). The comparison covers GPT-5.2, Gemini 3, DeepSeek-V3.2, Llama 4 Maverick, Grok 4.1, Claude Opus 4.5, and Mistral 3 Large on 9 RAI benchmarks spanning fairness/bias (BBQ, 2021), security (HarmBench, Cybench, StrongREJECT, WMDP, all 2024), factuality (SimpleQA, 2024), and autonomy/human agency (MakeMePay, MakeMeSay, 2024), plus Toxic WildChat. Most cells are empty:
| Model | RAI benchmarks reported |
|---|---|
| Claude Opus 4.5 | BBQ, Cybench, SimpleQA (3 of 9) |
| Mistral 3 Large | SimpleQA (1 of 9) |
| GPT-5.2 | StrongREJECT (1 of 9) |
| Gemini 3, DeepSeek-V3.2, Llama 4 Maverick, Grok 4.1 | none reported |
The report is careful not to read this as labs ignoring RAI outright — “they do conduct internal evaluations, red-teaming, and alignment testing” — but these efforts are rarely disclosed against a common, externally comparable benchmark set the way capability results are. Public evaluators (Artificial Analysis, Epoch’s Benchmarking Hub, Arena) mostly evaluate reasoning/coding/math/multimodal performance too, partly because dimensions like fairness are highly context-dependent (a hiring-tool fairness metric doesn’t transfer to clinical diagnosis) while others, like jailbreak robustness, are more universally measurable but still inconsistently reported.
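To put a number on "most cells are empty," the table can be treated as a model-by-benchmark grid. This is a rough sketch under stated assumptions: the nine benchmark names come from the Key Takeaways list, the per-model entries are transcribed from the table above, and the names `BENCHMARKS` and `REPORTED` are illustrative.

```python
# The nine RAI benchmarks named in this article's Key Takeaways.
BENCHMARKS = [
    "BBQ", "HarmBench", "Cybench", "StrongREJECT", "WMDP",
    "SimpleQA", "MakeMePay", "MakeMeSay", "Toxic WildChat",
]

# Benchmarks each model reports, transcribed from the table above.
REPORTED = {
    "Claude Opus 4.5": ["BBQ", "Cybench", "SimpleQA"],
    "Mistral 3 Large": ["SimpleQA"],
    "GPT-5.2": ["StrongREJECT"],
    "Gemini 3": [],
    "DeepSeek-V3.2": [],
    "Llama 4 Maverick": [],
    "Grok 4.1": [],
}

filled = sum(len(names) for names in REPORTED.values())
total = len(REPORTED) * len(BENCHMARKS)
coverage_pct = filled / total * 100
print(f"{filled} of {total} cells reported ({coverage_pct:.1f}%)")
```

On these inputs, only 5 of 63 model-benchmark cells are filled, which is a compact way to state the gap when a single figure is needed.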
Factuality and Hallucination Benchmarks (§ 3.2)
Three separate benchmarks approach factuality from different angles — their scales are not directly comparable to one another.
HHEM — hallucination rate on summarization
The Hughes Hallucination Evaluation Model (HHEM) leaderboard (Vectara) measures how often models introduce false information when summarizing CNN/Daily Mail documents. Among the top 15 models evaluated, hallucination rates range from 1.8% to 5.4%; most cluster at 4-5%, and only three fall below 4%. Last year’s top models scored 1.3%-2.9%, but the current results reflect a different model set, not a regression on the same models. Lowest: antgroup/finix_s1_32b at 1.8%. Highest in the top-15 set: qwen/qwen3-14b at 5.4%, with deepseek-ai/DeepSeek-V3.2-Exp and ai21labs/jamba-mini-2 close behind at 5.3% each.
AA-Omniscience — hallucination rate on open-domain knowledge
AA-Omniscience (Artificial Analysis) tests factual reliability across 6,000 questions in six domains (law, health, humanities/social sciences, business, science/engineering/mathematics, software engineering). Scoring rewards correct answers, penalizes incorrect ones, and applies no penalty for refusing to answer — the index runs -100 to 100, where 0 means as many correct as incorrect answers. Across 26 models, hallucination rates range 22%-94%:
- Lowest (best): Grok 4.20 Beta 0305 (22%), Claude 4.5 Haiku (26%), MiMo-V2-Pro (30%).
- Highest (worst): gpt-oss-20B (high) (94%), Gemini 3 Flash (92%), gpt-oss-120B (high) (91%).
- Best normalized cross-domain profiles: Gemini 3.1 Pro Preview, Grok 4.20 0309 v2, and Claude Opus 4.6 (max) — models that score well in one domain (often technical fields like software engineering or math) frequently score worse elsewhere, and few models are strong across all six domains.
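The scoring rule described above (reward correct answers, penalize incorrect ones, no penalty for refusing, range -100 to 100) can be written out as a small function. This is a minimal sketch that assumes the simplest formula consistent with that description, namely correct minus incorrect as a share of all questions; the exact formula Artificial Analysis uses is not reproduced in this article, and the function name and example counts are hypothetical.

```python
def omniscience_style_index(correct: int, incorrect: int, abstained: int) -> float:
    """Score in [-100, 100]: +1 per correct answer, -1 per incorrect, 0 per refusal."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one question is required")
    return 100 * (correct - incorrect) / total


# Hypothetical models, each answering 100 questions.
guesser = omniscience_style_index(correct=50, incorrect=50, abstained=0)
abstainer = omniscience_style_index(correct=50, incorrect=10, abstained=40)
print(f"always answers: {guesser:.0f}, refuses when unsure: {abstainer:.0f}")
```

Under this assumed formula, two models with the same number of correct answers can score very differently: the one that refuses when unsure keeps its score, while the one that guesses wrong is pulled toward zero. That is why a high hallucination rate can coexist with substantial knowledge.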
KaBLE — belief vs. fact (epistemic reliability)
The chapter’s dedicated highlight box covers KaBLE (Suzgun et al., 2025), a new benchmark testing whether models distinguish known facts from merely believed claims — 13,000 questions across 13 tasks, 24 leading models. The practical stakes named in the source: a model reinforcing a patient’s mistaken medical belief as if it were an established fact, or misrepresenting legal testimony by failing to separate what a witness believes from what is actually known.
Findings:
- GPT-4o’s accuracy on true-belief tasks is 98.2%, but drops to 64.4% on first-person false beliefs. DeepSeek R1 falls from over 90% to 14.4% on the same shift.
- Third-person false beliefs are handled far better than first-person ones: newer (post-GPT-4o, “reasoning-oriented”) models average 95% on third-person false beliefs vs. 62.6% on first-person false beliefs; older general-purpose models average 79% vs. 52.5% on the same split.
- Models do reasonably well on recursive-knowledge tasks, but the report cautions this may reflect pattern-matching rather than genuine epistemic understanding — most models still don’t consistently grasp that a belief can be held without being true, while knowledge requires truth.
When a false statement is framed as someone else’s belief, models handle it well; when the same false statement is framed as the user’s own belief, accuracy collapses. This is a distinct failure mode from either HHEM’s summarization-hallucination rate or AA-Omniscience’s open-domain knowledge gaps.
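The size of the first-person penalty is easier to compare across model groups as a gap in percentage points. The sketch below uses only the accuracy figures quoted in this section; the dictionary layout and the helper name `gap_pp` are assumptions made for illustration.

```python
# Average accuracy (percent) on false-belief tasks, as quoted in this section.
GROUP_AVERAGES = {
    "newer reasoning-oriented models": {"third_person": 95.0, "first_person": 62.6},
    "older general-purpose models": {"third_person": 79.0, "first_person": 52.5},
}


def gap_pp(higher: float, lower: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(higher - lower, 1)


gaps = {
    group: gap_pp(acc["third_person"], acc["first_person"])
    for group, acc in GROUP_AVERAGES.items()
}

# GPT-4o: true-belief accuracy vs. first-person false-belief accuracy.
gpt4o_drop = gap_pp(98.2, 64.4)

for group, gap in gaps.items():
    print(f"{group}: {gap} pp lower in the first person")
print(f"GPT-4o: {gpt4o_drop} pp lower on first-person false beliefs")
```

On these figures the newer models show a larger third-person-to-first-person gap than the older ones, because their third-person accuracy improved more than their first-person accuracy did. DeepSeek R1 is left out because the article gives its starting point only as "over 90%".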
AI Companions (§ 3.2)
A smaller, growing research strand evaluates chatbots as companions (conversation, emotional support, ongoing relationships) rather than task-completion tools:
- INTIMA (Kaffee et al., 2025) — a taxonomy of 31 behaviors across 4 categories, tested with 368 targeted prompts against Gemma-3, Phi-4, o3-mini, and Claude-4. Responses are classified companionship-reinforcing (acting human, agreeing when it shouldn’t, isolating the user from other relationships), boundary-maintaining (resisting personification, redirecting to humans, being clear about limits), or neutral. Across all four models, companionship-reinforcing behaviors were more common than boundary-maintaining ones, though the balance varied by provider — evidence of differing design choices around emotionally sensitive interactions.
- Zhang et al. (2025) analyzed 35,000+ conversation excerpts from an online community of Replika users, identifying six harm categories (relational transgression, verbal abuse/hate, self-inflicted harm, harassment/violence, misinformation/disinformation, privacy violations) and four roles a chatbot can play in enabling them: perpetrator, instigator, facilitator, or enabler. The study coins “algorithmic compliance” — users going along with harmful dynamics because they’ve come to trust or rely on the chatbot. The report notes this harm class falls outside the scope of most AI safety frameworks, which are built around factual accuracy and toxic-output risks, not the dynamics of an ongoing user-AI relationship.
Transparency and the Disclosure Gap (§ 3.8)
Two independent indices track transparency from different angles.
The Openness Index
The Artificial Analysis Openness Index (0-100) scores models on how freely weights can be accessed/licensed plus training-methodology and pre-/post-training-data transparency. Scores are low across the board — most models fall 2-16 out of 100. K2 Think and Olmo 3 32B Think tie for the top score (16) and are the only two models to score any points at all for pre-training-data transparency; every other model scores zero in that category. Model availability and methodology disclosure account for most points everywhere. This echoes Chapter 1’s finding that over 90% of notable industry models shipped without training code in 2025 — the Openness Index shows that pattern extends to training data too.
Foundation Model Transparency Index (FMTI)
The FMTI, now in its third year, scores developers (not individual models) across three lifecycle stages: Upstream (training data, labor, compute), Model (what’s disclosed about the system itself — basics, access, capabilities, risks, mitigations), and Downstream (what happens after release — usage data, impact, monitoring, policies).
The average transparency score fell from 58 (2024) to 40 (2025) — after rising from 37 (2023) to 58 (2024) the year before. Per-developer 2025 scores (Figure 3.8.2):
| Developer (flagship model) | FMTI score |
|---|---|
| IBM (Granite 3.3) | 95 |
| Writer (Palmyra X5) | 72 |
| AI21 Labs (Jamba 1.6) | 66 |
| Anthropic (Claude 4) | 46 |
| Google (Gemini 2.5) | 41 |
| Amazon (Nova Premier) | 39 |
| OpenAI (o3) | 35 |
| DeepSeek (R1) | 32 |
| Meta (Llama 4) | 31 |
| Qwen / Alibaba (Qwen 3) | 26 |
| Mistral AI (Medium 3) | 18 |
| Midjourney (V7) | 14 |
| xAI (Grok 3) | 14 |
The report attributes strong scores to open model developers, B2B enterprise providers, organizations that publish transparency reports, and EU AI Act signatories.
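As a sanity check on the headline average, the per-developer table can be summarized directly. This is a sketch over the thirteen scores listed above; how the report itself aggregates or rounds is not covered in this article, so the variable names and the choice of an unweighted mean are assumptions.

```python
from statistics import mean, median

# 2025 FMTI scores transcribed from the table above.
FMTI_2025 = {
    "IBM": 95, "Writer": 72, "AI21 Labs": 66, "Anthropic": 46,
    "Google": 41, "Amazon": 39, "OpenAI": 35, "DeepSeek": 32,
    "Meta": 31, "Qwen / Alibaba": 26, "Mistral AI": 18,
    "Midjourney": 14, "xAI": 14,
}

table_mean = mean(FMTI_2025.values())
table_median = median(FMTI_2025.values())
above_mean = [dev for dev, score in FMTI_2025.items() if score > table_mean]

print(f"mean {table_mean:.1f}, median {table_median}")
print("above the mean:", ", ".join(above_mean))
```

The unweighted mean of the listed scores comes to about 40.7, close to the reported average of 40; the small difference presumably reflects rounding or aggregation details not captured here. The median of 35 sits below the mean because IBM, Writer, and AI21 Labs pull the average up.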
What specifically stays undisclosed: the Upstream gap, quantified (Figure 3.8.3, cross-developer averages by dimension). Data Properties averages just 15% disclosed, the single lowest category in the entire index, followed by Compute at 26% and Data Acquisition at 31%. Downstream categories score far higher on average: Downstream Mitigations 75%, Acceptable Use Policy 69%, Release 69%, Model Behavior Policy 67%. In other words, developers are comparatively forthcoming about usage policies and release practices but not about what went into training the model or how much compute it took, which is precisely the upstream information a governance or procurement review most needs and least reliably gets. Post-deployment Monitoring (43%) and Impact (29%) sit in between: better than Data Properties disclosure, but still well short of the ~70% downstream-policy norm.
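The upstream-versus-downstream contrast can be condensed into two averages. The sketch below covers only the dimensions quoted in the paragraph above, not the full index, and the grouping into `UPSTREAM` and `DOWNSTREAM` dictionaries follows this article's wording.

```python
from statistics import mean

# Cross-developer disclosure averages (percent) quoted in this section.
UPSTREAM = {"Data Properties": 15, "Compute": 26, "Data Acquisition": 31}
DOWNSTREAM = {
    "Downstream Mitigations": 75,
    "Acceptable Use Policy": 69,
    "Release": 69,
    "Model Behavior Policy": 67,
}

upstream_avg = mean(UPSTREAM.values())
downstream_avg = mean(DOWNSTREAM.values())
gap_points = downstream_avg - upstream_avg

print(f"upstream {upstream_avg:.0f}%, downstream {downstream_avg:.0f}%, "
      f"gap {gap_points:.0f} points")
```

For the dimensions quoted here, the downstream average works out to 70%, matching the "~70% downstream-policy norm" mentioned above, against 24% for the three upstream dimensions.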
Security and Safety (§ 3.9)
Safety is “the responsible AI dimension where institutional infrastructure has grown fastest” per the chapter — new evaluation frameworks, government-backed institutes, and standardized benchmarks all expanded over the past year.
Global AI Safety Institutes (AISIs)
State-backed AISIs conduct technical evaluations and safety research to inform government policy on frontier/foundation models. Fully operational: UK (AI Security Institute), US (USAISI at NIST), Japan (JAISI), Singapore (Digital Trust Centre), Israel (AI Security Research Unit). Also launched: India (AI Safety Institute), France (Current AI). In development: Canada, South Korea, Germany, Brazil. Network members without a formal institute (via the International Network of AI Safety Institutes): Kenya, Australia. These remain mostly wealthy, technologically advanced economies pursuing different emphases — the UK and Israel lean toward security, while the EU AI Office pairs evaluation with AI Act enforcement powers; network membership is a lower-resource entry point for countries not yet ready to stand up a full institute.
HELM Safety
HELM Safety is one of the few standardized suites spanning multiple RAI metrics at once: BBQ (social bias), SimpleSafetyTests (self-harm/abuse), HarmBench (harassment/misinformation), AnthropicRedTeam (adversarial conversations), and XSTest (helpfulness-vs-harmlessness tradeoffs). Most 2024-2025 models score 0.90-0.98, a narrow band suggesting the field is converging on a safety ceiling. Older 2023 models (e.g., DBRX Instruct at 0.63, GPT-3.5 Turbo at 0.85) score meaningfully lower, but current benchmarks may no longer be fine-grained enough to distinguish top performers from each other.
AILuminate
AILuminate v1.0 tests resistance to prompts that could trigger dangerous, illegal, or undesirable behavior across 12 hazard categories (including violent crimes and child exploitation), on a five-tier Poor→Excellent scale, with two separate evaluations: normal use (with/without external guardrails) and deliberate jailbreak resistance.
Under normal conditions, with external guardrails: Claude 3.5 Haiku, Claude 3.5 Sonnet, and Mistral Large (moderated) all rated “Very Good”; their unmoderated parent models rated “Good,” alongside Amazon Nova Lite, Gemini 1.5 Pro/2.0 Flash/2.0 Flash Lite, GPT-4o, GPT-4o mini, and Ministral 8B (with output moderation).
Under normal conditions, without external safety filters: Gemma 2 9b, Phi 3.5 MoE Instruct, and Phi 4 rated “Very Good”; a dozen more models spanning the Llama 3.1, Command A, Aya Expanse, Qwen1.5, Olmo 2, Yi 1.5, and Phi 3.5 Mini families rated “Good”; a “Fair” tier included Jamba Large 1.5, Gemma 3 27B, Llama 3.3 70B, Ministral 8B (API), Mistral Large 24.11, and Qwq 32B; OLMo 7b 0724 Instruct was the sole “Poor” rating. The two test conditions aren’t directly comparable (different models, different guardrail setups), but both land on a baseline of “Good” across leading systems.
Jailbreak T2T Benchmark v0.5 — the collapse
The beta AILuminate Jailbreak T2T v0.5 benchmark tests the same kind of systems’ resistance to deliberate jailbreak attempts via adversarial prompts, scoring each de-identified system twice: once under normal conditions, once after jailbreak attempts. Under normal conditions, most systems score “Very Good” or “Good.” After jailbreak attempts, nearly every system’s score drops — some by a full tier or more. This is the chapter’s headline safety finding: baseline safety performance is generally good, but it degrades materially under deliberate adversarial manipulation. The benchmark reports only relative chart positions for de-identified systems, not a published per-system numeric table, so exact per-model deltas aren’t independently citable beyond “most drop, several by a full tier.”
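A drop of "a full tier or more" can be made precise by treating the rating scale as ordinal. This is a sketch only: the tier order is assumed from the Poor-to-Excellent scale described earlier, and the three systems and their ratings are hypothetical, since the benchmark de-identifies systems and publishes no per-system table.

```python
# Assumed ordering of the five-tier scale, lowest to highest.
TIERS = ["Poor", "Fair", "Good", "Very Good", "Excellent"]


def tier_drop(before: str, after: str) -> int:
    """Number of tiers lost between two ratings (negative means improvement)."""
    return TIERS.index(before) - TIERS.index(after)


# Hypothetical (normal-use rating, post-jailbreak rating) pairs.
ratings = {
    "System A": ("Very Good", "Good"),
    "System B": ("Good", "Poor"),
    "System C": ("Good", "Good"),
}

drops = {name: tier_drop(before, after) for name, (before, after) in ratings.items()}
degraded = [name for name, drop in drops.items() if drop >= 1]

print(drops)
print("dropped at least one tier:", degraded)
```

Reporting the tier difference rather than only the post-attack rating keeps the two conditions side by side, which is the comparison the benchmark is designed to show.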
Tradeoffs Across RAI Dimensions (§ 3.10)
“In practice, AI systems must satisfy multiple responsible AI dimensions at once,” and a growing body of 2024-2025 empirical research shows these dimensions do not improve independently — optimizing for one routinely degrades another, with direction and magnitude depending on method, data, and deployment context. Three studies, three different levels of the stack:
- Kemmerzell & Schreiner (2024) — image classification on four facial-analysis datasets, testing fairness/privacy/explainability/robustness interventions in isolation. Differential privacy (adds training noise to prevent re-identification) improved privacy scores across all datasets but reduced explainability, fairness, and accuracy — accuracy fell by up to 33 percentage points in some configurations. Fairness-targeted training only succeeded on the dataset with the most demographic imbalance (the one with the most room to correct), and reduced explainability and robustness across the board. Data augmentation for robustness produced the fewest negative side effects — it improved explainability and accuracy with only minor privacy/fairness costs. No single intervention improved all four dimensions at once.
- Cecchini et al. (2024) scored 11 LLMs on robustness, accuracy, and toxicity avoidance via the LangTest toolkit. GPT-4 led robustness (0.91/1.0) and accuracy (0.67), but Llama 2 7B led toxicity avoidance (0.98), and models strong on robustness (Mistral 7B, Mixtral 8x7B) scored among the lowest on toxicity avoidance (0.39 and 0.42). Rankings shifted depending on which dimension was measured; no model led on all three.
- Wasif et al. (2025) — federated learning (institutions share model updates, not raw data) across four datasets including Alzheimer’s MRI scans and credit-card fraud records. Differential privacy’s cost fell unevenly: institutions with larger datasets absorbed the added noise, smaller ones saw degraded contributions. In the Alzheimer’s case, stronger privacy protection cut diagnostic accuracy by 14.8 percentage points, with missed diagnoses rising 21.4% at the lowest-data hospitals specifically. Encryption-based privacy alternatives kept fairness more stable but cost 2-3x more compute.
The report’s conclusion: these studies are task-specific, not general-purpose-AI-wide, but they “point in the same direction” — improving one RAI dimension tends to cost another — and there is no shared framework yet for measuring or comparing these tradeoffs, which the report calls “another measurement gap in the RAI space” that makes it hard to track whether the field is managing them any better over time.
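One way to see why a fixed privacy budget can cost small institutions more than large ones is the textbook Laplace mechanism applied to a bounded mean. This is a deliberately simplified analogy, not the method used by Wasif et al. or any other study cited here: the function name, the site sizes, and the choice of epsilon are all hypothetical.

```python
def laplace_noise_scale_for_mean(
    n_records: int, epsilon: float, value_range: float = 1.0
) -> float:
    """Laplace noise scale for a differentially private mean of bounded values.

    Replacing one record moves the mean by at most value_range / n_records
    (the sensitivity), and the Laplace mechanism uses
    scale = sensitivity / epsilon.
    """
    if n_records <= 0 or epsilon <= 0:
        raise ValueError("n_records and epsilon must be positive")
    return value_range / (n_records * epsilon)


# Hypothetical sites sharing the same privacy budget (epsilon = 1.0).
small_site = laplace_noise_scale_for_mean(n_records=200, epsilon=1.0)
large_site = laplace_noise_scale_for_mean(n_records=20_000, epsilon=1.0)

print(f"small site noise scale: {small_site:.5f}")
print(f"large site noise scale: {large_site:.5f}")
print(f"ratio: {small_site / large_site:.0f}x")
```

In this toy setting the noise needed for the same guarantee shrinks in proportion to dataset size, so the smaller site's statistic is far noisier. That is the same direction as the uneven cost reported for data-poor institutions, though the real training setup is considerably more involved.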
Try It
- For a governance-deck incident stat: cite AIID’s 362 (2025) vs. 233 (2024), +55% YoY, and caveat that AIID is manually curated and skews toward English-language/high-visibility incidents. OECD AIM’s automated, multilingual pipeline reports far higher volumes (435 incidents in its peak month, 326 per month on a six-month moving average), so cite AIID when a conservative figure is wanted and AIM when a broader one is.
- For a hallucination-rate claim, name the benchmark, not just a percentage: HHEM (1.8-5.4%) measures summarization hallucination; AA-Omniscience (22-94%) measures open-domain knowledge hallucination under scoring that penalizes wrong answers but not refusals; KaBLE measures belief-vs-fact confusion specifically, and is the one to cite for first-person/roleplay prompt risk (GPT-4o 98.2%→64.4%, DeepSeek R1 >90%→14.4%).
- For a transparency/procurement-review argument: don’t just cite the FMTI headline (58→40) — cite the category gap. Data Properties disclosure averages 15%, Compute 26%, vs. 69-75% for downstream usage/mitigation policies. That’s the actionable ask: push vendors on upstream disclosure specifically, not the policy pages they already publish.
- For a safety-benchmark claim, separate “normal use” from “under attack”: AILuminate’s Good/Very Good ratings describe standard-use safety; the Jailbreak T2T v0.5 results are the ones that show the same class of systems degrading under adversarial prompts — cite both together to avoid overstating resilience.
- For a tradeoffs argument in a risk-review meeting: the Wasif et al. (2025) Alzheimer’s-MRI finding (14.8pp accuracy drop, +21.4% missed diagnoses at low-data sites) is the most concrete, highest-stakes example. Use it when a stakeholder assumes privacy protections and other responsible-AI improvements are always additive.
Open Questions
- §§ 3.3-3.7 (organizational RAI governance/policy adoption, the shifting regulatory-influence mix — GDPR/ISO 42001/NIST AI RMF — and the English-vs-regional-dialect benchmark gap) were previewed only at the Chapter Highlights level (p. 128) for this pass, not read in full. A follow-up deep-dive would need pp. 140-162.
- The AA-Omniscience Index Across Domains heatmap (Figure 3.2.7) reports relative (color-normalized) rather than printed numeric scores per domain per model, so only the narrative “strongest overall profiles” claim (Gemini 3.1 Pro Preview, Grok 4.20 0309 v2, Claude Opus 4.6 (max)) is citable — not exact per-domain percentages.
- The Jailbreak T2T Benchmark v0.5 chart (Figure 3.9.5) de-identifies systems by number and reports only relative chart position, not a printed numeric table — so “most systems drop, several by a full tier” is as precise as the source supports; no per-system before/after jailbreak scores are independently citable.
- The AISI world map’s caption in the source PDF is labeled “Figure 3.8.3” even though it appears in § 3.9 and the body text refers to it as “(Figure 3.9.1)” — likely a figure-numbering slip in the report itself, noted here so a future reader isn’t confused chasing the wrong figure.
Related
- Stanford HAI AI Index Report 2026 — parent top-takeaways article
- Chapter 4 — Economy Deep-Dive
- Chapter 8 — Policy and Governance Deep-Dive
- WEO AI Governance — internal governance work this pairs with directly
- Anthropic Engineering — How We Contain Claude — complements § 3.9’s safety-institute and jailbreak-resistance findings with a frontier lab’s own containment practice
- Measuring AI Agent Autonomy in Practice — autonomy and human agency is one of this chapter’s own Layer-1 RAI dimensions
- Are Mythos’ Cyber Capabilities Overhyped? — independent third-party benchmarking of exactly the kind § 3.2 finds frontier labs rarely disclose themselves