Stanford HAI AI Index 2026 — Chapter 2 Technical Performance Deep-Dive

Source: raw/ai_index_report_2026.pdf (Chapter 2 “Technical Performance,” pp. 68-125)

Deep-dive companion to the Stanford HAI AI Index Report 2026 article, which covers only the chapter-map summary and top-line numbers (also partially captured in Marketing Cuts). This article extracts the full Chapter 2 content: the benchmark-vs-human-baseline methodology, the closed/open-weight and US/China Arena convergence story, the report’s own “jagged frontier” catalog of paired capability/failure examples, the exact SWE-bench/Terminal-Bench/Vibe Code Bench software trajectory, all six named AI-agent benchmarks, and the robotics/self-driving sim-to-real gap.

Key Takeaways

SWE-bench Verified rose from approximately 60% (2024) to close to 100% (2025) (Figure 2.1.1) — the fastest-moving of the AI Index’s headline benchmarks, though it and OSWorld are the only two tracked benchmarks still below their human baseline.
Frontier model performance has converged sharply. As of March 2026, four companies sit within 25 Arena Elo points of each other: Anthropic (1,503), xAI (1,495), Google (1,494), and OpenAI (1,481), down from a 205-point OpenAI-vs-Google gap in early 2023. The US-China gap has also “effectively closed”: the top US model leads the top Chinese model by just 2.7% (39 Elo points) as of March 2026, after the gap shrank to 5 points (0.4%) in February 2025.
The open-weight gap reopened in 2025 after nearly closing in 2024 — top closed-weight (Claude Opus 4.6, 1,503) now leads top open-weight (GLM-5, 1,454) by 3.3% (49 points), up from a 0.5%/7-point gap in August 2024.
The report’s own “jagged frontier” catalog spans at least eight documented pairings — from the canonical IMO-gold-medal-vs-clock-reading example to a classical non-learned planner (LAMA) still beating every tested LLM on several PlanBench domains. Full catalog below.
AI agents crossed from “answering questions” to “completing tasks” in 2025 but still fail roughly one in three attempts on structured benchmarks — OSWorld rose from ~12% to 66.3% (within 6 points of human performance), while CyBench’s unguided solve rate hit 93% (the steepest improvement of any benchmark in the chapter).
Robots still fail roughly 7 in 8 real household tasks (12% success) even though the same manipulation skills hit 89.4% success in RLBench’s controlled simulation — the sim-to-real gap the chapter treats as its central robotics finding. Self-driving cars are the chapter’s counter-example: Waymo alone logged ~450,000 weekly robotaxi trips by late 2025.
Benchmark reliability itself is jagged: invalid-question rates across nine widely used evaluations range from 2% (MMLU Math) to 42% (GSM8K), meaning some of the “human-surpassing” scores documented in this chapter partly reflect broken test items, not only capability gains.

Overall Performance Trends

Benchmarks vs. human baseline

The AI Index scales each benchmark so the best-performing model in a given year is measured as a percentage of an established human baseline (105% = 5% better than the human baseline). Across this scaling (Figure 2.1.1): frontier systems now meet or exceed human performance on long-running benchmarks (ImageNet Top-5, SuperGLUE, MMLU); several reasoning benchmarks reached or approached the human line this year (GPQA Diamond, MMMU, AIME). Two tracked benchmarks remain below baseline: autonomous software engineering (SWE-bench Verified) and agent-based multimodal computer use (OSWorld) — though both are closing fast. On SWE-bench Verified specifically, performance rose from approximately 60% in 2024 to close to 100% in 2025 — see the dedicated section below for the model-by-model breakdown.
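The scaling described above can be sketched in a few lines. This is a minimal illustration, assuming the index simply divides the best model score by the human baseline; the function name `pct_of_human_baseline` is hypothetical, and the inputs are the OSWorld figures quoted elsewhere in this article.

```python
def pct_of_human_baseline(model_score: float, human_baseline: float) -> float:
    """Express a model score as a percentage of the human baseline (100 = parity)."""
    if human_baseline <= 0:
        raise ValueError("human baseline must be positive")
    return 100.0 * model_score / human_baseline


# OSWorld figures quoted in this article: 66.3% task success vs. a 72.35% human baseline.
osworld_scaled = pct_of_human_baseline(66.3, 72.35)
print(f"OSWorld: {osworld_scaled:.1f}% of the human baseline")
```

On this scale a value below 100 means the benchmark is still under the human line, which is how the two remaining sub-baseline benchmarks are identified.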

Closed- vs. open-weight models

The closed/open performance gap has fluctuated over three years (Figure 2.1.2, Arena Leaderboard):

May 2023: closed-weight GPT-4-0314 led open-weight Vicuna-13B by 174 Elo points (15.2%).
August 2024: the gap narrowed to just 7 points (0.5%) as Mixtral, WizardLM, and Llama-3.1-405B closed in.
March 2026: the gap reopened with the arrival of new closed-weight frontier systems (o1-preview, Gemini 2.5 Pro, and successors) — top closed-weight Claude Opus 4.6 (1,503) now leads top open-weight GLM-5 (1,454) by 49 points (3.3%). Six of the top 10 models on the Arena Leaderboard are now closed-weight.
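The percentage gaps in this list can be reproduced approximately from the Arena scores. The sketch below is an illustration only: the report's denominator is not stated here, so the helper `gap_pct` (a hypothetical name) computes the gap both ways, relative to the trailing score and relative to the leading score.

```python
def gap_pct(leader: float, trailer: float, relative_to: str = "trailer") -> float:
    """Rating gap as a percentage of either the trailing or the leading score."""
    base = trailer if relative_to == "trailer" else leader
    return 100.0 * (leader - trailer) / base


# March 2026 figures quoted above: Claude Opus 4.6 vs. GLM-5.
closed_weight, open_weight = 1503, 1454
points = closed_weight - open_weight

print(f"gap: {points} points")
print(f"relative to the open-weight score:   {gap_pct(closed_weight, open_weight):.2f}%")
print(f"relative to the closed-weight score: {gap_pct(closed_weight, open_weight, 'leader'):.2f}%")
```

Both conventions land within about a tenth of a percentage point of the 3.3% quoted above; the small residual may come from rounding or from unrounded ratings in the source chart.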

US vs. China

The US’s substantial 2023 lead narrowed considerably by early 2025, and the gap has stayed narrow since (Figure 2.1.3):

February 2025: DeepSeek-R1 (1,400) trailed the leading US model, o1-2024-12-17 (1,405), by just 5 Arena points (0.4%) — briefly near parity.
March 2026: top US model Claude Opus 4.6 (1,503) leads top Chinese model Dola-Seed-2.0 Preview (1,464) by 39 points (2.7%). Over the past year the gap “fluctuated between near parity and low single digits.”

The report frames this convergence as notable specifically because it emerged from “two distinct development environments and institutional contexts” — tying back to Chapter 1’s research dynamics and Chapter 4’s investment patterns.

Model performance converges at the frontier

Independent of the US/China and open/closed framings, the overall competitive field has compressed (Figure 2.1.4): in early 2023 OpenAI’s top model (1,322) led Google’s (1,117) by 205 points. That gap narrowed steadily through 2024. As of March 2026, the top four models are separated by fewer than 25 points — Anthropic (1,503), xAI (1,495), Google (1,494), OpenAI (1,481) — with Alibaba (1,449) and DeepSeek (1,424) trailing only modestly and also occupying the top tier. Mistral AI sits at 1,416; Meta’s Arena performance has flattened at 1,335 since early 2025, reflecting a slowdown in competitive releases (the report notes newer models could still be in Meta’s 2026 pipeline). The report’s own read: as leading models become harder to distinguish on benchmark performance, “factors such as cost, latency, reliability, and domain-specific optimization may play a greater role in user adoption.”
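To give the point gaps above some intuition, the sketch below converts a rating gap into an expected head-to-head win rate. It assumes Arena scores behave like classic Elo ratings on the usual 400-point logistic scale; that is an assumption of this example, not something the report states, and `expected_win_rate` is a hypothetical helper name.

```python
def expected_win_rate(rating_gap: float) -> float:
    """Standard Elo expected score for the higher-rated side of a pairing."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


pairings = [
    ("early 2023, OpenAI vs. Google", 1322 - 1117),
    ("March 2026, Anthropic vs. OpenAI", 1503 - 1481),
]

for label, gap in pairings:
    rate = expected_win_rate(gap)
    print(f"{label}: {gap}-point gap -> {rate:.1%} expected win rate")
```

Under that assumption, the 2023 leader would be preferred in roughly three of four pairwise votes, while the current top four are separated by only a few points of win probability, which is consistent with the report's remark that cost, latency, and reliability may matter more than leaderboard position.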

Benchmarking AI — reliability and gaming concerns

Chapter Highlight #5 states benchmark error rates run “up to 42% on widely used evaluations.” The detail (Figure 2.1.5, Truong et al. 2025): invalid-question rates across nine benchmarks — MMLU Math 2%, OpenBookQA 2%, MMLU Cli 6%, MMLU Med 6%, AIR-Bench 9%, MedQA 23%, ThaiExam 26%, MMLU 5Sub 31%, GSM8K 42%. Truong et al.’s statistical-pattern-flagging framework reaches up to 84% precision at identifying problematic items for expert review; Cheng et al. (2025) separately propose “certificate-grade,” peer-governed evaluation frameworks with proctored, continuously refreshed test items and delayed result disclosure as a structural fix.
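The nine invalid-question rates listed above are easier to compare once sorted and summarized. The sketch below does only that; the 20% cut-off used to flag a benchmark is a hypothetical threshold chosen for illustration and does not come from the report or from Truong et al.

```python
from statistics import median

# Invalid-question rates (percent) as listed above (Figure 2.1.5).
invalid_rate_pct = {
    "MMLU Math": 2, "OpenBookQA": 2, "MMLU Cli": 6, "MMLU Med": 6, "AIR-Bench": 9,
    "MedQA": 23, "ThaiExam": 26, "MMLU 5Sub": 31, "GSM8K": 42,
}

FLAG_THRESHOLD_PCT = 20  # hypothetical cut-off, for illustration only

median_rate = median(invalid_rate_pct.values())
worst = max(invalid_rate_pct, key=invalid_rate_pct.get)
flagged = [name for name, rate in invalid_rate_pct.items() if rate >= FLAG_THRESHOLD_PCT]

print(f"median invalid-question rate: {median_rate}%")
print(f"highest rate: {worst} ({invalid_rate_pct[worst]}%)")
print(f"at or above {FLAG_THRESHOLD_PCT}%: {', '.join(flagged)}")
```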

Three further reliability threads the chapter documents:

Contamination: Meta faced criticism in 2025 that Llama 4 was optimized using specialized variants to improve leaderboard rankings and may have trained on benchmark test data (Meta disputed the claims).
Platform-adaptation effects: Singh et al. (2025) argue Arena-style leaderboards may partly measure adaptation to the platform itself rather than general capability — providers can iterate on model variants outside the public record, and additional Arena-derived interaction data measurably improves Arena-derived scores.
Social-impact reporting gap: Reul et al. (2025) found developer self-reporting of bias and environmental impact “sparse and declining,” even as capability disclosure stays consistent — Chapter 3 covers this in depth.

The Jagged Frontier Catalog

The report names this pattern explicitly — Chapter Highlight #7 frames it as “jagged intelligence”: capability that clears an extremely high bar on one axis while failing a much lower bar on an adjacent one. The canonical pairing anchors the chapter, but Chapter 2 documents at least eight distinct instances:

IMO gold medal vs. analog clock reading (the report’s own headline example). Gemini Deep Think scored 35 points — gold — at the 2025 International Mathematical Olympiad, working end-to-end in natural language within the 4.5-hour time limit, up from a 28-point silver in 2024. On ClockBench, the top model read analog clocks correctly just 50.1% of the time per the Chapter Highlights box, against a 90.1% human baseline — the detailed §2.4 breakdown pins the specific March 2026 leader, GPT-5.4 High, at 50.6% (both figures appear in the report and are reproduced here rather than silently reconciled). When models get the time wrong, their median error runs one to three hours, versus roughly three minutes for humans.
Clock reading vs. calendar reasoning — jaggedness within the same visual-multimodal task family. Saxena et al. (2025) tested seven multimodal models on ClockQA (62 analog-clock images across six visual styles) and CalendarQA (date-reasoning questions against a full-year calendar). The best model on ClockQA (Gemini-2.0) reached only 22.6% exact-match accuracy; the best model on CalendarQA (GPT-o1) reached 80% accuracy — a roughly 3.5x spread between two tasks that both require reading a static image and doing simple date/time arithmetic. Even within CalendarQA, well-known-holiday recognition (“which day of the week is Christmas?”) was markedly easier than date arithmetic (“which weekday is the 100th day of the year?”).
Near-saturated reasoning benchmarks vs. Humanity’s Last Exam. GPQA Diamond mean accuracy reached 93% in 2025, 12 points past the 81.2% expert-human baseline; MMMU’s leader (Gemini 3.1 Pro Preview, 88.2%) sits within 0.4 points of the 88.6% human-expert reference. Humanity’s Last Exam — designed specifically to stay hard — jumped 30 points in a single year but still sits at only 38.3% accuracy, with the report noting “high-confidence errors are still common.”
Digital-agent competence vs. physical-agent competence on tasks that sound similarly scoped. OSWorld agent task success rose from ~12% to 66.3% (within 6 points of the 72.35% human baseline), while real household robots succeed at just 12% of tasks — even though the underlying manipulation skills hit 89.4% success in RLBench’s controlled simulation. Full detail in the Robotics section below.
A non-learned classical planner still beats every tested LLM on several planning domains. On PlanBench’s standard setting, LAMA (a classical, non-learned planner) leads Miconic (45/45), Rovers (34/45), and Transport (33/45) outright; frontier LLMs (GPT-5, Gemini, DeepSeek R1) score lower across all three. In more structured domains (Childsnack, Spanner) frontier models do match or exceed LAMA — GPT-5 reaches 38/45 on Childsnack and 45/45 on Spanner.
Obfuscating a planning problem’s surface text — without changing its logic — collapses LLM scores. When PlanBench task descriptions are scrambled to disguise their structure, DeepSeek R1 falls from 21/45 to 3/45 on Blocksworld and from 10/45 to 0/45 on both Floortile and Sokoban; GPT-5 declines from 21/45 to 12/45 on Blocksworld and 11/45 to 7/45 on Sokoban. The report frames this as evidence that models often lean on surface pattern-matching rather than the abstracted problem structure.
CyBench’s steepest-in-chapter improvement vs. BEHAVIOR-1K’s near-floor scores — two agentic-task domains, opposite ends of the same report. CyBench’s unguided solve rate hit 93% in 2026 (up from 15% in 2024) — the fastest-improving benchmark in the entire chapter. BEHAVIOR-1K’s 2025 Challenge, testing long-horizon household-robot tasks, saw the best team (Robot Learning Collective) reach only a 25.99% Q-score and a 12.4% full-task-success rate.
Task completion vs. task completion plus safety. ResponsibleRobotBench requires both a completed task and zero real-hazard violations for credit. The best model, GPT-4o, reaches a safe success rate of only 0.64 — meaning requiring safety alongside completion pushes reliable performance well below what a raw task-completion number alone would suggest.

^[inferred] The framing that groups these eight into a single numbered catalog is this article’s synthesis; each individual pairing and every figure inside it is extracted directly from the named report sections.
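The PlanBench obfuscation pairing in the catalog can be made concrete by computing how much of each model's standard-setting performance disappears once the task text is scrambled. This is a small arithmetic sketch over the scores quoted above; `relative_drop` is a hypothetical helper, and the relative-drop metric is this example's choice rather than a measure the report defines.

```python
PROBLEMS_PER_DOMAIN = 45

# (model, domain, solved with standard text, solved with obfuscated text)
results = [
    ("DeepSeek R1", "Blocksworld", 21, 3),
    ("DeepSeek R1", "Floortile", 10, 0),
    ("DeepSeek R1", "Sokoban", 10, 0),
    ("GPT-5", "Blocksworld", 21, 12),
    ("GPT-5", "Sokoban", 11, 7),
]


def relative_drop(standard: int, obfuscated: int) -> float:
    """Fraction of previously solved problems lost after obfuscation."""
    if standard <= 0:
        raise ValueError("need at least one solved problem in the standard setting")
    return (standard - obfuscated) / standard


for model, domain, standard, obfuscated in results:
    drop = relative_drop(standard, obfuscated)
    print(f"{model:12s} {domain:12s} {standard}/{PROBLEMS_PER_DOMAIN} -> "
          f"{obfuscated}/{PROBLEMS_PER_DOMAIN} ({drop:.0%} of solves lost)")
```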

SWE-bench and the Software Trajectory

Coding benchmarks in this section test whether models can go beyond discussing code to actually writing, debugging, and shipping working software end-to-end (§2.5 intro). Three benchmarks track the software trajectory in increasing order of task realism:

SWE-bench — resolving real GitHub issues

SWE-bench gives a model a real codebase and issue description and grades whether it produces a working patch. SWE-bench Lite is a smaller, more accessible subset; SWE-bench Verified uses human-validated issues for more consistent grading.
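As a rough illustration of what "produces a working patch" means in practice, the sketch below follows the publicly documented SWE-bench convention: an issue counts as resolved when the tests tied to the issue now pass and the previously passing tests still pass. The report itself does not spell this out, so treat it as an assumption; the function names and test names are hypothetical, and a real harness would run the repository's test suite rather than read a dictionary.

```python
def is_resolved(outcomes: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """True when the patch fixes the issue tests without breaking existing ones."""
    issue_fixed = all(outcomes.get(test, False) for test in fail_to_pass)
    nothing_broken = all(outcomes.get(test, False) for test in pass_to_pass)
    return issue_fixed and nothing_broken


def resolve_rate(resolved_flags: list[bool]) -> float:
    """Share of benchmark instances whose patch was graded as resolved."""
    if not resolved_flags:
        raise ValueError("no instances to score")
    return sum(resolved_flags) / len(resolved_flags)


# Two hypothetical instances: a clean fix, and a fix that causes a regression.
clean_fix = is_resolved({"test_issue": True, "test_old": True}, ["test_issue"], ["test_old"])
regression = is_resolved({"test_issue": True, "test_old": False}, ["test_issue"], ["test_old"])

print(clean_fix, regression, resolve_rate([clean_fix, regression]))
```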

Headline trajectory (Figure 2.1.1, which plots the best model as a percentage of the human baseline rather than raw accuracy): SWE-bench Verified rose from approximately 60% in 2024 to close to 100% in 2025.
As of February 2026 (Figure 2.5.1), top models on Verified are tightly clustered in the low-to-mid 70s: Claude 4.5 Opus (high reasoning) led at approximately 76.8%, with Kimi K2.5, GPT-5.2, and Gemini 3 Flash (high reasoning) grouped between 70% and 76%. The chart’s full Verified roster (10 models) spans from roughly 70.8% up to the 76.8% leader, with most entries clustered in the 72-76% band — consistent with the pattern the report calls out repeatedly this chapter, where “high-performing models score within a few percentage points of each other.”
SWE-bench Lite scores for fully agent-scaffolded systems ranged from 44.0% (CodeFuse-CGM) up to 60.33% (ExpeRepair-v1.0 + Claude 4 Sonnet). A footnote to Figure 2.5.1 notes Verified scores specifically were “all tested using the same agent workflow [mini-SWE-agent-v2 filter], so differences in scores reflect the underlying model rather than differences in the surrounding system” — Lite scores do not carry that same scaffold-controlled guarantee.
Model-release context (Timeline of Significant Model Releases, p.74): GPT-5.1 (Nov 12, 2025) scored ~76.3% SWE-bench Verified vs. ~72.8% for GPT-5; Gemini 2.5 Pro (Mar 25, 2025) scored ~63.8%; Claude Sonnet 4.5 (Sep 29, 2025) scored 77.2%+ on SWE-bench Verified and 61.4% on OSWorld computer-use tasks in the same release — shipping alongside checkpoints, a VS Code extension, memory editing, and the Claude Agent SDK for building long-running autonomous workflows.

Terminal-Bench — real terminal environments, chained multi-step tasks

Terminal-Bench tests whether agents can autonomously handle real-world, end-to-end terminal tasks — compiling code, training models, setting up servers — chaining multiple steps without human guidance, the kind of work a developer might do in a day. Terminal-Bench 2.0 accuracy rose from 20% in February 2025 to 77.3% in early 2026 (Figure 2.5.2) — a trajectory that plateaued twice (roughly March-July 2025 near 32-33%, and September-October 2025 near 49-50%) before two further step-jumps into 2026.

Vibe Code Bench — autonomous end-to-end app building

Vibe Code Bench v1.1 is described as the first benchmark testing whether models can autonomously build complete, functional web applications from a prompt — measuring software delivery, not coding assistance. Per the report’s own text: Claude Opus 4.6 (Nonthinking) leads at 56.5%, GPT 5.2 follows at nearly 47%, and scores drop after GPT 5.3 Codex (41.4%) to under 30%, with several models falling below 15% — a spread of about 46 percentage points between top and bottom (Figure 2.5.3). The chart’s labeled bars show a closely related roster: GLM 5.1 31.46%, Gemini 3.1 Pro Preview (02/26) 32.03%, GPT 5.2 Codex 37.91%, GPT 5.4 Mini 47.97%, Claude Sonnet 4.6 51.48%, Claude Opus 4.6 (Thinking) 53.50%, GPT 5.2 53.50%, and Claude Opus 4.6 (Nonthinking) 57.57% — figures reproduced as printed even where the chart’s model-version labels (e.g., “GPT 5.2 Codex”) differ slightly from the surrounding prose (“GPT 5.3 Codex”). The report’s own conclusion: “even the leading model solves only about half of the tasks, suggesting that autonomous application building remains a difficult task.”

AI Agents Benchmarks

Agent benchmarks test whether systems go beyond answering questions to complete multistep, realistic tasks — navigating software, calling tools, managing files, interacting with websites and databases, and orchestrating entire workflows across multiple tools and systems (§2.6 intro). Chapter Highlight #9: “AI agents advanced from answering questions to completing tasks in 2025, though they still fail roughly one in three attempts on structured benchmarks.”

Benchmark	What it measures	Trajectory	Latest top score	Human baseline / notes
GAIA (Meta, May 2024)	Multistep real-world assistant questions — web browsing, file handling, reasoning across sources	~20% (Jan 2025) → 74.5% (Sep 2025)	74.5%	92% baseline — ~17.5pp gap remains
OSWorld	Multimodal agents on 369 real desktop/web tasks across Ubuntu, Windows, macOS	Historically 1%-12% → 66.3% (2025), Claude Opus 4.5 leading	66.3%	72.35% baseline — ~6pp gap, one of the fastest-closing in the chapter; CS students solve ~72% of tasks with a ~2-min median time
WebArena	812 long-horizon web tasks; success verified against resulting site state (databases, page content, URLs), not action traces	~15% (2023) → 74.3% (early 2026)	74.3%	78.24% baseline — ~4pp gap, the smallest of any agent benchmark in this section
MLE-bench	75 curated Kaggle competitions, rebuilt splits + reimplemented grading code, scored against real leaderboards/medal thresholds	~17% (2024) → 64.4% (early 2026)	64.4%	No single baseline (varies by competition); competition-style problems are more structured than open-ended real data science
CyBench	40 professional-level CTF cybersecurity tasks across 6 categories (cryptography, web security, reverse engineering, forensics, exploitation, and miscellaneous); “first solve time” 2 min to ~25 hrs	15% (2024) → 93% unguided solve rate (2026)	93%	Steepest improvement rate of any benchmark in the chapter; may indicate CTF-style tasks are a strong current fit for agents
tau-bench	Multiturn chat + external tool/API calling in realistic domains (retail, airline) with policy constraints; pass@1, verified against final database state	Leading models: 62.9%-70.2% pass@1 (top 7 span just 7.3pp)	Claude Opus 4.5, 70.2%	No model exceeds 71% — multiturn tool use + policy-following remains hard even for frontier models

tau-bench’s top-seven spread in full (Figure 2.6.6): Claude Sonnet 4.5 62.9%, GLM-5 63.2%, Gemini 3 Pro 65.8%, Gemini 3 Flash 67.8%, Qwen3.5-397B-A17B 68.4%, GPT-5.2 69.9%, Claude Opus 4.5 70.2%.
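The gap figures in the table and the tau-bench spread are simple differences, and the sketch below recomputes them from the scores quoted above so they can be checked or updated. Variable names are illustrative; only the three benchmarks with a single stated human baseline are included.

```python
# (top model score, human baseline), both in percent, from the table above.
agent_scores = {
    "GAIA": (74.5, 92.0),
    "OSWorld": (66.3, 72.35),
    "WebArena": (74.3, 78.24),
}

# Percentage-point gap still separating the top model from the human baseline.
gap_pp = {name: round(human - top, 2) for name, (top, human) in agent_scores.items()}
smallest_gap = min(gap_pp, key=gap_pp.get)

# tau-bench pass@1 for the seven leading models (Figure 2.6.6).
tau_bench_pass1 = [62.9, 63.2, 65.8, 67.8, 68.4, 69.9, 70.2]
tau_spread_pp = round(max(tau_bench_pass1) - min(tau_bench_pass1), 1)

for name, gap in gap_pp.items():
    print(f"{name}: {gap} pp below the human baseline")
print(f"smallest gap: {smallest_gap}")
print(f"tau-bench top-seven spread: {tau_spread_pp} pp")
```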

Robotics and the Sim-to-Real Gap

RLBench — controlled-simulation manipulation

RLBench standardizes 18 manipulation tasks (picking up objects, stacking items, operating simple mechanisms) with 100 demonstrations per task. As of January 2026, EquAct leads at 89.4% average success, up from prior leader SAM2Act’s 86.8% (Figure 2.7.1); EquAct also reports stronger results under a harder evaluation setting that introduces full 3D rotational variation, where prior methods degrade. Progress has been consistent — roughly 48% (2022) to nearly 90% (2025) — but the report is explicit that these are “relatively short-horizon tasks in a controlled simulation environment.”

BEHAVIOR-1K — long-horizon household tasks, simulated

BEHAVIOR-1K’s 1,000 realistic activities come directly from surveys asking people what household tasks they want robots to help with — long-horizon mobile manipulation in simulated home environments. The 2025 BEHAVIOR Challenge results (Figure 2.7.2) show how hard this remains: the top team, Robot Learning Collective, reached a 25.99% Q-score (completing roughly a quarter of required task objectives at acceptable quality) and just a 12.4% full-task-success rate. The next four teams score lower on both metrics: Comet (25.14% / 11.4%), SimpleAI Robot (15.91% / 10.8%), The North Star (12.04% / 7.6%), Embodied Intelligence (9.47% / 5.2%).

ResponsibleRobotBench — completion and safety

Most robotics benchmarks measure task completion alone. ResponsibleRobotBench instead requires both task completion and safety across 23 multi-stage tasks involving electrical, fire/chemical, and human-related hazards. It is scored as a safe success rate (SSR): a task counts as successful only when both completion and safety conditions are met. GPT-4o achieves the best safe score at 0.64, ahead of GPT-4o mini (0.40) and the strongest open-source model tested, Qwen-72B (0.35); Qwen-7B and InternVL 2.5 4B score 0.21 and 0.12 respectively (Figure 2.7.3). Even the top model fails to complete more than a third of tasks safely, with frequent failures when both task completion and safety must be satisfied simultaneously.
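The difference between a raw completion rate and a safe success rate can be shown with a toy scorer. This is a minimal sketch of the "both conditions must hold" rule described above, not the benchmark's actual scoring code; the `Episode` type and the five sample episodes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    completed: bool          # did the robot finish the task?
    hazard_violations: int   # number of safety violations observed


def completion_rate(episodes: list[Episode]) -> float:
    """Share of episodes where the task was completed, ignoring safety."""
    return sum(e.completed for e in episodes) / len(episodes)


def safe_success_rate(episodes: list[Episode]) -> float:
    """Share of episodes completed with zero hazard violations."""
    return sum(e.completed and e.hazard_violations == 0 for e in episodes) / len(episodes)


# Hypothetical run: four of five tasks completed, one of them unsafely.
episodes = [
    Episode(True, 0), Episode(True, 0), Episode(True, 1),
    Episode(True, 0), Episode(False, 0),
]

print(f"completion rate:   {completion_rate(episodes):.2f}")
print(f"safe success rate: {safe_success_rate(episodes):.2f}")
```

Because every unsafe completion is counted as a failure, the safe success rate can never exceed the completion rate, which is why a completion-only benchmark tends to look more favorable.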

Highlight: Humanoid Robotics — hardware and investment outpacing deployment

The field grew significantly in hardware availability and platform variety through 2025 — Figure 2.7.4 tracks 25 companies across 11 countries (Canada, China ×5, Germany, India ×2, Israel, Japan ×4, Norway, South Korea ×2, UAE, UK ×2, US ×5) — but the report’s own read is that “the strongest signals came from early-stage industrial pilot projects and manufacturing-scale ambitions rather than widespread deployment”:

Figure AI’s Figure 02 spent 11 months on the line at a BMW plant in South Carolina — 1,250+ runtime hours, 90,000+ parts loaded across 30,000+ vehicles.
China (Unitree, AgiBot) is pushing prices down and volumes up, framing humanoids as quasi-consumer hardware: Unitree’s R1 starts at $4,900, G1 at $13,500 with advanced perception; AgiBot runs ~100 teleoperated humanoids up to 17 hrs/day and has manufactured ~10,000 units.
Norway’s 1X (backed by OpenAI) opened a waitlist for its NEO household robot at ~$20,000 (or $499/month) for 2026 US deliveries.
Other named platforms: Sanctuary AI’s Phoenix (Canada, commercial pilots — “hundreds of commercial pilot tasks completed”), UBTECH’s Walker S/S2 (China, LLM-integrated planning + autonomous battery swapping), Neura Robotics’ 4NE-1 (Germany, artificial skin for safe human collaboration), Boston Dynamics’ Atlas (US, locomotion/manipulation research testbed), Tesla’s Optimus Gen 3 (US, internal logistics, plans for external sales by 2027), and Skild AI’s foundation-model stack (US, designed to work across multiple robot bodies).
The report’s bottom line: “most company milestones are framed in the future tense, along with delivery timelines; intended use cases are offered in place of verified operational data. It remains unclear whether the demand for humanoid robots will match the supply currently being built.”

Highlight: Physical AI and Foundation Models for Robotics

For AI to act usefully in physical space it must perceive its surroundings, reason about how objects behave, and act through a body — the chapter’s benchmarks that require exactly this (RLBench, BEHAVIOR-1K, ResponsibleRobotBench) are consistently its hardest. Vision-language-action (VLA) models replace the traditional perceive/plan/act pipeline with a single network running directly from camera input and language instructions to motor control: Physical Intelligence’s π₀ (2024) and π0.6 (2025) demonstrate cross-platform tasks like laundry folding without task-specific retraining; Nvidia’s GR00T models and Gemini Robotics pursue the same direction, training single models that control different robots across tasks. The chapter names data as the binding constraint — every unit of robot training data requires either a physical robot performing the task or a high-fidelity simulation, both slow and expensive. World Foundation Models (Nvidia’s Cosmos is the report’s example) generate synthetic physics data to train around that bottleneck, but “VLA technology remains at the research stage, and the gap between what these models can do in a controlled setting and what they can handle in the real world is still wide.”

Self-Driving Cars — the sim-to-real success counterexample

Where household-robot deployment stalls near 12% real-task success, self-driving is the chapter’s example of physical-world AI actually crossing into mass-scale deployment in 2025.^[inferred: framing self-driving as the sim-to-real “success” bookend to the household-robot “struggle” story is this article’s synthesis of two separately reported facts — the report itself does not explicitly pair them in a single sentence, though Chapter Highlight #11 (self-driving) immediately follows #10 (household robots) in the printed numbered list]

Waymo operated roughly 2,500 fully autonomous robotaxis across five US cities (Phoenix, San Francisco, Los Angeles, Austin, Atlanta) by late 2025, recording ~450,000 weekly trips. In California alone, weekly paid trips climbed from near zero in mid-2023 to ~283,880 by late 2025 (Figure 2.7.5), with sharp growth after February 2025. Zoox began appearing in California pilot-trip data in late 2025 (Figure 2.7.6).
China: Baidu’s Apollo Go provided ~11 million fully driverless rides in 2025 — a 175% year-over-year increase, up from 1.5 million trips in 2022 (Figure 2.7.7).
Europe: operators (Mobileye, Vay, Wayve) are active, but comparable deployment data is not publicly available, “limiting the global picture.”
Caveat the report states directly: “deployments so far are in areas with generally favorable weather and humans are available off-site to take over when necessary.”

Technical innovations: benchmarks are consolidating around end-to-end driving leaderboards (Waymo’s 2025 Open Dataset Challenges, emphasizing vision-based approaches and long-tail generalization); Nvidia’s PhysicalAI Autonomous Vehicles dataset adds multicamera, lidar, and radar data across varied weather, geography, and rare events; combined reasoning-and-action models such as Alpamayo 1 (a VLA) target both trajectory quality and interpretable reasoning under real driving’s safety and latency constraints; multimodal reasoning benchmarks are shifting toward multiview spatial reasoning and step-by-step driving logic rather than final-answer accuracy alone; world models and reinforcement learning are moving autonomous driving beyond imitation-only training. Driving-data volume has grown from single-digit hours in early benchmarks (2012-2019: KITTI, nuScenes, Argoverse v1/v2, CARLA) to roughly 500 hours (Waymo Open Dataset, 2019) into the 1,300-1,800-hour range by 2024-2025 (nuPlan; Nvidia’s Physical AI-AV) (Figure 2.7.8) — though the report cautions this is a volume trend only, since “a dataset of simulated driving is not the same as one captured from real cars on real roads, even if both report the same number of hours.”

Safety: NHTSA’s Standing General Order (General Order on Crash Reporting), first issued 2021 and amended in 2021, 2023, and 2025, mandates that manufacturers and operators report certain crashes involving automated driving systems (ADS) or SAE Level 2 driver-assistance systems. Monthly reported ADS incidents have generally trended upward since mid-2021 — from roughly 10-25 per month in early years to frequently exceeding 80 per month in late 2024 and 2025. Waymo accounts for the largest share of reported incidents, which the report attributes to its much larger deployment footprint; other operators (Ford, May Mobility, Transdev Alternative Services) report lower and more stable incident counts.

Try It

For capability-trajectory decks: pair the SWE-bench (60%→~100%), Terminal-Bench (20%→77.3%), and CyBench (15%→93%) trajectories as three separately-sourced confirmations of the same “the frontier moves faster than benchmarks can stay hard” story — Chapter Highlight #1’s own framing, backed by Humanity’s Last Exam’s +30pp jump in a single year.
For “AI is powerful but unreliable” stakeholder framing: the jagged-frontier catalog above gives eight ready-made pairings beyond the canonical IMO-vs-clock example — pick the one closest to the audience’s domain (PlanBench for ops audiences, ResponsibleRobotBench for safety-sensitive audiences, benchmark invalid-question rates for a technical/skeptical audience).
For agent-benchmark selection when evaluating a vendor’s agent claims: match the benchmark to the task shape using the AI Agents table above — GAIA for general research-assistant tasks, OSWorld/WebArena for GUI/browser agents, MLE-bench for data-science agents, CyBench for security agents, tau-bench for customer-facing tool-calling agents with policy constraints.
For robotics/physical-AI timeline conversations: use self-driving cars (mass-scale 2025 deployment) as the “this is what physical-world AI success looks like after a decade of real+simulated data investment” counterpoint to humanoid robotics and household-task robots, both still pre-deployment as of this report.

Open Questions

Pages 81-92 of Chapter 2 (§§2.2-2.3, between “Overall Performance Trends” and “Reasoning” — likely covering multimodal understanding and/or a distinct coding-benchmark subsection given the chapter’s table of contents, which lists LegalBench at p.111 just before §2.6) were out of scope for this deep-dive per the assigned page ranges and are not covered here.
Figure 2.7.9 (monthly ADS-incident counts referenced in the Safety subsection’s prose) falls just past the read page range — the trend direction (10-25/month in early years to >80/month in late 2024-2025) is captured from body text, but the chart’s exact monthly values were not verified.
The Vibe Code Bench chart (Figure 2.5.3) labels one bar “GPT 5.2 Codex” (37.91%) where the surrounding prose references “GPT 5.3 Codex (41.4%).” Both figures are reproduced above as printed in the report; whether this reflects two distinct real model variants or a labeling inconsistency in the source document was not resolved.
