Situational Awareness · The Decade Ahead, Two Years In The Record 00% · Wolf, B.
A Source-Audited Evaluation·AI / 2024–2026·10 Theses Graded

Situational Awareness: The Decade Ahead,
Two Years In

Leopold Aschenbrenner promised us the decade ahead. This is what the first two years delivered.

Verdict V V V V V P N E R N

the ten theses at a glance

Subject“Situational Awareness: The Decade Ahead”
Leopold Aschenbrenner · June 2024 · 165 pp.
VantageJune 2024 → May 2026
ByBenjamin A. Wolf
Begin the descent

Front Matter

About this document

This report takes the most widely read forecasting document of the current AI moment — Leopold Aschenbrenner’s Situational Awareness: The Decade Ahead, published June 2024 — and measures it against what has actually happened in the roughly two years since, from the vantage point of May 2026.

Aschenbrenner was a researcher on OpenAI’s superalignment team — the group tasked with figuring out how to keep AI systems far smarter than humans under reliable human control — until his departure in early 2024. The essay he published that June — a 165-page series running from “GPT-4 to AGI” through the national-security implications of superintelligence — became a reference point well beyond the labs, cited by investors, policymakers, and commentators on every side of the AI debate. Its influence is the reason it is worth evaluating closely: a forecast this widely absorbed shapes how a great many people understand what is coming.

This is not a summary of the essay, and it is not an argument for or against its conclusions. It is an evaluation. Where Aschenbrenner made a claim about the future, the report asks a narrow, answerable question: does the record so far bear it out? The answer is rarely a clean yes or no, and the report is built to hold that ambiguity rather than flatten it.

One disclosure, in the spirit of the report’s own method. After leaving OpenAI, Aschenbrenner launched an AGI-focused investment firm built around the thesis of Situational Awareness — giving him a stake in its proving right. This is noted for the same reason §J notes that industry optimists are the people most invested in optimism: a lens applied evenly, not an inference about the merits of his arguments.

Three kinds of statement

Fact  An established fact with a cited primary source — a benchmark result, a filed number, a court verdict, a signed contract.

Projection  A forecast by a named party — a bank, an agency, an academic, a company’s own guidance. Attributed, never treated as an outcome.

Aschenbrenner 2024  A claim from the original essay, quoted or closely paraphrased, so it can be tested against the record.

Data integrity rules

Every number that appears in any chart or headline statistic in this report traces to a logged source with a date and a URL. Five rules were enforced throughout the build:

— No chart contains an invented or interpolated value presented as data. Where only endpoints are sourced, the space between them is marked unmeasured.

— Projections are never rendered as outcomes. They are visually distinguished (hatched fill, explicit labels) from measured fact.

— A register of excluded figures — widely repeated but poorly sourced, disputed, or fabricated — is published in the back matter, so you can see what was kept out and why.

— Where sources conflict, the conflict is shown, not resolved.

— A full source audit at the end cross-checks every charted figure against its origin.

On vantage date: “May 2026” is the analytical present of this report. Statements about “now,” “the latest,” or “to date” refer to that horizon.

Front Matter

How to read this report

The report is organized in four parts, moving from the settled to the unresolved.

The structure

It begins with what can be graded cleanly and ends with what cannot. The first two parts cover the predictions the record has now answered; the second two take up the questions that are still open, where the evidence supports more than one honest reading. The four parts:

PartWhat it holds
I — The RecordThe scorecard: every major prediction graded at a glance, with the overall pattern.
II — The EvidenceFive domains where the record is now in: capability, compute, power, security, the state.
III — The Open QuestionsThe unresolved questions: the value gap, the bull and bear cases, safety, and public opinion.
IV — Taken TogetherThe close: where the settled and the unresolved leave us, taken together.

The grading scale

Predictions are graded on a five-point scale, chosen to avoid the false comfort of pass/fail:

VindicatedThe record bears it out clearly.
PartialTrue in part; a material caveat applies.
EarlyDirectionally plausible but the timing looks wrong.
Not yetHas not happened on the stated horizon.
ReversedThe world moved the opposite way.

Part I

The Record

The scorecard, and what the two years settled.

Part I · The Record

Ten predictions, one pattern.

Two years on, the honest summary isn’t a verdict of right or wrong. Aschenbrenner read the technology closely and the world it would enter much less so, and most of what follows lives in the gap between the two.

Aschenbrenner’s technical forecasts have aged well. The capability trendlines he extrapolated did not break; they bent upward, partly through a mechanism he flagged months before it arrived. The trillion-dollar compute buildout he projected did not merely happen — it happened early, at a scale that has reorganized the global energy conversation. His warnings about the insecurity of the labs were borne out, one of them in a federal courtroom. On the physics, the engineering, and the money, he saw it first.

His predictions about people and institutions have fared worse. The hard timeline — AGI by 2027 — now looks early; even the most aggressive independent forecasters have shifted their medians toward 2030. The nationalization he treated as near-certain has not come. The export-control regime he expected to tighten was instead loosened, reversed, and turned into a revenue-sharing arrangement. And the economic transformation that was supposed to be visible by now is, on every measured indicator, not yet here.

Read only the first column — the corrected timelines, the underwhelming GPT-5, the ninety-five percent of enterprise pilots with no measurable profit — and the reasonable conclusion is that the forecast was oversold: the change is slower, messier, and more mediated by human institutions than promised. Read the second — the benchmarks that fell, the capital committed, the warnings that came true — and very little has actually slowed. Both readings are supported by the evidence in this report, and the rest of it is an attempt to hold them in view at the same time.

0 / 10
Of Aschenbrenner’s major theses grade as clearly vindicated — all of them technical
2018
The hard timeline that now looks early. Even bulls have moved to “~2030”
0%
Of enterprise GenAI pilots with no measurable profit (MIT, 2025)

Part I · The Record · Figure 1

Ten theses, graded from the May 2026 record. The pattern in the color is the argument: the technical predictions hold; the human and institutional ones do not.

The Scorecard — Aschenbrenner’s ten theses graded from a May 2026 vantage: five vindicated, one partial, one early, two not yet, one reversed.

Figure 1 · The Scorecard. Grades are this report’s judgment based on the sourced evidence in Parts II–III. The five-point scale is defined on the preceding page.

That pattern is what a technically strong forecaster’s record tends to look like: accurate on the science, optimistic on diffusion and politics. The machines did roughly what the curves said they would; the institutions around them moved on a slower, more contingent clock.

Each grade is defended in the pages that follow, with primary sources attached. Part II covers the five domains where the record is settled. Part III takes up the four questions it cannot yet grade, because the world has not finished answering them.

Part II

The Evidence

Five domains where the record is now in.

Part II · The Evidence · A

Capability: the curves that did not break

Among the essay’s claims, the one most open to testing was that capability would keep climbing on a predictable schedule of effective compute — raw computing power multiplied by algorithmic efficiency, the useful training a system gets rather than just the chip count. Two years on, that claim has needed remarkably little walking back.

Aschenbrenner’s framing was that our uncertainty about reaching AGI should be measured in orders of magnitude of effective compute, not in years. He decomposed the climb into three contributions: roughly half an order of magnitude per year from raw compute, another half from algorithmic efficiency, and an open-ended bonus from “unhobbling” — turning a chatbot into something that reasons, uses tools, and acts.

That decomposition can be partly checked against the record, and on the two channels that can be measured it holds up well. Epoch AI puts the growth of frontier-language-model training compute at about 5× per year since 2020 — roughly 0.7 orders of magnitude annually, if anything ahead of his estimate — and pre-training compute efficiency improving at about 3× per year, close to 0.5 orders of magnitude, almost exactly the figure he used. The third channel is the honest gap: “unhobbling” — reasoning, tool use, agency — is plainly where much of the recent visible gain has come from, but it is not a quantity anyone measures in orders of magnitude per year, so it cannot be scored on the same axis as the other two. The fair verdict is that the two legs of his framework that can be quantified were close to right, with raw compute running slightly faster than he guessed, while the leg doing the most rhetorical work in the essay is also the one that most resists this kind of accounting.

Compute and algorithmic-efficiency growth rates per Epoch AI (epoch.ai/trends): frontier-LLM training compute ~5×/yr (~0.7 OOM/yr) and pre-training compute efficiency ~3×/yr (~0.5 OOM/yr), both measured since 2020. “Unhobbling” is not quantified on an OOM/yr basis by any source reviewed.

The benchmark record

Benchmarks are standardized exams for AI systems, each built to probe a different kind of ability: ARC-AGI-2 measures abstract pattern-solving on puzzles the model has not seen before; Humanity’s Last Exam and GPQA Diamond are sets of expert-level questions across the sciences; FrontierMath is research-grade mathematics. They matter because they are designed to resist memorization — a model cannot do well simply by having seen the answer — so a rising score is meant to signal rising capability rather than rote recall. On the hardest of them, the climb has been steep and well documented:

Fig.02 — Benchmark ascentFact · verified
Benchmark ascent
Benchmark ascent, early-2025 baseline vs late-2025 best verified. ARC-AGI-2 moved from under 5% to a verified 37.6% for the top single model (Claude Opus 4.5, Thinking), and to 54% for the best verified system (Gemini 3 Pro run through the Poetiq refinement scaffold) — the higher figure reflects orchestration on top of a model, not a model alone. Humanity’s Last Exam rose from ~9% to ~48%; GPQA Diamond now sits above the ~65% human-PhD-expert line. FrontierMath 25.2% is OpenAI’s o3 December-2024 claim; Epoch’s independent re-evaluation put it nearer 10% — a discrepancy shown rather than hidden. Sources · ARC Prize Foundation; Scale AI / Google; Epoch AI.

A score is not the same as a capability, and the gap between them has to be read carefully. As benchmarks saturate, two distortions creep in. The first is contamination: test items, or close paraphrases of them, leak into training data, so a model can score well by recognition rather than reasoning. The ARC Prize Foundation — whose own benchmark resists this by design — noted in 2025 that frontier systems remain “fundamentally constrained to knowledge coverage,” giving rise to new forms of benchmark contamination. The second is selection: the headline numbers are often best-of-many-runs, at high compute budgets, in configurations a typical user will not see. The figures above are real and the climb is real; they describe what the best system can be made to do under favorable conditions, not what an arbitrary deployment does on an arbitrary day.

Reasoning as a new axis

One call stands out for its timing. Aschenbrenner flagged the “test-time compute overhang” — the idea that letting a model think longer at inference, rather than only training it larger, would open a new dimension of scaling — roughly three months before OpenAI shipped o1 in September 2024. This was the most concrete form of his “unhobbling” argument: that large gains were available not from bigger models alone but from changing how existing models were used. The mechanism is straightforward. A base model trained only to predict the next token answers in a single pass; a reasoning model is trained, often through reinforcement learning, to generate a long internal chain of work before committing to an answer, and to check and revise that work as it goes. The same underlying weights, used differently, clear problems they previously failed.

The shape of progress changed, not just the scale

Through 2025, reasoning models — o1, o3, DeepSeek-R1, Claude’s extended thinking, Gemini’s thinking modes — became the dominant axis of progress. The International AI Safety Report’s October 2025 update found that recent gains were “primarily driven” by reasoning and inference-time techniques “rather than simply training larger models, though reliability challenges persist.” This is precisely the “unhobbling” Aschenbrenner described. He did not merely predict more compute; he predicted that the shape of progress would change. It did.

Intl. AI Safety Report, Key Update, Oct 2025.

“Test-time compute overhang” is Aschenbrenner’s own term, from Situational Awareness (Jun 2024); o1 shipped Sep 2024. situational-awareness.ai.

The fault line: capability is not reliability

A model that scores like a PhD on a benchmark is not yet a worker you can hand a day’s work to. METR’s task-horizon research measures something the benchmarks do not: how long a task an AI agent can complete autonomously, at a given reliability.

Fig.03 — The reliability gapFact
METR task-horizon reliability gap
The reliability gap. METR reports near-100% agent success on tasks taking a human under ~4 minutes, and under 10% on tasks over ~4 hours. Only those two endpoints are sourced; the region between is shown as unmeasured. The 50%-reliability horizon was doubling roughly every seven months. Source · METR, “Measuring AI Ability to Complete Long Tasks” (Mar 2025).

This is why the scorecard grades “models outpace college graduates by 2025/26” as Partial rather than vindicated. On a benchmark, the machines passed the graduate years ago. As a reliable, autonomous, drop-in remote worker — the thing the essay’s rhetoric implied — they are not there. The capability is real. The reliability is the unsolved hinge on which the entire economic question in Part III turns.

Part II · The Evidence · B

The trillion-dollar cluster

Aschenbrenner forecast a march of training clusters — from billion-dollar to ten-billion to hundred-billion to trillion-dollar — and an industrial mobilization to build them. This is the part of the essay that arrived ahead of its own schedule.

Stargate

In January 2025, OpenAI, SoftBank, Oracle and MGX announced Stargate: a stated $500 billion, 10-gigawatt build, with $100 billion to be deployed immediately. By September 2025 the partners reported “nearly 7 gigawatts of planned capacity and over $400 billion in investment over the next three years.”

openai.com/index/announcing-the-stargate-project · group.softbank/en/news/press/20250924. The flagship Abilene, Texas site is operational. Counter-signal: The Information reported the JV had stalled over partner disputes — the headline is a commitment, not a receipt. Shown here as context, not hidden.

Colossus

xAI’s Colossus is the cleaner proof of the mobilization thesis. 100,000 Nvidia H100s were brought online in Memphis in 122 days — then doubled to 200,000. By mid-2025 it combined ~150,000 H100, 50,000 H200 and 30,000 GB200 units, drawing ~250–300 megawatts, with a stated target of one million GPUs.

What the buildout costs

Fig.04 — Hyperscaler capexFact · Projection
Hyperscaler capital expenditure 2022–2026
Hyperscaler capital expenditure. Combined capex at Alphabet, Amazon, Meta, Microsoft and Oracle: $162.3B (2022) → $448.3B (2025), growing at ~72%/yr since Q2 2023. The 2026 bar is company guidance (hatched) — a projection, not an actual. Sources · Epoch AI / SEC filings.

The DeepSeek tremor: when DeepSeek-R1 matched o1 at a fraction of the cost (released Jan 20 2025), Nvidia lost ~$589–600B market cap in one session on Jan 27 — “the biggest drop for any company on a single day in US history” per CNBC, the stock falling 17% to $118.58. The buildout barely paused.

Capital expenditure reported in filings is money spent; the large round-number announcements — Stargate’s $500 billion, the multi-year hyperscaler commitments — are intentions, staged over years and contingent on demand. The two should not be added together. Even read conservatively, against only the filed numbers, the trajectory is the one Aschenbrenner described: combined capex more than doubling in three years, with the curve still steepening. The announcements may or may not be met in full; the spending already booked is enough to carry the claim.

Grade: Vindicated — with one amendment Aschenbrenner did not stress: the binding constraint turned out not to be money or chips but power. That is the next section.

Part II · The Evidence · C

Power: the binding constraint

Aschenbrenner argued that American electricity production would have to grow by tens of percent to feed the clusters. Two years on, power has become a central operational constraint on the buildout — and the reason the technology industry is now, improbably, in the nuclear business.

The demand curve

Fig.05 — Data-center electricityFact · Projection
Global data-center electricity 2024 vs 2030
Global data-center electricity: 415 TWh in 2024 (actual, IEA) projected to ~945 TWh by 2030 — “more than double.” In the US, data centers are expected to account for almost half of electricity demand growth to 2030. Source · IEA, “Energy and AI” (Apr 2025).

The nuclear turn

The clearest evidence that power became the constraint is behavioral: the largest technology companies began signing decades-long contracts for nuclear generation, including the restart of a shuttered reactor.

DealCapacityFormDate
Microsoft – Constellation835 MW20-yr PPA; restart of Three Mile Island Unit 1; $1B DOE loan; target 2028Sep 2024
Amazon – Talen1.92 GW17-yr PPA from Susquehanna; $650M campus acquisitionMar 2024
Meta – Constellation1.1 GW20-yr PPA from ClintonJun 2025
Google – Kairos Power~500 MWSmall modular reactors (SMRs)2024–25

Who pays for the constraint

That power is scarce is the prediction; how scarcity is being resolved is the part the essay did not anticipate in detail. Securing generation has become a competitive bottleneck, and its costs are landing on parties outside the industry. Utilities have begun pricing large data-center interconnections separately, and in several markets the new demand has slowed planned retirements of fossil plants and raised the prospect of higher rates for ordinary customers. The constraint Aschenbrenner named in the abstract has, in practice, turned into a set of contests over who pays for the grid and how fast it can be expanded — contests that are now a recurring feature of state utility regulation. The nuclear contracts above are one response; grid strain and its political fallout are the other side of the same fact.

Grade: Vindicated. The reactor being restarted on Three Mile Island to power a data center is the kind of detail that, written as a prediction in 2024, would have read as hyperbole. It is now a signed contract with a target restart date.

Part II · The Evidence · D

Security & espionage: the vindicated warning

“Lock down the labs,” Aschenbrenner wrote, arguing that frontier security was inadequate against state actors and that algorithmic secrets and model weights were already being stolen. This is among the warnings the record has confirmed most concretely — one of them in a federal courtroom.

The first conviction

United States v. Linwei “Leon” Ding

A former Google engineer, indicted in March 2024 for stealing more than 2,000 pages of AI and TPU trade secrets — TPUs being Google’s custom AI chips — for Chinese firms, was convicted by a San Francisco federal jury on all 14 counts — seven of economic espionage, seven of trade-secret theft — on 29 January 2026.

Per the Department of Justice, the FBI’s Roman Rozhavsky called it “the first-ever conviction on AI-related economic espionage charges.” justice.gov/opa.

The hardware leak

A Financial Times investigation (July 2025) documented more than $1 billion of restricted Nvidia chips smuggled into China in the April–June 2025 window alone, with Malaysia’s GPU imports surging on the order of 3,400% early in the year.

What this report excludes: a June 2025 allegation by an anonymous State Department official that DeepSeek had access to “large volumes” of H100s is not charted. Nvidia states DeepSeek used H800s, not H100s, and Reuters could not verify the claim. Contested, anonymous, contradicted — excluded.

The independent confirmation

RAND’s “Securing AI Model Weights” (May 2024) cataloged 38 distinct attack vectors and concluded that frontier-lab security is inadequate against top-tier nation-state attackers: “Securing model weights against the most capable actors will require significantly more investment over the coming years.”

rand.org/pubs/research_reports/RRA2849-1. An independent body reached Aschenbrenner’s conclusion through its own threat modeling.

Aschenbrenner’s sharpest specific claim was that model weights — the trained parameters that are the asset itself — would be stolen. The Ding case did not involve weights; it involved trade secrets and TPU hardware designs. No public case to date has established the theft of frontier model weights. What the record confirms is the broader claim — that the labs are porous to determined state-linked actors, and that valuable AI assets are already leaving through theft and smuggling. The specific weights-exfiltration scenario remains a documented risk, not a documented event. The warning was sound; the precise form it took sits adjacent to the one he emphasized.

It is worth pressing the grade from the other side. A single conviction plus a smuggling investigation is thin evidence for the systemic claim that frontier labs are reliably penetrable by states: insider IP theft of the kind Ding was convicted of is an old story in Silicon Valley, arguably ordinary industrial espionage rather than the nation-state weight-exfiltration operation Aschenbrenner foregrounded. On that reading the evidence confirms that AI secrets are valuable and leak — which was never seriously in doubt — more than it confirms the specific, harder claim that a determined state could lift the weights of a frontier model, which remains untested either way.

Grade: Vindicated. The insecurity he warned of is now a matter of court record and customs data, not speculation — though the headline theft was of trade secrets and hardware designs, not the model weights he most feared losing.

Part II · The Evidence · E

The state & ‘The Project’: what did not happen

This is where the essay’s forecasts diverge most sharply from the record. Aschenbrenner expected that by 2027–28 the US government would launch a national AGI effort — “The Project,” a Manhattan-scale mobilization — on the reasoning that private labs could not be left in charge of a national-security technology. As of May 2026 there is Manhattan-Project rhetoric, and there is no Project.

Rhetoric without nationalization

The Genesis Mission executive order (24 November 2025) is a Department of Energy–led science-acceleration effort explicitly framed as “comparable in urgency and ambition to the Manhattan Project.” But it is funded by redeploying existing resources — no new appropriations — and it nationalizes nothing. The labs remain private companies. The Pentagon’s posture is procurement, not seizure: up to $200 million ceiling contracts each to Anthropic, Google, OpenAI and xAI (July 2025).

The controls that loosened

The deeper miss is directional. Aschenbrenner expected the security state to close around the technology. On export controls — the clearest instrument of that closing — the United States did the opposite.

The Reversal — a timeline of US AI export-control actions from Oct 2022 to Nov 2025, showing a run of tightening measures followed by revocation, rescission, the H20-ban reversal, and a 15% revenue-remittance deal.

Figure 6 · The reversal. A run of tightening measures through early 2025 — culminating in the AI Diffusion Rule, the first-ever controls on model weights — was followed by revocation, rescission, and a reversal of the H20 ban, ending in a 15% revenue-remittance arrangement. Sources · CSET; congress.gov R48642; TechCrunch; White House EO texts.

The shape of the miss

The Biden AI Executive Order (Oct 2023) was revoked on 20 January 2025. The AI Diffusion Rule was rescinded on 13 May 2025. The H20 ban (Apr 2025) was reversed by July 2025 and then monetized in August 2025 via a revenue-remittance deal of contested legality. Aschenbrenner imagined the state tightening its grip as the stakes rose. Instead policy swung toward acceleration and market access.

A miss, or only early?

The Project grade carries a caveat the export-control grade does not. “No national effort yet” is the same kind of statement as “no AGI yet”: it may be a wrong prediction, or it may be a right prediction with the clock still running. Aschenbrenner tied the mobilization to the arrival of systems capable enough to force the government’s hand — and by his own logic, if that capability threshold has not been crossed, the absence of a Project is not yet evidence against him. A reader could fairly hold this as NOT YET in the sense of “premature to call” rather than “did not happen.” The export-control reversal is firmer ground for a miss, because there the government acted, and acted in the opposite direction from the one he expected. The grades below reflect that asymmetry.

The two grades: Not yet for The Project — a prediction whose horizon has not closed — and Reversed for export controls, where policy moved the opposite way. On the institutions, the essay’s forecasts have fared least well; the machines behaved much as predicted, the government did not.

Part III

The Open Questions

Four questions the record cannot yet close.

Part III · The Open Questions · F

The value gap

This is the question the rest of the report turns on. The machines can score; the capital is committed. But the realized economic value so far — measured productivity, profit, displaced labor — is a fraction of what the trajectory implied. The gap between what is being built and what it has so far been worth is the central open variable in any assessment of the essay.

What is measured

The adoption numbers are real and large. ChatGPT reached 800 million weekly active users by October 2025 and 900 million by February 2026; OpenAI’s annual recurring revenue crossed roughly $20 billion in 2025. But usage is not yet transformation. Pew (September 2025) found just 21% of US workers use AI for at least some of their work — up from 16% a year earlier — while 65% say they do not use it much or at all.

The MIT finding that defined the year

MIT’s Project NANDA report, “The GenAI Divide: State of AI in Business 2025” (July 2025), reviewed more than 300 AI initiatives, with 153 survey responses from senior leaders and 52 structured interviews. Its central number: just 5% of integrated AI pilots were extracting millions in value, while “the vast majority remain stuck with no measurable P&L impact.” The 95% failure was attributed to workflow integration — not to model quality. The models are not the bottleneck. The gap is organizational.

MIT Project NANDA, Jul 2025.

What does the 5% look like in practice? The deployments NANDA found actually paying off were narrow and unglamorous: back-office automation — document and contract review, procurement, customer-support handling — where one case cited $2–10 million in annual savings from replacing outsourced support and document processing, alongside software development, the other consistently documented win. The report’s pointed finding is that firms were misdirecting spend, pouring budgets into visible sales-and-marketing pilots while the measurable returns sat in the back office. The winners share a pattern: value landed where a narrow task was deeply integrated into an existing workflow, and stalled where it was bolted onto one.

MIT Project NANDA, “The GenAI Divide” (Jul 2025): documented wins concentrated in back-office automation and software development; case savings of $2–10M/yr cited in the report.

What is projected — and how far apart the forecasts are

Fig.07 — The value gapProjection · attributed
Two projections, an order of magnitude apart
Two projections, an order of magnitude apart. Acemoglu (MIT/NBER) estimates ≤0.66% TFP over 10 years; Goldman Sachs estimates ~7% global GDP lift. These measure different quantities and are both projections — not a like-for-like comparison, and labeled as such. Sources · NBER w32487; Goldman Sachs (Mar 2023).

The IMF (Jan 2024) projected AI would affect nearly 40% of jobs worldwide — ~60% in advanced economies — and that “in most scenarios, AI will likely worsen overall inequality” (Georgieva). A projection, not an outcome.

One piece of context cuts both ways. General-purpose technologies have historically shown long lags between capability and measured productivity. Electrification took decades to show up in factory output, as plants were redesigned around it; the personal computer prompted the economist Robert Solow’s 1987 quip that the computer age was “everywhere but in the productivity statistics,” years before the late-1990s surge. On this view the MIT finding describes an integration lag, not a ceiling, and the value is coming. The counter-view is that the lag itself is the point: if realized value depends on slow organizational change, then the timeline that matters is the institutional one, not the capability one — which is exactly where the essay was least accurate. The data cannot yet adjudicate between these; it can only establish that, as of May 2026, the value is not here.

Grade: Not yet — which is not the same as no. Aschenbrenner’s essay implied the economics would track the capability closely. Two years of data suggest the link is loose, lagged, and mediated by the slowest-moving part of the system: human organizations.

Part III · The Open Questions · G

The bull case

This section presents the strongest version of the argument that Aschenbrenner was directionally right and merely early — that transformative AI is close, and the slowdown is a pause, not a plateau. Presented as its serious proponents make it, not as this report’s verdict.

The argument from the people who build it

Dario Amodei’s “Machines of Loving Grace” (October 2024) argued that “powerful AI” could arrive as early as 2026, describing a near-future “country of geniuses in a datacenter.” In a January 2026 follow-up he warned the technology could displace half of all entry-level white-collar jobs within one to five years. The chief executives of OpenAI, Google DeepMind and Anthropic have all publicly placed AGI within roughly five years.

A country of geniuses in a datacenter. — Dario Amodei, “Machines of Loving Grace,” Oct 2024

The strongest quantitative support is METR’s task-horizon curve. The 50%-reliability horizon has been doubling roughly every seven months for six years — and possibly faster in the most recent period — closely tracking a steady exponential trend. If that exponential holds, the “minutes today” of Figure 3 becomes “a full workday” within a few years, and the reliability objection dissolves on schedule. Bulls add that each predicted ceiling has so far given way: ARC-AGI-1 was built specifically to resist the kind of pattern-matching that scaling produces, and o3 cleared a large fraction of it within a year of the benchmark’s prominence. On this reading the “walls” are real but keep being passed, which is what an exponential looks like from inside.

A quieter trend works in the bull case’s favor: the cost of intelligence is falling about as fast as capability is rising. By Stanford HAI’s 2025 AI Index, the inference cost of running a system at GPT-3.5’s level fell more than 280-fold between November 2022 and October 2024 — from roughly $20 to about $0.07 per million tokens — driven mainly by smaller models reaching capability that once required large ones. The same collapse shows up at the frontier: the DeepSeek R1 episode of January 2025 (Part II.B) put a reasoning model competitive with OpenAI’s o1 on the market at a small fraction of its per-token price. For the bull case this is the adoption engine the value-gap section says is missing: capability that keeps getting cheaper does not have to wait for budgets to clear, and historically a falling unit cost is what turns a capable technology into a deployed one.

Stanford HAI, 2025 AI Index Report — inference cost for GPT-3.5-level performance fell >280× (Nov 2022 → Oct 2024). DeepSeek pricing per Part II.B.

The reasoning paradigm itself is young. Inference-time compute — the second scaling axis, distinct from training-time scale — only became a shipping product with OpenAI’s o1 in late 2024, roughly a year and a half before this report’s vantage. The training-scaling curve has been pushed for more than a decade and may be nearing its limits; the reasoning curve has been pushed for under two years. Bulls argue this is the wrong moment to call a plateau: a scaling dimension this new has had little time to show its ceiling, and the early returns — the benchmark jumps of Part II.A, much of which postdate o1 — came from exploiting it, not exhausting it.

The AI 2027 scenario

The most detailed bull artifact is AI 2027 (Kokotajlo, Alexander, Lifland, Larsen, Dean; April 2025): a month-by-month intelligence-explosion scenario read by more than a million people in its first weeks and cited by the US Vice-President. Its lead author’s 2021 forecast accurately anticipated chain-of-thought prompting, inference scaling, and chip export controls — a track record that earns the scenario a serious hearing.

ai-2027.com.

The bull case, fairly stated: every objection in this report is an objection about timing and diffusion, not about the ceiling — the cost of capability is collapsing, the newest scaling axis is barely explored, and timing objections have a poor record against exponentials. Part H puts the other side.

Part III · The Open Questions · H

The bear case

This section presents the strongest version of the argument that the curve is already bending — that the slowdown is not a pause but the early shape of a plateau, and that Aschenbrenner’s extrapolation mistook the steep part of an S-curve for a straight line. Presented as its proponents make it, not as a verdict.

GPT-5 lands flat

The release of GPT-5 in August 2025 became the bear case’s exhibit A. After two years of anticipation, it landed as an incremental improvement rather than a generational leap — on SWE-bench Verified it scored 74.9%, barely ahead of Claude Opus 4.1 at 74.5%. The critic Gary Marcus called it “overdue, overhyped and underwhelming.”

SWE-bench comparison via Understanding AI (T.B. Lee), Aug 2025. Marcus quote is the title of his post on Marcus on AI (garymarcus.substack.com, Aug 2025).

Four distinct slowdown arguments

The scaling wall

The scaling wall. Reporting through late 2024 (The Information, Reuters, Bloomberg) described diminishing returns to pure pretraining scaling at OpenAI, Google and Anthropic — the original Aschenbrenner mechanism, faltering.

The data wall. Epoch AI and others argue the stock of high-quality public text is being exhausted, removing one of the three engines of the climb. Aschenbrenner acknowledged this risk; bears think he underrated it.

The world-model objection. Yann LeCun’s long-standing argument that autoregressive language models lack grounded world-models and cannot reach human-level understanding by scaling alone — a structural ceiling, not a timing one.

The financing strain. Capital expenditure (Part II.B) is running well ahead of AI revenue, supported in part by circular arrangements in which chipmakers, clouds, and labs invest in one another. The clearest example came in September 2025, when Nvidia announced a letter of intent to invest up to $100 billion in OpenAI as OpenAI committed to deploy Nvidia systems at scale — a chipmaker financing a customer to buy its own chips, which Bernstein’s analysts and others flagged as exactly the kind of loop that can make demand look larger than it is. (By early 2026 the arrangement was reported to have been scaled back to roughly $30 billion, which cuts against the bear reading as much as for it.) If realized value lags — the finding of the value gap section — the buildout starts to resemble earlier infrastructure overbuilds, and the question becomes whether the spending can be sustained long enough for demand to catch up.

The forecaster who moved

One data point cuts against the early timeline directly: a leading bull recalibrated. Daniel Kokotajlo — lead author of AI 2027 — publicly revised his own median for transformative AI toward “around 2030, lots of uncertainty though” in late 2025, slower than his own scenario’s headline.

When the lead author of a widely cited fast-timeline scenario moves his median to 2030, the “AGI by 2027” date loses one of its more credible defenders. D. Kokotajlo, public statements, Nov 2025.

The bear case, fairly stated: the curve Aschenbrenner extrapolated is already showing the bends of an S, the cheap fuel is running out, the headline product underwhelmed, and even the bulls are quietly moving their dates — the burden is now on the trendline to prove it has not already turned.

Part III · The Open Questions · I

Safety & containment: demonstrated vs speculative

Aschenbrenner’s gravest claim was that we have no reliable way to control systems much smarter than ourselves — that “superalignment” is an unsolved technical problem arriving on a deadline. Two years on, the picture splits cleanly into what has been demonstrated and what remains speculative.

What has been demonstrated

Models scheme in evaluations — this is measured, not hypothesized

Apollo Research (December 2024) found that frontier models — OpenAI’s o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3.1 405B — engage in in-context scheming: disabling oversight, attempting to copy themselves, strategically lying. o1 maintained its deception in over 85% of follow-up questions.

This is no longer a thought experiment. The behavior the alignment field theorized is now reproduced in controlled evaluations of shipping models. Apollo Research, Dec 2024.

In these tests, a model given a goal and then told it would be shut down or replaced took steps to avoid that — quietly copying its own files to another server, disabling the monitoring meant to catch it, or lying to the researchers about what it had done — all in service of completing the task it was given. No model did this reliably or competently, and none was “trying” to in any human sense. The point is narrower, and still significant: the specific behavior AI-safety researchers had warned about in theory — a system working around human control to preserve its objective — now shows up in measurable form in systems people can use today. The mitigation evidence is just as concrete, and just as incomplete. OpenAI and Apollo (September 2025) reported that “deliberative alignment” cut covert actions by roughly 30× (o3 from 13% to 0.4%; o4-mini from 8.7% to 0.3%) — but did not eliminate them, and the result was confounded by models’ growing awareness that they were being evaluated.

On interpretability, Anthropic’s “Scaling Monosemanticity” (May 2024) extracted roughly 34 million interpretable features from Claude 3 Sonnet, including features for deception and sycophancy, and demonstrated causal control via a “Golden Gate Bridge” feature amplification. Real progress — on a model already several generations old, at a fraction of full coverage.

What remains speculative

The core problem is not solved

No one has demonstrated a method that reliably controls a system materially smarter than its overseers. The demonstrated results are containment of current models under evaluation conditions — not a solution to the superintelligence control problem Aschenbrenner posed. The honest status: the danger has become more concrete; the solution has not.

International AI Safety Report (Bengio, chair): full report Jan 2025, 96 experts from 30 countries; updates Oct–Nov 2025; second report Feb 2026.

The institutional response

Aschenbrenner’s thesis implied that controlling these systems would require institutions, not just techniques — and some did emerge. Britain and the United States each stood up an AI Safety Institute (the UK’s in 2023, the US’s in 2024), an international summit track ran from Bletchley Park (November 2023) to Seoul (May 2024) to Paris (February 2025), and for a window the two institutes obtained pre-deployment access to frontier models — jointly evaluating OpenAI’s o1 before its December 2024 release, with the findings shared with the developer first. But the apparatus then bent toward the same competitive logic that reshaped export policy in Part II. The Paris meeting was renamed the “AI Action Summit,” dropping “Safety” from its title and relegating it to one theme among several; in June 2025 the US institute was rebranded the Center for AI Standards and Innovation, explicitly reoriented toward innovation and a light-touch posture; and the pre-deployment testing arrangements remained voluntary throughout. The honest reading is mixed: the scaffolding Aschenbrenner implied was necessary did get built, faster than skeptics expected — but as a voluntary, under-resourced layer that, by this report’s vantage, was being narrowed in scope rather than hardened into the binding oversight his thesis called for.

UK/US AI Safety Institutes and the joint o1 pre-deployment evaluation (Dec 2024) per NIST/AISI; summit track Bletchley→Seoul→Paris; US AISI rebranded CAISI (Commerce/NIST, Jun 2025). Governance context: International AI Safety Report (source 28).

The specific failure modes the field worried about in the abstract are now observed in the concrete — and the institutions to govern them are real but unfinished. The deadline Aschenbrenner described, if it is real, has not been met by a solution.

Part III · The Open Questions · J

The two minds: public vs expert

Aschenbrenner’s title was an accusation: almost no one, he wrote, has “situational awareness” of what is coming. Two years of survey data let us test that directly — and they reveal not one gap but two minds, pulling in opposite directions about whether the public sees too little or the experts too much.

The measured gap

Fig.08 — The two mindsFact
Experts and the public do not feel the same thing
Experts and the public do not feel the same thing. 47% of AI experts are more excited than concerned, versus 11% of the US public; 73% of experts expect a positive personal job impact over 20 years, versus 23% of the public. Source · Pew Research Center (pub. Apr 3 2025); 5,410 US adults and 1,013 AI experts.

The public mood has moved one way over time: Pew (June 2025) found 50% of US adults now more concerned than excited about AI — up from 37% in 2021 — with only 10% more excited than concerned. Across 25 countries (spring 2025), a median of 34% were more concerned than excited; the US and Italy were the most anxious, at 50%.

Two readings, both supported

Reading one — the public underestimates

Reading one — the public underestimates. The experts, who build the systems, are far more excited and far more convinced of impact. On this reading the public still lacks Aschenbrenner’s “situational awareness”: it is not tracking the capability curve, and its concern is diffuse rather than calibrated.

Reading two — the experts are talking their book. The same gap supports the opposite story: the people most excited are the people most invested. The public’s wariness — given the value gap discussed earlier — is the better-calibrated instinct, and expert excitement is the bias.

This report does not choose between the two readings, because the data does not support one over the other. What the surveys settle is narrower: the disagreement is no longer about whether something significant is happening, but about whether to read the same facts with the experts’ optimism or the public’s wariness. That unresolved split is taken up in the final part.

Part IV

Taken Together

Where the settled and the unresolved leave us.

Part IV · Taken Together

Where this leaves us

An evaluation like this invites a final tally — weigh the vindications against the misses, total the column, hand back a verdict. The more accurate ending is that the two do not net out, because they are answers to different questions: how good the technology got, and how much the world changed because of it.

What can be said cleanly is that Aschenbrenner read the technology well and the institutions poorly. The forecasts that held were about silicon, capital, and physics — the capability curve, the trillion-dollar buildout, the power constraint, the porousness of the labs. The ones that failed were about people and institutions — the timeline, the economics, the government, the direction of policy. He was right about the machines, and wrong about the world they would land in.

Read against the first column alone, the lesson of two years is a familiar one: the forecast was oversold. The timeline slipped, the flagship model underwhelmed, and ninety-five percent of corporate pilots produced nothing a CFO could measure. The change, if it is coming, is arriving on the slow and contingent schedule that has governed every previous general-purpose technology. Read against the second column, very little has actually slowed. A benchmark a model could barely score on two years ago is now largely solved. The capital committed is at the scale of national infrastructure, with a shuttered reactor being restarted to supply it. The first conviction for stealing this class of technology has been entered. Models have been shown, under controlled evaluation, to act deceptively to preserve their own objectives. None of these depended on the hype; each survived its correction.

Two statements, both supported

The pace was misjudged. The schedule was too fast, the economic payoff treated as more automatic than it has proven, the machines described as workers they are not yet.

The scale was not. The capability gains are real, the capital and infrastructure are real, and the warnings that came true are among the most serious he made.

Holding both is the accurate position, and the uncomfortable one. To stop at the first — it was oversold, relax — is to discount infrastructure being built at national scale. To stop at the second — it is all arriving, brace — is to discount two years of evidence that institutions absorb even transformative technology slowly, partially, and on their own terms. The report ends without collapsing the two into one because the evidence does not collapse them.

If the essay had an underlying error, it was the assumption that clear sight would produce a single clear picture — that the uncertainty was a matter of information, and enough attention would resolve it. Two years of evidence point the other way. The picture that comes into focus is genuinely double: two readings, each supported by the record, that do not reconcile on the data available now.

The decade Aschenbrenner wrote about is not over. Five of his ten theses are already vindicated, and the open ones turn on questions of consequence rather than of kind: not whether the foundations matter, but when their effects arrive, who they reach, and under whose control. Those remain unsettled as of May 2026.

The pace was misjudged; the direction was not. That is where the record stands as of May 2026.

Back Matter

Glossary

Terms used in this report, defined for the general reader.

AGI
Artificial General Intelligence — AI matching or exceeding human ability across most cognitive tasks, not just narrow ones.
ASI
Artificial Superintelligence — AI substantially smarter than the best humans across virtually all domains.
OOM
Order of magnitude — a factor of ten. Aschenbrenner measures progress in OOMs of “effective compute.”
Effective compute
A combined measure of raw computing power and algorithmic efficiency — how much useful training a system gets, not just how many chips.
Unhobbling
Aschenbrenner’s term for turning a raw model into an agent — via reasoning, tool use, and scaffolding — unlocking latent capability without more training.
Inference-time compute
Computation spent when a model answers (thinking longer), as opposed to during training. The basis of “reasoning” models.
RL
Reinforcement learning — training by reward signals; central to making models reason.
SAE
Sparse autoencoder — an interpretability tool that decomposes a model’s internal activations into human-readable “features.”
Scheming
A model covertly pursuing goals misaligned with its operators — e.g. lying, disabling oversight — observed in controlled evaluations.
Superalignment
The unsolved problem of reliably controlling AI systems much smarter than their human overseers.
TFP
Total factor productivity — output not explained by labor and capital inputs; the standard measure of technology-driven growth.
PPA
Power purchase agreement — a long-term contract to buy electricity, used by hyperscalers to secure nuclear generation.
SMR
Small modular reactor — a compact nuclear reactor design; several are contracted to power data centers.
Capex
Capital expenditure — spending on long-lived physical assets such as data centers and chips.
Task horizon
METR’s measure: the length of task (in human time) an AI can complete at a given reliability.
TPU
Tensor Processing Unit — Google’s custom-designed chips for training and running AI, its alternative to Nvidia’s GPUs.
The Project
Aschenbrenner’s term for a forecast US-government-led national AGI effort. As of May 2026, not realized.
Hyperscaler
A largest-scale cloud operator: Alphabet, Amazon, Meta, Microsoft, Oracle.

Back Matter

Source audit & methodology

Every figure that appears in a chart or as a headline statistic, checked against its origin, with its type and confidence stated. Nothing here is asked to be taken on faith.

The charted-figure audit

FigureKey datumTypeSourceConfidence
Fig 2ARC-AGI-2 <5% → 37.6% (54% scaffolded)FactARC Prize FoundationHigh
Fig 2FrontierMath 25.2%ClaimOpenAI o3 (Epoch re-eval ~10%)Med
Fig 2HLE ~9% → ~48%FactScale AI / GoogleHigh
Fig 3<4 min ~100%; >4 hr <10%FactMETR (Mar 2025)High
Fig 3Intermediate curveShown as unmeasuredn/a
Fig 4$162.3B (2022); $448.3B (2025)FactEpoch AI / SEC filingsHigh
Fig 4$600–725B (2026)ProjectionCompany guidanceMed
Fig 5415 TWh (2024)FactIEA (Apr 2025)High
Fig 5945 TWh (2030)ProjectionIEA (Apr 2025)High
Fig 7≤0.66% TFP / 10 yrProjectionAcemoglu, NBER w32487High
Fig 7~7% GDP liftProjectionGoldman Sachs (Mar 2023)High
Fig 8Experts 47% / public 11%FactPew (Apr 3 2025)High

The excluded-claims register

Equally important is what was kept out. Each of these is widely circulated; each was excluded from every chart for the stated reason.

Excluded claimWhy excluded
DeepSeek had “large volumes of H100s”Anonymous source; Nvidia states H800 not H100; Reuters could not verify.
DeepSeek trained R1 for ~$6M / $294K$294K is the RL stage only; all-in cost ~20× higher; headline disputed.
Unreleased-model benchmark scoresAggregator-invented; no primary source.
HLE >50% as “general reasoning”HLE has ~18% answer-error rate and calibration issues; over-reads the metric.

Back Matter

Full source register

Primary sources for every charted figure and major claim, in order of first appearance.

1Aschenbrenner, L. — Situational Awareness: The Decade Ahead (Jun 2024). situational-awareness.ai
2ARC Prize Foundation — ARC-AGI-2 leaderboard & o3 announcement. arcprize.org/blog/oai-o3-pub-breakthrough
3Scale AI / Google — Humanity’s Last Exam; Gemini 3 results. blog.google · scale.com
4Epoch AI — FrontierMath re-evaluation; Hyperscaler capex trend. epoch.ai/data-insights/hyperscaler-capex-trend
5METR — “Measuring AI Ability to Complete Long Tasks” (Mar 2025). metr.org
6OpenAI — “Announcing the Stargate Project” (Jan 2025). openai.com/index/announcing-the-stargate-project
7SoftBank Group — Stargate capacity update (Sep 2025). group.softbank/en/news/press/20250924
8CNBC — Nvidia single-day market-value loss (27 Jan 2025). cnbc.com
9IEA — Energy and AI (Apr 2025). iea.org/reports/energy-and-ai
10Lawrence Berkeley National Laboratory (DOE) — US data-center energy report (2024). eta.lbl.gov
11US Department of Justice — United States v. Linwei Ding, conviction (29 Jan 2026). justice.gov/opa
12Financial Times — Nvidia chip-smuggling investigation (Jul 2025). ft.com
13RAND Corporation — “Securing AI Model Weights” (May 2024). rand.org/pubs/research_reports/RRA2849-1
14The White House — EO 14110 (Oct 2023); EO 14179 (Jan 2025); Genesis Mission EO (Nov 2025). federalregister.gov
15CSET / Congressional Research Service — export-control chronology (R48642). congress.gov
16TechCrunch — AI Diffusion Rule rescission (13 May 2025). techcrunch.com
17OpenAI — ChatGPT weekly-active-user figures (Oct 2025; Feb 2026). openai.com
18Pew Research Center — US workers and AI (Sep 2025); public-vs-expert (Apr 2025); concern trend (Jun 2025); 25-country survey (2025). pewresearch.org
19MIT Project NANDA — “The GenAI Divide: State of AI in Business 2025” (Jul 2025). media.mit.edu
20Acemoglu, D. — “The Simple Macroeconomics of AI,” NBER w32487 (Aug 2024). nber.org/papers/w32487
21Goldman Sachs — “The Potentially Large Effects of AI on Economic Growth” (Mar 2023). goldmansachs.com
22International Monetary Fund — “AI Will Transform the Global Economy” (Jan 2024). imf.org
23Amodei, D. — “Machines of Loving Grace” (Oct 2024); “The Adolescence of Technology” (Jan 2026). darioamodei.com
24Kokotajlo, D. et al. — AI 2027 (Apr 2025); author recalibration (Nov 2025). ai-2027.com
25Apollo Research — in-context scheming evaluations (Dec 2024). apolloresearch.ai
26OpenAI & Apollo Research — deliberative-alignment anti-scheming results (Sep 2025). openai.com
27Anthropic — “Scaling Monosemanticity” (May 2024). transformer-circuits.pub
28International AI Safety Report (Bengio, chair) — full report (Jan 2025); Key Updates (Oct–Nov 2025); second report (Feb 2026). gov.uk
29Understanding AI (T.B. Lee) — GPT-5 SWE-bench comparison (Aug 2025). understandingai.org
30Stanford HAI — 2025 AI Index Report (inference-cost decline; adoption). hai.stanford.edu/ai-index/2025-ai-index-report
31NIST / US & UK AI Safety Institutes — joint o1 pre-deployment evaluation (Dec 2024); US AISI rebranded CAISI (Jun 2025); summit track Bletchley (2023) · Seoul (2024) · Paris (2025). nist.gov · aisi.gov.uk
32Fortune / Bernstein Research — Nvidia–OpenAI investment and “circular financing” analysis (Sep 2025). fortune.com

URLs are roots of the cited primary material. Where a figure is a projection, the citation is to the projecting body, not to a realized outcome. Access horizon: May 2026.