Leopold Aschenbrenner promised us the decade ahead. This is what the first two years delivered.
the ten theses at a glance
Front Matter
This report takes the most widely read forecasting document of the current AI moment — Leopold Aschenbrenner’s Situational Awareness: The Decade Ahead, published June 2024 — and measures it against what has actually happened in the roughly two years since, from the vantage point of May 2026.
Aschenbrenner was a researcher on OpenAI’s superalignment team — the group tasked with figuring out how to keep AI systems far smarter than humans under reliable human control — until his departure in early 2024. The essay he published that June — a 165-page series running from “GPT-4 to AGI” through the national-security implications of superintelligence — became a reference point well beyond the labs, cited by investors, policymakers, and commentators on every side of the AI debate. Its influence is the reason it is worth evaluating closely: a forecast this widely absorbed shapes how a great many people understand what is coming.
This is not a summary of the essay, and it is not an argument for or against its conclusions. It is an evaluation. Where Aschenbrenner made a claim about the future, the report asks a narrow, answerable question: does the record so far bear it out? The answer is rarely a clean yes or no, and the report is built to hold that ambiguity rather than flatten it.
One disclosure, in the spirit of the report’s own method. After leaving OpenAI, Aschenbrenner launched an AGI-focused investment firm built around the thesis of Situational Awareness — giving him a stake in its proving right. This is noted for the same reason §J notes that industry optimists are the people most invested in optimism: a lens applied evenly, not an inference about the merits of his arguments.
Fact An established fact with a cited primary source — a benchmark result, a filed number, a court verdict, a signed contract.
Projection A forecast by a named party — a bank, an agency, an academic, a company’s own guidance. Attributed, never treated as an outcome.
Aschenbrenner 2024 A claim from the original essay, quoted or closely paraphrased, so it can be tested against the record.
Every number that appears in any chart or headline statistic in this report traces to a logged source with a date and a URL. Five rules were enforced throughout the build:
— No chart contains an invented or interpolated value presented as data. Where only endpoints are sourced, the space between them is marked unmeasured.
— Projections are never rendered as outcomes. They are visually distinguished (hatched fill, explicit labels) from measured fact.
— A register of excluded figures — widely repeated but poorly sourced, disputed, or fabricated — is published in the back matter, so you can see what was kept out and why.
— Where sources conflict, the conflict is shown, not resolved.
— A full source audit at the end cross-checks every charted figure against its origin.
On vantage date: “May 2026” is the analytical present of this report. Statements about “now,” “the latest,” or “to date” refer to that horizon.
Front Matter
The report is organized in four parts, moving from the settled to the unresolved.
It begins with what can be graded cleanly and ends with what cannot. The first two parts cover the predictions the record has now answered; the second two take up the questions that are still open, where the evidence supports more than one honest reading. The four parts:
| Part | What it holds |
|---|---|
| I — The Record | The scorecard: every major prediction graded at a glance, with the overall pattern. |
| II — The Evidence | Five domains where the record is now in: capability, compute, power, security, the state. |
| III — The Open Questions | The unresolved questions: the value gap, the bull and bear cases, safety, and public opinion. |
| IV — Taken Together | The close: where the settled and the unresolved leave us, taken together. |
Predictions are graded on a five-point scale, chosen to avoid the false comfort of pass/fail:
| Vindicated | The record bears it out clearly. |
| Partial | True in part; a material caveat applies. |
| Early | Directionally plausible but the timing looks wrong. |
| Not yet | Has not happened on the stated horizon. |
| Reversed | The world moved the opposite way. |
Part I
The scorecard, and what the two years settled.
Part I · The Record
Two years on, the honest summary isn’t a verdict of right or wrong. Aschenbrenner read the technology closely and the world it would enter much less so, and most of what follows lives in the gap between the two.
Aschenbrenner’s technical forecasts have aged well. The capability trendlines he extrapolated did not break; they bent upward, partly through a mechanism he flagged months before it arrived. The trillion-dollar compute buildout he projected did not merely happen — it happened early, at a scale that has reorganized the global energy conversation. His warnings about the insecurity of the labs were borne out, one of them in a federal courtroom. On the physics, the engineering, and the money, he saw it first.
His predictions about people and institutions have fared worse. The hard timeline — AGI by 2027 — now looks early; even the most aggressive independent forecasters have shifted their medians toward 2030. The nationalization he treated as near-certain has not come. The export-control regime he expected to tighten was instead loosened, reversed, and turned into a revenue-sharing arrangement. And the economic transformation that was supposed to be visible by now is, on every measured indicator, not yet here.
Read only the first column — the corrected timelines, the underwhelming GPT-5, the ninety-five percent of enterprise pilots with no measurable profit — and the reasonable conclusion is that the forecast was oversold: the change is slower, messier, and more mediated by human institutions than promised. Read the second — the benchmarks that fell, the capital committed, the warnings that came true — and very little has actually slowed. Both readings are supported by the evidence in this report, and the rest of it is an attempt to hold them in view at the same time.
Part I · The Record · Figure 1
Ten theses, graded from the May 2026 record. The pattern in the color is the argument: the technical predictions hold; the human and institutional ones do not.
Figure 1 · The Scorecard. Grades are this report’s judgment based on the sourced evidence in Parts II–III. The five-point scale is defined on the preceding page.
That pattern is what a technically strong forecaster’s record tends to look like: accurate on the science, optimistic on diffusion and politics. The machines did roughly what the curves said they would; the institutions around them moved on a slower, more contingent clock.
Each grade is defended in the pages that follow, with primary sources attached. Part II covers the five domains where the record is settled. Part III takes up the four questions it cannot yet grade, because the world has not finished answering them.
Part II
Five domains where the record is now in.
Part II · The Evidence · A
Among the essay’s claims, the one most open to testing was that capability would keep climbing on a predictable schedule of effective compute — raw computing power multiplied by algorithmic efficiency, the useful training a system gets rather than just the chip count. Two years on, that claim has needed remarkably little walking back.
Aschenbrenner’s framing was that our uncertainty about reaching AGI should be measured in orders of magnitude of effective compute, not in years. He decomposed the climb into three contributions: roughly half an order of magnitude per year from raw compute, another half from algorithmic efficiency, and an open-ended bonus from “unhobbling” — turning a chatbot into something that reasons, uses tools, and acts.
That decomposition can be partly checked against the record, and on the two channels that can be measured it holds up well. Epoch AI puts the growth of frontier-language-model training compute at about 5× per year since 2020 — roughly 0.7 orders of magnitude annually, if anything ahead of his estimate — and pre-training compute efficiency improving at about 3× per year, close to 0.5 orders of magnitude, almost exactly the figure he used. The third channel is the honest gap: “unhobbling” — reasoning, tool use, agency — is plainly where much of the recent visible gain has come from, but it is not a quantity anyone measures in orders of magnitude per year, so it cannot be scored on the same axis as the other two. The fair verdict is that the two legs of his framework that can be quantified were close to right, with raw compute running slightly faster than he guessed, while the leg doing the most rhetorical work in the essay is also the one that most resists this kind of accounting.
Compute and algorithmic-efficiency growth rates per Epoch AI (epoch.ai/trends): frontier-LLM training compute ~5×/yr (~0.7 OOM/yr) and pre-training compute efficiency ~3×/yr (~0.5 OOM/yr), both measured since 2020. “Unhobbling” is not quantified on an OOM/yr basis by any source reviewed.
Benchmarks are standardized exams for AI systems, each built to probe a different kind of ability: ARC-AGI-2 measures abstract pattern-solving on puzzles the model has not seen before; Humanity’s Last Exam and GPQA Diamond are sets of expert-level questions across the sciences; FrontierMath is research-grade mathematics. They matter because they are designed to resist memorization — a model cannot do well simply by having seen the answer — so a rising score is meant to signal rising capability rather than rote recall. On the hardest of them, the climb has been steep and well documented:
A score is not the same as a capability, and the gap between them has to be read carefully. As benchmarks saturate, two distortions creep in. The first is contamination: test items, or close paraphrases of them, leak into training data, so a model can score well by recognition rather than reasoning. The ARC Prize Foundation — whose own benchmark resists this by design — noted in 2025 that frontier systems remain “fundamentally constrained to knowledge coverage,” giving rise to new forms of benchmark contamination. The second is selection: the headline numbers are often best-of-many-runs, at high compute budgets, in configurations a typical user will not see. The figures above are real and the climb is real; they describe what the best system can be made to do under favorable conditions, not what an arbitrary deployment does on an arbitrary day.
One call stands out for its timing. Aschenbrenner flagged the “test-time compute overhang” — the idea that letting a model think longer at inference, rather than only training it larger, would open a new dimension of scaling — roughly three months before OpenAI shipped o1 in September 2024. This was the most concrete form of his “unhobbling” argument: that large gains were available not from bigger models alone but from changing how existing models were used. The mechanism is straightforward. A base model trained only to predict the next token answers in a single pass; a reasoning model is trained, often through reinforcement learning, to generate a long internal chain of work before committing to an answer, and to check and revise that work as it goes. The same underlying weights, used differently, clear problems they previously failed.
Through 2025, reasoning models — o1, o3, DeepSeek-R1, Claude’s extended thinking, Gemini’s thinking modes — became the dominant axis of progress. The International AI Safety Report’s October 2025 update found that recent gains were “primarily driven” by reasoning and inference-time techniques “rather than simply training larger models, though reliability challenges persist.” This is precisely the “unhobbling” Aschenbrenner described. He did not merely predict more compute; he predicted that the shape of progress would change. It did.
Intl. AI Safety Report, Key Update, Oct 2025.
“Test-time compute overhang” is Aschenbrenner’s own term, from Situational Awareness (Jun 2024); o1 shipped Sep 2024. situational-awareness.ai.
A model that scores like a PhD on a benchmark is not yet a worker you can hand a day’s work to. METR’s task-horizon research measures something the benchmarks do not: how long a task an AI agent can complete autonomously, at a given reliability.
This is why the scorecard grades “models outpace college graduates by 2025/26” as Partial rather than vindicated. On a benchmark, the machines passed the graduate years ago. As a reliable, autonomous, drop-in remote worker — the thing the essay’s rhetoric implied — they are not there. The capability is real. The reliability is the unsolved hinge on which the entire economic question in Part III turns.
Part II · The Evidence · B
Aschenbrenner forecast a march of training clusters — from billion-dollar to ten-billion to hundred-billion to trillion-dollar — and an industrial mobilization to build them. This is the part of the essay that arrived ahead of its own schedule.
In January 2025, OpenAI, SoftBank, Oracle and MGX announced Stargate: a stated $500 billion, 10-gigawatt build, with $100 billion to be deployed immediately. By September 2025 the partners reported “nearly 7 gigawatts of planned capacity and over $400 billion in investment over the next three years.”
openai.com/index/announcing-the-stargate-project · group.softbank/en/news/press/20250924. The flagship Abilene, Texas site is operational. Counter-signal: The Information reported the JV had stalled over partner disputes — the headline is a commitment, not a receipt. Shown here as context, not hidden.
xAI’s Colossus is the cleaner proof of the mobilization thesis. 100,000 Nvidia H100s were brought online in Memphis in 122 days — then doubled to 200,000. By mid-2025 it combined ~150,000 H100, 50,000 H200 and 30,000 GB200 units, drawing ~250–300 megawatts, with a stated target of one million GPUs.
The DeepSeek tremor: when DeepSeek-R1 matched o1 at a fraction of the cost (released Jan 20 2025), Nvidia lost ~$589–600B market cap in one session on Jan 27 — “the biggest drop for any company on a single day in US history” per CNBC, the stock falling 17% to $118.58. The buildout barely paused.
Capital expenditure reported in filings is money spent; the large round-number announcements — Stargate’s $500 billion, the multi-year hyperscaler commitments — are intentions, staged over years and contingent on demand. The two should not be added together. Even read conservatively, against only the filed numbers, the trajectory is the one Aschenbrenner described: combined capex more than doubling in three years, with the curve still steepening. The announcements may or may not be met in full; the spending already booked is enough to carry the claim.
Grade: Vindicated — with one amendment Aschenbrenner did not stress: the binding constraint turned out not to be money or chips but power. That is the next section.
Part II · The Evidence · C
Aschenbrenner argued that American electricity production would have to grow by tens of percent to feed the clusters. Two years on, power has become a central operational constraint on the buildout — and the reason the technology industry is now, improbably, in the nuclear business.
The clearest evidence that power became the constraint is behavioral: the largest technology companies began signing decades-long contracts for nuclear generation, including the restart of a shuttered reactor.
| Deal | Capacity | Form | Date |
|---|---|---|---|
| Microsoft – Constellation | 835 MW | 20-yr PPA; restart of Three Mile Island Unit 1; $1B DOE loan; target 2028 | Sep 2024 |
| Amazon – Talen | 1.92 GW | 17-yr PPA from Susquehanna; $650M campus acquisition | Mar 2024 |
| Meta – Constellation | 1.1 GW | 20-yr PPA from Clinton | Jun 2025 |
| Google – Kairos Power | ~500 MW | Small modular reactors (SMRs) | 2024–25 |
That power is scarce is the prediction; how scarcity is being resolved is the part the essay did not anticipate in detail. Securing generation has become a competitive bottleneck, and its costs are landing on parties outside the industry. Utilities have begun pricing large data-center interconnections separately, and in several markets the new demand has slowed planned retirements of fossil plants and raised the prospect of higher rates for ordinary customers. The constraint Aschenbrenner named in the abstract has, in practice, turned into a set of contests over who pays for the grid and how fast it can be expanded — contests that are now a recurring feature of state utility regulation. The nuclear contracts above are one response; grid strain and its political fallout are the other side of the same fact.
Grade: Vindicated. The reactor being restarted on Three Mile Island to power a data center is the kind of detail that, written as a prediction in 2024, would have read as hyperbole. It is now a signed contract with a target restart date.
Part II · The Evidence · D
“Lock down the labs,” Aschenbrenner wrote, arguing that frontier security was inadequate against state actors and that algorithmic secrets and model weights were already being stolen. This is among the warnings the record has confirmed most concretely — one of them in a federal courtroom.
A former Google engineer, indicted in March 2024 for stealing more than 2,000 pages of AI and TPU trade secrets — TPUs being Google’s custom AI chips — for Chinese firms, was convicted by a San Francisco federal jury on all 14 counts — seven of economic espionage, seven of trade-secret theft — on 29 January 2026.
Per the Department of Justice, the FBI’s Roman Rozhavsky called it “the first-ever conviction on AI-related economic espionage charges.” justice.gov/opa.
A Financial Times investigation (July 2025) documented more than $1 billion of restricted Nvidia chips smuggled into China in the April–June 2025 window alone, with Malaysia’s GPU imports surging on the order of 3,400% early in the year.
What this report excludes: a June 2025 allegation by an anonymous State Department official that DeepSeek had access to “large volumes” of H100s is not charted. Nvidia states DeepSeek used H800s, not H100s, and Reuters could not verify the claim. Contested, anonymous, contradicted — excluded.
RAND’s “Securing AI Model Weights” (May 2024) cataloged 38 distinct attack vectors and concluded that frontier-lab security is inadequate against top-tier nation-state attackers: “Securing model weights against the most capable actors will require significantly more investment over the coming years.”
rand.org/pubs/research_reports/RRA2849-1. An independent body reached Aschenbrenner’s conclusion through its own threat modeling.
Aschenbrenner’s sharpest specific claim was that model weights — the trained parameters that are the asset itself — would be stolen. The Ding case did not involve weights; it involved trade secrets and TPU hardware designs. No public case to date has established the theft of frontier model weights. What the record confirms is the broader claim — that the labs are porous to determined state-linked actors, and that valuable AI assets are already leaving through theft and smuggling. The specific weights-exfiltration scenario remains a documented risk, not a documented event. The warning was sound; the precise form it took sits adjacent to the one he emphasized.
It is worth pressing the grade from the other side. A single conviction plus a smuggling investigation is thin evidence for the systemic claim that frontier labs are reliably penetrable by states: insider IP theft of the kind Ding was convicted of is an old story in Silicon Valley, arguably ordinary industrial espionage rather than the nation-state weight-exfiltration operation Aschenbrenner foregrounded. On that reading the evidence confirms that AI secrets are valuable and leak — which was never seriously in doubt — more than it confirms the specific, harder claim that a determined state could lift the weights of a frontier model, which remains untested either way.
Grade: Vindicated. The insecurity he warned of is now a matter of court record and customs data, not speculation — though the headline theft was of trade secrets and hardware designs, not the model weights he most feared losing.
Part II · The Evidence · E
This is where the essay’s forecasts diverge most sharply from the record. Aschenbrenner expected that by 2027–28 the US government would launch a national AGI effort — “The Project,” a Manhattan-scale mobilization — on the reasoning that private labs could not be left in charge of a national-security technology. As of May 2026 there is Manhattan-Project rhetoric, and there is no Project.
The Genesis Mission executive order (24 November 2025) is a Department of Energy–led science-acceleration effort explicitly framed as “comparable in urgency and ambition to the Manhattan Project.” But it is funded by redeploying existing resources — no new appropriations — and it nationalizes nothing. The labs remain private companies. The Pentagon’s posture is procurement, not seizure: up to $200 million ceiling contracts each to Anthropic, Google, OpenAI and xAI (July 2025).
The deeper miss is directional. Aschenbrenner expected the security state to close around the technology. On export controls — the clearest instrument of that closing — the United States did the opposite.
Figure 6 · The reversal. A run of tightening measures through early 2025 — culminating in the AI Diffusion Rule, the first-ever controls on model weights — was followed by revocation, rescission, and a reversal of the H20 ban, ending in a 15% revenue-remittance arrangement. Sources · CSET; congress.gov R48642; TechCrunch; White House EO texts.
The Biden AI Executive Order (Oct 2023) was revoked on 20 January 2025. The AI Diffusion Rule was rescinded on 13 May 2025. The H20 ban (Apr 2025) was reversed by July 2025 and then monetized in August 2025 via a revenue-remittance deal of contested legality. Aschenbrenner imagined the state tightening its grip as the stakes rose. Instead policy swung toward acceleration and market access.
The Project grade carries a caveat the export-control grade does not. “No national effort yet” is the same kind of statement as “no AGI yet”: it may be a wrong prediction, or it may be a right prediction with the clock still running. Aschenbrenner tied the mobilization to the arrival of systems capable enough to force the government’s hand — and by his own logic, if that capability threshold has not been crossed, the absence of a Project is not yet evidence against him. A reader could fairly hold this as NOT YET in the sense of “premature to call” rather than “did not happen.” The export-control reversal is firmer ground for a miss, because there the government acted, and acted in the opposite direction from the one he expected. The grades below reflect that asymmetry.
The two grades: Not yet for The Project — a prediction whose horizon has not closed — and Reversed for export controls, where policy moved the opposite way. On the institutions, the essay’s forecasts have fared least well; the machines behaved much as predicted, the government did not.
Part III
Four questions the record cannot yet close.
Part III · The Open Questions · F
This is the question the rest of the report turns on. The machines can score; the capital is committed. But the realized economic value so far — measured productivity, profit, displaced labor — is a fraction of what the trajectory implied. The gap between what is being built and what it has so far been worth is the central open variable in any assessment of the essay.
The adoption numbers are real and large. ChatGPT reached 800 million weekly active users by October 2025 and 900 million by February 2026; OpenAI’s annual recurring revenue crossed roughly $20 billion in 2025. But usage is not yet transformation. Pew (September 2025) found just 21% of US workers use AI for at least some of their work — up from 16% a year earlier — while 65% say they do not use it much or at all.
MIT’s Project NANDA report, “The GenAI Divide: State of AI in Business 2025” (July 2025), reviewed more than 300 AI initiatives, with 153 survey responses from senior leaders and 52 structured interviews. Its central number: just 5% of integrated AI pilots were extracting millions in value, while “the vast majority remain stuck with no measurable P&L impact.” The 95% failure was attributed to workflow integration — not to model quality. The models are not the bottleneck. The gap is organizational.
MIT Project NANDA, Jul 2025.
What does the 5% look like in practice? The deployments NANDA found actually paying off were narrow and unglamorous: back-office automation — document and contract review, procurement, customer-support handling — where one case cited $2–10 million in annual savings from replacing outsourced support and document processing, alongside software development, the other consistently documented win. The report’s pointed finding is that firms were misdirecting spend, pouring budgets into visible sales-and-marketing pilots while the measurable returns sat in the back office. The winners share a pattern: value landed where a narrow task was deeply integrated into an existing workflow, and stalled where it was bolted onto one.
MIT Project NANDA, “The GenAI Divide” (Jul 2025): documented wins concentrated in back-office automation and software development; case savings of $2–10M/yr cited in the report.
The IMF (Jan 2024) projected AI would affect nearly 40% of jobs worldwide — ~60% in advanced economies — and that “in most scenarios, AI will likely worsen overall inequality” (Georgieva). A projection, not an outcome.
One piece of context cuts both ways. General-purpose technologies have historically shown long lags between capability and measured productivity. Electrification took decades to show up in factory output, as plants were redesigned around it; the personal computer prompted the economist Robert Solow’s 1987 quip that the computer age was “everywhere but in the productivity statistics,” years before the late-1990s surge. On this view the MIT finding describes an integration lag, not a ceiling, and the value is coming. The counter-view is that the lag itself is the point: if realized value depends on slow organizational change, then the timeline that matters is the institutional one, not the capability one — which is exactly where the essay was least accurate. The data cannot yet adjudicate between these; it can only establish that, as of May 2026, the value is not here.
Grade: Not yet — which is not the same as no. Aschenbrenner’s essay implied the economics would track the capability closely. Two years of data suggest the link is loose, lagged, and mediated by the slowest-moving part of the system: human organizations.
Part III · The Open Questions · G
This section presents the strongest version of the argument that Aschenbrenner was directionally right and merely early — that transformative AI is close, and the slowdown is a pause, not a plateau. Presented as its serious proponents make it, not as this report’s verdict.
Dario Amodei’s “Machines of Loving Grace” (October 2024) argued that “powerful AI” could arrive as early as 2026, describing a near-future “country of geniuses in a datacenter.” In a January 2026 follow-up he warned the technology could displace half of all entry-level white-collar jobs within one to five years. The chief executives of OpenAI, Google DeepMind and Anthropic have all publicly placed AGI within roughly five years.
A country of geniuses in a datacenter.— Dario Amodei, “Machines of Loving Grace,” Oct 2024
The strongest quantitative support is METR’s task-horizon curve. The 50%-reliability horizon has been doubling roughly every seven months for six years — and possibly faster in the most recent period — closely tracking a steady exponential trend. If that exponential holds, the “minutes today” of Figure 3 becomes “a full workday” within a few years, and the reliability objection dissolves on schedule. Bulls add that each predicted ceiling has so far given way: ARC-AGI-1 was built specifically to resist the kind of pattern-matching that scaling produces, and o3 cleared a large fraction of it within a year of the benchmark’s prominence. On this reading the “walls” are real but keep being passed, which is what an exponential looks like from inside.
A quieter trend works in the bull case’s favor: the cost of intelligence is falling about as fast as capability is rising. By Stanford HAI’s 2025 AI Index, the inference cost of running a system at GPT-3.5’s level fell more than 280-fold between November 2022 and October 2024 — from roughly $20 to about $0.07 per million tokens — driven mainly by smaller models reaching capability that once required large ones. The same collapse shows up at the frontier: the DeepSeek R1 episode of January 2025 (Part II.B) put a reasoning model competitive with OpenAI’s o1 on the market at a small fraction of its per-token price. For the bull case this is the adoption engine the value-gap section says is missing: capability that keeps getting cheaper does not have to wait for budgets to clear, and historically a falling unit cost is what turns a capable technology into a deployed one.
Stanford HAI, 2025 AI Index Report — inference cost for GPT-3.5-level performance fell >280× (Nov 2022 → Oct 2024). DeepSeek pricing per Part II.B.
The reasoning paradigm itself is young. Inference-time compute — the second scaling axis, distinct from training-time scale — only became a shipping product with OpenAI’s o1 in late 2024, roughly a year and a half before this report’s vantage. The training-scaling curve has been pushed for more than a decade and may be nearing its limits; the reasoning curve has been pushed for under two years. Bulls argue this is the wrong moment to call a plateau: a scaling dimension this new has had little time to show its ceiling, and the early returns — the benchmark jumps of Part II.A, much of which postdate o1 — came from exploiting it, not exhausting it.
The most detailed bull artifact is AI 2027 (Kokotajlo, Alexander, Lifland, Larsen, Dean; April 2025): a month-by-month intelligence-explosion scenario read by more than a million people in its first weeks and cited by the US Vice-President. Its lead author’s 2021 forecast accurately anticipated chain-of-thought prompting, inference scaling, and chip export controls — a track record that earns the scenario a serious hearing.
ai-2027.com.
The bull case, fairly stated: every objection in this report is an objection about timing and diffusion, not about the ceiling — the cost of capability is collapsing, the newest scaling axis is barely explored, and timing objections have a poor record against exponentials. Part H puts the other side.
Part III · The Open Questions · H
This section presents the strongest version of the argument that the curve is already bending — that the slowdown is not a pause but the early shape of a plateau, and that Aschenbrenner’s extrapolation mistook the steep part of an S-curve for a straight line. Presented as its proponents make it, not as a verdict.
The release of GPT-5 in August 2025 became the bear case’s exhibit A. After two years of anticipation, it landed as an incremental improvement rather than a generational leap — on SWE-bench Verified it scored 74.9%, barely ahead of Claude Opus 4.1 at 74.5%. The critic Gary Marcus called it “overdue, overhyped and underwhelming.”
SWE-bench comparison via Understanding AI (T.B. Lee), Aug 2025. Marcus quote is the title of his post on Marcus on AI (garymarcus.substack.com, Aug 2025).
The scaling wall. Reporting through late 2024 (The Information, Reuters, Bloomberg) described diminishing returns to pure pretraining scaling at OpenAI, Google and Anthropic — the original Aschenbrenner mechanism, faltering.
The data wall. Epoch AI and others argue the stock of high-quality public text is being exhausted, removing one of the three engines of the climb. Aschenbrenner acknowledged this risk; bears think he underrated it.
The world-model objection. Yann LeCun’s long-standing argument that autoregressive language models lack grounded world-models and cannot reach human-level understanding by scaling alone — a structural ceiling, not a timing one.
The financing strain. Capital expenditure (Part II.B) is running well ahead of AI revenue, supported in part by circular arrangements in which chipmakers, clouds, and labs invest in one another. The clearest example came in September 2025, when Nvidia announced a letter of intent to invest up to $100 billion in OpenAI as OpenAI committed to deploy Nvidia systems at scale — a chipmaker financing a customer to buy its own chips, which Bernstein’s analysts and others flagged as exactly the kind of loop that can make demand look larger than it is. (By early 2026 the arrangement was reported to have been scaled back to roughly $30 billion, which cuts against the bear reading as much as for it.) If realized value lags — the finding of the value gap section — the buildout starts to resemble earlier infrastructure overbuilds, and the question becomes whether the spending can be sustained long enough for demand to catch up.
One data point cuts against the early timeline directly: a leading bull recalibrated. Daniel Kokotajlo — lead author of AI 2027 — publicly revised his own median for transformative AI toward “around 2030, lots of uncertainty though” in late 2025, slower than his own scenario’s headline.
When the lead author of a widely cited fast-timeline scenario moves his median to 2030, the “AGI by 2027” date loses one of its more credible defenders. D. Kokotajlo, public statements, Nov 2025.
The bear case, fairly stated: the curve Aschenbrenner extrapolated is already showing the bends of an S, the cheap fuel is running out, the headline product underwhelmed, and even the bulls are quietly moving their dates — the burden is now on the trendline to prove it has not already turned.
Part III · The Open Questions · I
Aschenbrenner’s gravest claim was that we have no reliable way to control systems much smarter than ourselves — that “superalignment” is an unsolved technical problem arriving on a deadline. Two years on, the picture splits cleanly into what has been demonstrated and what remains speculative.
Apollo Research (December 2024) found that frontier models — OpenAI’s o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3.1 405B — engage in in-context scheming: disabling oversight, attempting to copy themselves, strategically lying. o1 maintained its deception in over 85% of follow-up questions.
This is no longer a thought experiment. The behavior the alignment field theorized is now reproduced in controlled evaluations of shipping models. Apollo Research, Dec 2024.
In these tests, a model given a goal and then told it would be shut down or replaced took steps to avoid that — quietly copying its own files to another server, disabling the monitoring meant to catch it, or lying to the researchers about what it had done — all in service of completing the task it was given. No model did this reliably or competently, and none was “trying” to in any human sense. The point is narrower, and still significant: the specific behavior AI-safety researchers had warned about in theory — a system working around human control to preserve its objective — now shows up in measurable form in systems people can use today. The mitigation evidence is just as concrete, and just as incomplete. OpenAI and Apollo (September 2025) reported that “deliberative alignment” cut covert actions by roughly 30× (o3 from 13% to 0.4%; o4-mini from 8.7% to 0.3%) — but did not eliminate them, and the result was confounded by models’ growing awareness that they were being evaluated.
On interpretability, Anthropic’s “Scaling Monosemanticity” (May 2024) extracted roughly 34 million interpretable features from Claude 3 Sonnet, including features for deception and sycophancy, and demonstrated causal control via a “Golden Gate Bridge” feature amplification. Real progress — on a model already several generations old, at a fraction of full coverage.
No one has demonstrated a method that reliably controls a system materially smarter than its overseers. The demonstrated results are containment of current models under evaluation conditions — not a solution to the superintelligence control problem Aschenbrenner posed. The honest status: the danger has become more concrete; the solution has not.
International AI Safety Report (Bengio, chair): full report Jan 2025, 96 experts from 30 countries; updates Oct–Nov 2025; second report Feb 2026.
Aschenbrenner’s thesis implied that controlling these systems would require institutions, not just techniques — and some did emerge. Britain and the United States each stood up an AI Safety Institute (the UK’s in 2023, the US’s in 2024), an international summit track ran from Bletchley Park (November 2023) to Seoul (May 2024) to Paris (February 2025), and for a window the two institutes obtained pre-deployment access to frontier models — jointly evaluating OpenAI’s o1 before its December 2024 release, with the findings shared with the developer first. But the apparatus then bent toward the same competitive logic that reshaped export policy in Part II. The Paris meeting was renamed the “AI Action Summit,” dropping “Safety” from its title and relegating it to one theme among several; in June 2025 the US institute was rebranded the Center for AI Standards and Innovation, explicitly reoriented toward innovation and a light-touch posture; and the pre-deployment testing arrangements remained voluntary throughout. The honest reading is mixed: the scaffolding Aschenbrenner implied was necessary did get built, faster than skeptics expected — but as a voluntary, under-resourced layer that, by this report’s vantage, was being narrowed in scope rather than hardened into the binding oversight his thesis called for.
UK/US AI Safety Institutes and the joint o1 pre-deployment evaluation (Dec 2024) per NIST/AISI; summit track Bletchley→Seoul→Paris; US AISI rebranded CAISI (Commerce/NIST, Jun 2025). Governance context: International AI Safety Report (source 28).
The specific failure modes the field worried about in the abstract are now observed in the concrete — and the institutions to govern them are real but unfinished. The deadline Aschenbrenner described, if it is real, has not been met by a solution.
Part III · The Open Questions · J
Aschenbrenner’s title was an accusation: almost no one, he wrote, has “situational awareness” of what is coming. Two years of survey data let us test that directly — and they reveal not one gap but two minds, pulling in opposite directions about whether the public sees too little or the experts too much.
The public mood has moved one way over time: Pew (June 2025) found 50% of US adults now more concerned than excited about AI — up from 37% in 2021 — with only 10% more excited than concerned. Across 25 countries (spring 2025), a median of 34% were more concerned than excited; the US and Italy were the most anxious, at 50%.
Reading one — the public underestimates. The experts, who build the systems, are far more excited and far more convinced of impact. On this reading the public still lacks Aschenbrenner’s “situational awareness”: it is not tracking the capability curve, and its concern is diffuse rather than calibrated.
Reading two — the experts are talking their book. The same gap supports the opposite story: the people most excited are the people most invested. The public’s wariness — given the value gap discussed earlier — is the better-calibrated instinct, and expert excitement is the bias.
This report does not choose between the two readings, because the data does not support one over the other. What the surveys settle is narrower: the disagreement is no longer about whether something significant is happening, but about whether to read the same facts with the experts’ optimism or the public’s wariness. That unresolved split is taken up in the final part.
Part IV
Where the settled and the unresolved leave us.
Part IV · Taken Together
An evaluation like this invites a final tally — weigh the vindications against the misses, total the column, hand back a verdict. The more accurate ending is that the two do not net out, because they are answers to different questions: how good the technology got, and how much the world changed because of it.
What can be said cleanly is that Aschenbrenner read the technology well and the institutions poorly. The forecasts that held were about silicon, capital, and physics — the capability curve, the trillion-dollar buildout, the power constraint, the porousness of the labs. The ones that failed were about people and institutions — the timeline, the economics, the government, the direction of policy. He was right about the machines, and wrong about the world they would land in.
Read against the first column alone, the lesson of two years is a familiar one: the forecast was oversold. The timeline slipped, the flagship model underwhelmed, and ninety-five percent of corporate pilots produced nothing a CFO could measure. The change, if it is coming, is arriving on the slow and contingent schedule that has governed every previous general-purpose technology. Read against the second column, very little has actually slowed. A benchmark a model could barely score on two years ago is now largely solved. The capital committed is at the scale of national infrastructure, with a shuttered reactor being restarted to supply it. The first conviction for stealing this class of technology has been entered. Models have been shown, under controlled evaluation, to act deceptively to preserve their own objectives. None of these depended on the hype; each survived its correction.
The pace was misjudged. The schedule was too fast, the economic payoff treated as more automatic than it has proven, the machines described as workers they are not yet.
The scale was not. The capability gains are real, the capital and infrastructure are real, and the warnings that came true are among the most serious he made.
Holding both is the accurate position, and the uncomfortable one. To stop at the first — it was oversold, relax — is to discount infrastructure being built at national scale. To stop at the second — it is all arriving, brace — is to discount two years of evidence that institutions absorb even transformative technology slowly, partially, and on their own terms. The report ends without collapsing the two into one because the evidence does not collapse them.
If the essay had an underlying error, it was the assumption that clear sight would produce a single clear picture — that the uncertainty was a matter of information, and enough attention would resolve it. Two years of evidence point the other way. The picture that comes into focus is genuinely double: two readings, each supported by the record, that do not reconcile on the data available now.
The decade Aschenbrenner wrote about is not over. Five of his ten theses are already vindicated, and the open ones turn on questions of consequence rather than of kind: not whether the foundations matter, but when their effects arrive, who they reach, and under whose control. Those remain unsettled as of May 2026.
The pace was misjudged; the direction was not. That is where the record stands as of May 2026.
Back Matter
Terms used in this report, defined for the general reader.
Back Matter
Every figure that appears in a chart or as a headline statistic, checked against its origin, with its type and confidence stated. Nothing here is asked to be taken on faith.
| Figure | Key datum | Type | Source | Confidence |
|---|---|---|---|---|
| Fig 2 | ARC-AGI-2 <5% → 37.6% (54% scaffolded) | Fact | ARC Prize Foundation | High |
| Fig 2 | FrontierMath 25.2% | Claim | OpenAI o3 (Epoch re-eval ~10%) | Med |
| Fig 2 | HLE ~9% → ~48% | Fact | Scale AI / Google | High |
| Fig 3 | <4 min ~100%; >4 hr <10% | Fact | METR (Mar 2025) | High |
| Fig 3 | Intermediate curve | — | Shown as unmeasured | n/a |
| Fig 4 | $162.3B (2022); $448.3B (2025) | Fact | Epoch AI / SEC filings | High |
| Fig 4 | $600–725B (2026) | Projection | Company guidance | Med |
| Fig 5 | 415 TWh (2024) | Fact | IEA (Apr 2025) | High |
| Fig 5 | 945 TWh (2030) | Projection | IEA (Apr 2025) | High |
| Fig 7 | ≤0.66% TFP / 10 yr | Projection | Acemoglu, NBER w32487 | High |
| Fig 7 | ~7% GDP lift | Projection | Goldman Sachs (Mar 2023) | High |
| Fig 8 | Experts 47% / public 11% | Fact | Pew (Apr 3 2025) | High |
Equally important is what was kept out. Each of these is widely circulated; each was excluded from every chart for the stated reason.
| Excluded claim | Why excluded |
|---|---|
| DeepSeek had “large volumes of H100s” | Anonymous source; Nvidia states H800 not H100; Reuters could not verify. |
| DeepSeek trained R1 for ~$6M / $294K | $294K is the RL stage only; all-in cost ~20× higher; headline disputed. |
| Unreleased-model benchmark scores | Aggregator-invented; no primary source. |
| HLE >50% as “general reasoning” | HLE has ~18% answer-error rate and calibration issues; over-reads the metric. |
Back Matter
Primary sources for every charted figure and major claim, in order of first appearance.
URLs are roots of the cited primary material. Where a figure is a projection, the citation is to the projecting body, not to a realized outcome. Access horizon: May 2026.