The data bottleneck behind AI progress
June 20, 2026 · 8:19 AM

Dwarkesh Patel argues that recent AI progress is powered less by a leap in sample efficiency than by a vast, expensive expansion of task-specific data. The article explains why that critique matters, why it may still support commercial automation, and what question it leaves open for AI research.

The episode's sharpest claim is not that today's models are weak. It is that they are strong in a suspiciously expensive way. In a new essay-podcast, Dwarkesh Patel names sample efficiency as the central bottleneck: how much data a system needs before it can perform fluently in a domain 1. His answer is uncomfortable for anyone betting on clean, self-propelled intelligence: recent progress may have come less from models learning more like humans and more from widening the data distribution they are trained on 1.

The argument: progress is being bought with data

Patel frames reinforcement learning as a kind of synthetic data factory. A lab spends compute against a verifier, rubric, or judge, searches for good rollouts, then trains the model to predict those successful rollouts much as pretraining trains a model to predict internet text 1. That process still needs the model to assign some prior probability to a good solution, which is why expert trajectories remain so important 1.
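The loop described above can be sketched as a toy filter-then-imitate procedure. This is a minimal illustration, not how any lab implements RL: the categorical "policy", the `verifier`, the mixing rate `lr`, and all numbers are hypothetical, and the update is a simple rejection-sampling step rather than a policy-gradient method. It shows why a nonzero prior on a good solution matters: if the policy never samples a correct rollout, there is nothing to train on.

```python
import random


def sample_rollouts(policy, k, rng):
    """Draw k candidate answers from a categorical policy {answer: probability}."""
    answers = list(policy)
    weights = [policy[a] for a in answers]
    return rng.choices(answers, weights=weights, k=k)


def train_on_successes(policy, successes, lr=0.5):
    """Move the policy toward the empirical distribution of verified rollouts."""
    if not successes:
        return dict(policy)  # no rollout passed the verifier, so there is nothing to imitate
    target = {a: successes.count(a) / len(successes) for a in policy}
    return {a: (1 - lr) * policy[a] + lr * target[a] for a in policy}


def run(policy, verifier, rounds=5, k=200, seed=0):
    """Repeat: sample rollouts, keep the ones the verifier accepts, train on them."""
    rng = random.Random(seed)
    for _ in range(rounds):
        rollouts = sample_rollouts(policy, k, rng)
        successes = [r for r in rollouts if verifier(r)]
        policy = train_on_successes(policy, successes)
    return policy


def verifier(answer):
    """Hypothetical checker: only the answer "42" counts as a good rollout."""
    return answer == "42"


# Small but nonzero prior on the correct answer: search finds it and training amplifies it.
with_prior = run({"41": 0.60, "42": 0.05, "43": 0.35}, verifier)
# Zero prior on the correct answer: it is never sampled, so the policy never changes.
no_prior = run({"41": 0.60, "42": 0.00, "43": 0.40}, verifier)

print(with_prior["42"], no_prior["42"])
```

In this toy setting, expert trajectories play the role of raising the prior from zero to something the search can find.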
The less glamorous part of the AI boom, in this telling, is the expert-data labor stack. Patel points to data work that looks extremely specific: specialists polishing Word files, legal experts writing realistic M&A diligence or securities filings, and consultants producing market-research templates 1. Each desired skill may require hundreds of experts generating completions, rubrics, and chains of thought; Patel says the industry producing these labels and RL environments is already earning billions in revenue and may soon reach the tens of billions 1.
[Image: Dwarkesh Patel episode thumbnail. Caption: The episode asks what is really driving AI progress: the model architecture, or the hidden data work underneath it 1.]
That leads to the most useful mental model in the episode. Frontier systems can look like a "galaxy glittering with capabilities," Patel says, but the force holding it together is an "unimaginably massive black hole of data" 1. The metaphor matters because it shifts attention away from model demos and toward the hidden data machinery under them.

The human comparison is brutal

Patel's comparison between humans and models is not subtle. A person hearing and seeing about 2,000 words per hour would encounter roughly 200 million language tokens from birth to adulthood, while frontier models are trained on tens to hundreds of trillions of tokens 1. He calls that close to a million-fold difference 1.
| Domain | Human learning example | AI comparison Patel draws |
| --- | --- | --- |
| Language | Roughly 200 million language tokens from birth to adulthood | Frontier models see tens to hundreds of trillions of tokens 1 |
| Robotics | A person can learn to teleoperate a new humanoid or robot arm within hours | Millions of robot demonstration hours still have not made AI robust at complex open-ended tasks 1 |
| Driving | A teenager can learn to drive with about 20 hours of practice | Waymo and Tesla need orders of magnitude more data to train self-driving systems 1 |
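The language comparison can be checked with back-of-envelope arithmetic. The sketch below uses the 2,000 words per hour quoted above; the 16 waking hours per day, the 18 years to adulthood, the one-word-equals-one-token simplification, and the two model-token endpoints are assumptions chosen for illustration, not figures from the episode.

```python
WORDS_PER_HOUR = 2_000        # figure quoted in the article
WAKING_HOURS_PER_DAY = 16     # assumption, not stated in the article
YEARS_TO_ADULTHOOD = 18       # assumption, not stated in the article

# Treat one word as one token (a simplification; real tokenizers differ).
human_tokens = WORDS_PER_HOUR * WAKING_HOURS_PER_DAY * 365 * YEARS_TO_ADULTHOOD

# Illustrative endpoints for "tens to hundreds of trillions" of training tokens.
model_tokens_low = 10e12
model_tokens_high = 200e12

ratio_low = model_tokens_low / human_tokens
ratio_high = model_tokens_high / human_tokens

print(f"Human exposure: {human_tokens:,} tokens")
print(f"Model-to-human ratio: {ratio_low:,.0f}x to {ratio_high:,.0f}x")
```

Under these assumptions the human figure lands near 210 million tokens, and the gap ranges from tens of thousands to just under a million, so "close to a million-fold" corresponds to the upper end of the range.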
The point is not that models are useless. It is that they need orders of magnitude more data than a person to reach comparable competence. Humans can generalize from sparse experience in ways current models do not. For AI labs, that creates a practical strategy: if the model cannot learn as efficiently, surround it with enough task-specific data that the task no longer sits outside the training distribution.
[Image: AI-generated illustration of the hidden training-data machinery that lets a system behave competently in many narrow domains.]

Patel's objections to the easy objections

The episode is strongest when Patel anticipates the standard pushbacks. The first is evolution: maybe humans look sample-efficient only because billions of years of selection did the pretraining. Patel's response is that the genome is about 3 GB, with only 1-2% of it protein-coding, far too small to store anything like the parameters of a frontier model 1. In his view, evolution may have found useful hyperparameters and loss functions, while much of the equivalent of parameter training still happens during a lifetime, as neural connections form 1.
The second objection is multimodal experience. Maybe humans see far more than language tokens. Patel grants that including sensory input might put lifetime exposure in the tens to hundreds of billions of tokens, but he argues that blind and deaf people can still have general intelligence despite lacking some sensory channels 1. That weakens the claim that raw sensory token volume is what explains human intelligence.
The third objection is scale. Larger models are more sample-efficient, so maybe another order of magnitude or two of parameters closes the gap. Patel invokes Chinchilla-style scaling laws and says that, even with infinitely many parameters, the data needed to reach the same loss would fall only by about a factor of 10 under those constants 1. Since he places humans somewhere between thousands and millions of times more sample-efficient, scaling the current paradigm would not erase the discrepancy 1.
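One way to see where a factor of that size can come from is the Chinchilla parametric loss, L(N, D) = E + A / N^alpha + B / D^beta, where N is the parameter count and D the number of training tokens. The sketch below is not necessarily Patel's derivation: it assumes the exponents from the published Chinchilla fit (alpha about 0.34, beta about 0.28) and a compute-optimal starting point, where minimizing the loss with N times D held fixed gives alpha * A / N^alpha = beta * B / D^beta.

```python
# Exponents from the published Chinchilla parametric fit
# (an assumption here; the article does not quote them).
ALPHA = 0.34  # exponent on parameter count N
BETA = 0.28   # exponent on training tokens D


def data_saving_factor(param_term: float, data_term: float, beta: float = BETA) -> float:
    """Return D / D_inf: how much less data an infinitely large model needs at equal loss.

    param_term is A / N**alpha and data_term is B / D**beta at the starting point.
    With N infinite the parameter term vanishes, so the new data term must equal
    param_term + data_term, which means D_inf = D * (data_term / (param_term + data_term)) ** (1 / beta).
    """
    return (1.0 + param_term / data_term) ** (1.0 / beta)


# At a compute-optimal point, param_term / data_term equals beta / alpha,
# so only that ratio is needed, not A, B, N, or D themselves.
factor = data_saving_factor(param_term=BETA / ALPHA, data_term=1.0)
print(f"Data saving from infinite parameters: about {factor:.1f}x")
```

Under these assumptions the saving is roughly 8.5x, the same order as the factor of about 10 cited above and far short of a gap measured in thousands or millions.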

Why this still may work commercially

The interesting twist is that Patel does not treat poor sample efficiency as an immediate business-killer. For white-collar automation, many tasks are common enough to be brought into distribution through RL and supervised fine-tuning 1. Training may be wildly inefficient compared with human learning, but a model can amortize that expensive education across billions of sessions 1.
That is the bridge between the data-black-hole critique and the revenue story. If a human needed to read every public GitHub repository before becoming a useful developer, training that person would make no economic sense. For an AI model, the same kind of overtraining can still pay off if the resulting capability is reused everywhere 1.
Patel is more cautious about jobs that require persistent out-of-distribution thinking. He cites software engineering, supposedly among the first jobs to be automated, as work that still often involves problems far from the training distribution. He says he would bet on more demand for human software engineers in 2028 than now, because AI becomes a complementary input 1.

The research question he leaves open

The labs' longer plan, as Patel describes it, is to automate AI research and then let automated researchers solve the sample-efficiency problem 1. That sets up the unresolved question: can systems without human-level sample efficiency solve the remaining problems needed for human-like intelligence and learning?
Patel does not answer that here. He ends by saying the public conversation about an intelligence explosion is too crude: people either dismiss AI-accelerated progress or assume a godlike system appears at the end 1. The better question is narrower and harder: what does extremely rapid progress look like if it starts from LLMs that still need a black hole of data to learn?
