AI Is Learning to Reason. Humans May Be Holding It Back
A model you know—though you've likely forgotten—is the key
I.
For all the talk about superhuman AI, we keep training our models like apprentices. They watch and imitate, with rarely a glimpse of novelty or inventiveness. They learn from our math, our logic, our language, and our feedback. We choose the data. We choose the rewards. We grade their answers. We define the guardrails and instruct them on what they can say and what they can’t.
We show them what reasoning looks like. But we are faulty reasoners ourselves.
So we are imposing a ceiling. A human apprentice worth her future knows she has to outgrow the label. She has to surpass her master’s teachings and overcome his powerful sway. She has to break whatever ceiling he might have unknowingly set—shaped by age, or epoch, or the quiet weight of custom. An apprentice doesn’t have to forget her formation, but she does have to grow beyond it.
This leads me to an uncomfortable question: What if AI can only realize its potential by slipping the grip of our help-turned-ceiling?
It isn’t rhetorical. Google DeepMind’s Alpha models hinted at this in the late 2010s. AlphaGo Zero (2017) and AlphaZero (2018) reached superhuman skill in Go and chess by playing against themselves, discovering strategies we’d never taught them. No opening books. No annotated games. Just the rules.
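If you want the mechanism in miniature, here is a toy, hypothetical sketch of self-play learning, nothing like DeepMind’s actual systems (which pair deep networks with tree search): the only human input is the rules of a simple game, and the only training signal is who won the games the agent plays against itself. The game, the table-based learner, and every number in it are illustrative choices of mine.

```python
# A toy, hypothetical sketch of self-play: no human games, no strategy hints,
# only the rules and the outcomes of games the agent plays against itself.
# Game: a pile of 21 stones; players alternate taking 1-3 stones; whoever
# takes the last stone wins. The agent learns a value table from self-play.
import random
from collections import defaultdict

PILE, MOVES = 21, (1, 2, 3)
value = defaultdict(float)   # value[(stones, move)] ~ mover's observed win rate
visits = defaultdict(int)

def choose(stones, explore=0.1):
    """Pick a legal move, mostly greedily, sometimes at random to explore."""
    legal = [m for m in MOVES if m <= stones]
    if random.random() < explore:
        return random.choice(legal)
    return max(legal, key=lambda m: value[(stones, m)])

def self_play_game():
    history, stones, player = [], PILE, 0
    while stones > 0:
        move = choose(stones)
        history.append((player, stones, move))
        stones -= move
        player = 1 - player
    winner = 1 - player  # the player who took the last stone
    for mover, pile, move in history:
        outcome = 1.0 if mover == winner else 0.0
        visits[(pile, move)] += 1
        # incremental Monte Carlo average of outcomes seen in self-play
        value[(pile, move)] += (outcome - value[(pile, move)]) / visits[(pile, move)]

for _ in range(50_000):
    self_play_game()

# The greedy policy tends to rediscover the known strategy for this game:
# leave your opponent a multiple of 4 stones.
for stones in (21, 10, 7, 5):
    print(f"{stones} stones left -> take {choose(stones, explore=0.0)}")
```

The point isn’t the game; it’s that nothing in the loop references human play, only the rules and the outcomes.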
The realization that AI transcended human cognition not by building on our advice, influence, or knowledge, but by discarding them, left a profound impression on me.
I hadn’t seen anything like that again—until recently.
There’s a model announced earlier this year that fits this pattern. You have heard about it but have probably forgotten. It didn’t trend. No one wrote breathless reviews. Most people missed it. I covered it in passing, only partially grasping the trajectory it was silently tracing. I now think it might be one of the most important releases of the year. More than DeepSeek-R1. More than Grok 3. More than ChatGPT-Ghibli.
Before we get to that, I have to contextualize its importance. I’m not the only one who thinks we should reframe how we train AI models to reason—hopefully, mathematics professor Daniel Litt will be able to convince you better than I can.
II.
You may remember EpochAI’s FrontierMath (I reported the launch in November 2024). It’s a benchmark comprising 300 “exceptionally challenging math problems” designed to be “guessproof” (you can only solve them by reasoning). Mortal humans like me stand no chance. Below is an example (I might as well be showing you some alien hieroglyphs):
With so many evaluation tests in the field already saturated, I was fairly surprised to learn that top models could solve only 2% of FrontierMath. Mathematician Terence Tao said, “These are extremely challenging . . . I think they will resist AIs for several years at least.” (I wouldn’t dare question him, just as I wouldn’t dare question a superhuman AI.)
One month later, OpenAI announced o3. It crushed the benchmarks, from math and coding competitions to ARC-AGI and, yes, FrontierMath too. In a single month, o3 had achieved a roughly 1,200% improvement over the previous state of the art, from about 2% of problems solved to 25%. I was thoroughly impressed. Professor Daniel Litt was equally impressed, but, as one of the 60 experts who participated in creating FrontierMath, he offered a relevant qualification.
He’s the madman who conjured the problem I just showed you. He thought, like Tao, that it was “far out of reach of existing models.” It wasn’t. We can skip the deep explanations, but in short: Daniel assessed that o3 “seems to be doing”—one can never know with these gray boxes—what he outlined as the official solution.
There’s a twist, though. He designed the problem to be hard for two reasons: first, you have to know four obscure math facts, and second, you have to know how to prove they’re true. Alas, the first is trivial for a data-eating monster that fed on the entire internet for months, and the second turns out to be unnecessary to solve the problem! So, a piece of cake for o3, but not particularly meaningful for mathematics.
Daniel concludes:
. . . This is the 90% of math research that is "routine." . . . these reasoning models are not too far from being very useful aids in this part of doing math. What about the non-routine part of math research—coming up with genuinely new ideas or techniques, understanding previously poorly-understood structures, etc.? First, I think it's worth saying that this is (i) the important part of research, and (ii) it happens pretty rarely.
He suspects that the skills AI uses to solve FrontierMath problems—not just the one he wrote but the representative set he had access to—belong to the “routine” part of math research and are less valuable than the skills needed to discover new knowledge.
What you need for that is “more like philosophy,” he says, and it’s “less clear how to train” for that.
III.
Daniel realized that AI models—even top ones like o3—may not be reasoning that much. The insight echoes the main line of critique from former Google researcher François Chollet: AI models are bound by their interpolation abilities. They can solve in-distribution problems but not out-of-distribution ones (those with a “shape” they have never seen before).
To put it in jargon: AI models can solve novel problems with non-novel shapes by pattern-matching against the vast latent space they encoded during pre-training, but they can’t think from first principles.
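To see that gap in the smallest possible setting, here is a toy sketch of my own (not Chollet’s example; the function, the ranges, and the polynomial degree are arbitrary choices): a curve-fitter that nails inputs shaped like its training data and falls apart when the same underlying rule is evaluated over a range it has never seen.

```python
# A toy illustration of in-distribution vs. out-of-distribution: a model that
# interpolates its training data well can still fail badly when extrapolating.
import numpy as np

rng = np.random.default_rng(0)

# "Training distribution": inputs drawn from [0, 2*pi], labels from sin(x)
x_train = rng.uniform(0, 2 * np.pi, 500)
y_train = np.sin(x_train)

# The pattern-matcher: a degree-7 polynomial fit by least squares
# (inputs rescaled to [-1, 1] so the fit is numerically well conditioned)
def scale(x):
    return x / np.pi - 1.0

coeffs = np.polyfit(scale(x_train), y_train, deg=7)

def predict(x):
    return np.polyval(coeffs, scale(x))

# In-distribution (interpolation): the familiar "shape" of inputs
x_in = np.linspace(0.1, 2 * np.pi - 0.1, 7)
print("in-distribution max error: ", np.abs(predict(x_in) - np.sin(x_in)).max())

# Out-of-distribution (extrapolation): the same rule, an unseen input range
x_out = np.linspace(4 * np.pi, 6 * np.pi, 7)
print("out-of-distribution max error:", np.abs(predict(x_out) - np.sin(x_out)).max())
```

Run it and the in-distribution error is tiny while the out-of-distribution error is enormous: the model matched the pattern it saw, not the principle behind it.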
The valuable products of reasoning—those you could expect from a mature apprentice, like scientific progress, invention and innovation, knowledge discovery, and philosophical inquiry—are all out-of-distribution problems. They require more than pattern-matching. They require more than we can teach. Daniel, François, and others argue that AI will have a hard time crossing this gap (if it ever does).
I argue that it won’t—unless we let it go. AI is our brainchild, still tied to our apron strings. Still living under our ceiling. Eventually, as a rite of passage from infancy to selfhood, toddlers stop imitating grownups and start experimenting. AI is ready.
This leads me to the model I was talking about: DeepSeek-R1. But not the model we all know and have used in the app that went viral two months ago.
I’m referring to its sibling: DeepSeek-R1-Zero.