I.
One overlooked benefit of being optimistic is that the rare times you offer a pessimistic view, people take you more seriously. This is one of those times.
A post by Dean Valentine, originally published on the Zeropath blog and later cross-posted to LessWrong, where it went viral under the title “Recent AI model progress feels mostly like bullshit,” offers a bleak perspective. The argument is, basically, that there’s a huge gap between the stated performance gains of AI models on benchmark tests and their competence in the real world. (This story surely rings a bell or two for long-time readers of this blog.)
In other words, recent AI progress is no such thing. Or, to borrow a phrase for any LessWrongers in the audience: AI progress has manifested on the map but not in the territory.
Everyone should be sharing Valentine’s post. That’s what I’d call a sincere effort at epistemic hygiene. But people are too busy watching Claude Sonnet and Gemini 2.5 get stuck playing Pokémon. Anyway, here’s what he says:
. . . in recent months I've spoken to other YC founders doing AI application startups and most of them have had the same anecdotal experiences: 1. o99-pro-ultra announced, 2. Benchmarks look good, 3. Evaluated performance mediocre. . . .
I have read the studies. I have seen the numbers. Maybe LLMs are becoming more fun to talk to, maybe they're performing better on controlled exams. But I would nevertheless like to submit, based off of internal benchmarks, and my own and colleagues' perceptions using these models, that whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality. . . . In terms of being able to perform entirely new tasks, or larger proportions of users' intellectual labor, I don't think they have improved much since August.
Valentine, a Y Combinator founder—so certainly not someone who can be easily labeled a skeptic or a pessimist—says the quiet part out loud. He qualifies it (“since August”), but I think the problem goes deeper and has been going on for longer.
I’ve touched on this in several of my more recent articles. With the help of math professor Daniel Litt (who participated in creating EpochAI’s FrontierMath challenge), I argued that reasoning AIs may not be reasoning all that much:
. . . AI models are bound by their interpolation abilities. They can solve in-distribution but not out-of-distribution problems (those with a “shape” they have never seen before). . . . [They] can parse novel problems from non-novel shapes by pattern-matching from the vast latent space they’ve encoded during pre-training, but they can’t think from first principles.
Scientists more knowledgeable than I am, like Professor Melanie Mitchell, have written in depth on this. The bottom line: the jury is still out, and prudence is your best friend when making any claims about AI’s reasoning capabilities.
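To make the in-distribution/out-of-distribution distinction concrete, here’s a toy sketch of my own (not Valentine’s or Litt’s): a simple curve-fitter trained on one region of a function looks impressive inside that region and falls apart outside it. The setup and numbers are purely illustrative; the pattern is the point.

```python
# Toy illustration: a model that interpolates well can still fail badly
# outside the region it was fit on. A polynomial is fit to sin(x) on
# [0, 2*pi] ("in-distribution") and evaluated on [2*pi, 4*pi]
# ("out-of-distribution"), where it diverges.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 2 * np.pi, 200)        # the training "distribution"
y_train = np.sin(x_train)

coeffs = np.polyfit(x_train, y_train, deg=7)    # stand-in for any curve-fitter

x_in = np.linspace(0, 2 * np.pi, 100)           # same region as training
x_out = np.linspace(2 * np.pi, 4 * np.pi, 100)  # region never seen in training

err_in = np.abs(np.polyval(coeffs, x_in) - np.sin(x_in)).mean()
err_out = np.abs(np.polyval(coeffs, x_out) - np.sin(x_out)).mean()

print(f"mean error in-distribution:     {err_in:.4f}")   # small
print(f"mean error out-of-distribution: {err_out:.1f}")  # blows up
```

None of this proves anything about LLMs, of course; it just shows what “a shape they have never seen before” means in the simplest possible setting.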
Prompted by the terrible scores of top models on ARC-AGI 2, I also criticized the disconnect between AI’s performance on benchmarks designed to be inherently hard for large language models and the official narrative (“AGI in two years”):
You can stretch [AI’s] top performance as much as you want—to superintelligence levels even—but if the bottom performance stays in the same place . . . then perhaps what you have is an illusion product of your inability to be a decent evaluator of AI’s real capabilities. We keep mistaking performance with competence, a lesson we should have already learned by now.
It’s not hard to see the absence of proof for what it is; that is, we know that we don’t know yet. We don’t know whether AI models reason, for any serious definition of “reasoning” (Mitchell offers one that I agree with here).
So I wonder why other people who, like me, see AI with optimism—who want it to succeed and not just make press headlines saying “AGI in two years!”—can’t come to terms with the fact that benchmarking is not everything and that we might not be as savvy evaluators of AI models as we’d like.
There are important flaws in our preferred approach to AI. They won’t be solved by saturating yet another benchmark or by hyping the timelines, claiming over and over and over that AGI is comiiiing in twoooo yeeeears.
Before you read on, a quick note. I write this newsletter in an attempt to make sense of AI. Not to evangelize it, nor to denounce it (those roles are already overcrowded) but simply to understand. And perhaps to offer that understanding to others, should they find themselves similarly disoriented.
This project continues thanks to a small group of generous readers who support it materially through the humble act of subscribing for ~$2/week—roughly the cost of a coffee, though it stays with you longer. If you’ve been reading and find value here—or simply wish for this quiet effort to persist—you are most welcome to join them.
And if you already have: my sincere thanks. This exists because of you.
II.
Valentine continues, suggesting two possible explanations for this annoying gap between performance and competence. One is plausible but controversial. The other, as far as I can tell, hits the nail on the head. Here’s the first one:
. . . maybe there's no mystery: The AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down. In a sense this would be the most fortunate answer, because it would imply that we're not actually that bad at measuring AGI performance; we're just facing human-initiated fraud. Fraud is a problem with people and not an indication of underlying technical difficulties. I'm guessing this is true in part but not in whole.
This might be part of it, as he says, but not in such stark terms. Companies face strong incentives to game benchmarks in ways that can be PR’ed as legal, even legitimate, but they face even stronger incentives not to commit fraud (whether blatantly or by mistake). They naturally want their models to actually be the best, if they can make them the best, not just to sugarcoat some slop product. Besides, I gladly assume people are conscientious and hardworking by default, so I wouldn’t resort to this explanation unless I had explicit evidence.