So much of this is so worthwhile! I wish I had a superduper amplifier. What most needs to be heard far and near is: "It might be a waste of compute that current AI models generate so many tokens when they think." Huang setting tokens as an ultimate goal? Doesn't sound right. It's in the category of "If all you have is a hammer...." Thanks for this piece of writing!
Yeah ... It makes sense conditional on the current nature of the technology. But that nature just sucks!
The possibility that Anthropic's revenue is being materially pushed up by Meta doing distillation for their latest model strikes me as a very big deal.
It is. And I think there's actually little doubt that's at least a factor given that Anthropic's run rate went up ~$10B in barely one month. That's crazy or maybe a crazy org doing something like Claudeonomics lol
There are many posts about Claudeonomics and, of course, also about Muse Spark, but I think you are the first one to draw a (distillation) line. Wild!
Beyond this, your reflections on tokenmaxxing - and the natural human shortcut instinct - teach a lesson about valid metrics of AI usage, in the AI labs and in other industries.
"More of the same" rarely leads to quality leaps; the abundance of AI-generated output should rather trigger a "less is smarter" incentivization.
I was lucky that I was writing about it the moment Meta published it and the connection drew itself. I'm curious to see if others agree or if it's just nonsense haha
Thanks for an insightful read. This on the heels of the NM and CA jury decisions against Meta…one wonders what will change their course, if anything.
Losing will. And, right now, I don't see them winning
I found your essay very insightful, and I am struck by the parallels with Eugene Gendlin's Philosophy of the Implicit, particularly the notion of the "felt sense" (which embodies the implicit wisdom that one's body holds, and which is "more than words can express"). He also uses the term "direct referent", e.g., see his paper on "A Theory of Personality Change".
https://focusing.org/gendlin/docs/gol_2145.html
Quote:
--
For example, someone listens to you speak, and then says: "Pardon me, but I don't grasp what you mean." If you would like to restate what you meant in different words, you will notice that you must inwardly attend to your direct referent, your felt meaning. Only in this way can you arrive at different words with which to restate it.
In fact, we employ explicit symbols only for very small portions of what we think. We have most of it in the form of felt meanings.
For example, when we think about a problem, we must think about quite a number of considerations together. We cannot do so verbally. In fact, we could not think about the meaning of these considerations at all if we had to keep reviewing the verbal symbols over and over. We may review them verbally. However, to think upon the problem we must use the felt meanings—we must think of how "this" (which we previously verbalized) relates to "that" (which we also previously verbalized). To think "this" and "that," we employ their felt meanings.
--
Self-promo, if this is allowed, for my Focusing practice! https://implicitintricacy.net/
I wonder how common this is. I had a paragraph where I went to explain that I think most people think this way but I deleted it. Super interesting topic
At least, according to the Focusing paradigm (see e.g. https://focusing.org/sixsteps ), this is a capability that is available to essentially everyone, and the key is learning how to become aware of it and harness it explicitly. Would love to have the opportunity to discuss it further if you like!
I was musing about this a few days ago. LLM CoT reasoning really feels like a child or elderly person rambling on and on. Their brain isn't there, and them using words only sometimes helps them. If they could just reason inside, THEN speak, they could have better insights. I'll look more into this! I will point out that Nvidia and CUDA and the whole AI system are built on tokens. If SaaS is falling to AGass and tokens count, what replaces those when we shift in in-context?
Incredible piece; wish I wrote it myself.
Thanks Jurgen!
It's all very well to say that AI companies should be researching a new way, but there is no way to even guess at an ROI for that, whereas current approaches seem to follow a clear path of progress, albeit one of ever-diminishing returns. It is far more reasonable to hope that current LLMs will get smart enough to help with new AI research, or even make a breakthrough on their own, than to expect a significant, human-discovered new paradigm within a time frame of a year or two.
Yep, in practical/economic terms that's what makes sense. And that's what they do
Meta also hired some truly fantastic researchers (at great expense). Don’t discount their abilities and *crucially* the knowledge they brought from their previous roles, especially when given resources, and the fact that they would have lots of the data, training and eval setup in place, so they would not have been building from scratch. (That’s often the more time-consuming bit, compared to model architecture and code.)
I would have expected that their work was greatly accelerated by access to the best tools. Given my own experience with such capabilities, that feels sufficient given the talent pool and resources.
Yeah, didn't want to sound dismissive of that - to be clear, no other org in the world (probably) could have achieved this feat, distillation or not. And yet, I still don't think these other factors are enough given the pace of the other labs. I can't know if they distilled from Opus 4.6 but I wouldn't dismiss the idea. It has happened before
This is a point I've made to my students in discussing different concepts of intelligence.
Cognitive scientists distinguish "system 1" reasoning (fast, associative, automatic, implicit) and "system 2" reasoning (slow, verbal, difficult, conscious). System 2 is what we traditionally thought of as intelligence (because system 1 was so automatic and hidden) and that's where the world of computing started. Good Old Fashioned AI was all system 2.
Traditional neural nets are all system 1, and in the 2010s, when they had access to sufficient computing power, they started solving lots of the things that GOFAI couldn't do, like image recognition.
The big breakthrough was the transformer, which gave neural nets access to language. At first, this was all system 1 thinking - GPT-3 and ChatGPT-3.5 just blurt out whatever first comes to mind after all their training. But the reasoning model is a way to use their facility with language to get them some sort of system 2 intelligence.
But the difficult part is what comes next. Hubert Dreyfus pointed out in his 1970s and 1980s critiques of GOFAI that (what we now call) system 2 reasoning isn't what experts do very often - it's what you use to teach someone a new skill, but after some practice, it becomes (what we now call) system 1.
Modern reasoning-based LLM systems get their pre-training for system 1, but can't learn to do anything that wasn't in their pre-training - they just have to think it through in words every time. I don't think people are unaware of this as a limitation (Dwarkesh Patel made a big deal of it last year talking about "continual learning": https://www.dwarkesh.com/p/timelines-june-2025 ). It's just that this really is fundamental to something built on a token-predictor.
Maybe this was always LeCun's reason to think LLMs were the wrong track. But it wasn't the reason people like Emily Bender and Gary Marcus had - they were usually criticizing the entire idea of basing it on neural nets.
In an earlier draft I made the effort to disambiguate here: with sensations I'm not referring to system 1 as understood by Kahneman! It's all system 2. I just think that the system 2 is more complex and less linguistic than we are used to thinking.
I stand with Ludwig Wittgenstein in the Tractatus Logico-Philosophicus: "The limits of my language mean the limits of my world"
To be fair, I think most of those sensations need to have been language some time in the past. The only way to have a "hill" in the topology of thought is to have gone there and created it painstakingly with language. Then you can pass by and say: oh sure that thing. But not before language gives it shape (this could have gone into the essay but it was already way too long haha)
This trend is similar in China's AI industry. The current generation of AI architectures is using compute and long intermediate reasoning chains to compensate for the model’s underlying weaknesses in abstraction, compression, planning, and stable reasoning. But the more important question is why the industry cannot yet move away from this approach. The reality is that token-level chain-of-thought, or similar forms of step-by-step reasoning, remains dominant because it is currently trainable, verifiable, distillable, billable, and deployable. Companies do not prioritize the most elegant architecture. They prioritize the architecture that can be shipped, monetized, and benchmarked most effectively.
In that sense, today’s AI industry is not merely mistaking tokens for outcomes. Under the existing business model, it has to turn tokens into outcomes. Whether in API pricing, cloud compute consumption, inference procurement, or even internal engineering evaluation, tokens are the easiest unit to quantify, settle, and absorb organizationally. As long as the industry’s revenue model, cost model, and incentive structure are all built around tokens, even people who know this is not the optimal long-term path will continue to pour more resources into it. That is also why Meta ended up with internal token leaderboards, and why Nvidia has linked token consumption to engineer productivity.
The whole thing is absurd… but when you sell tokens it’s so convenient to reward the people using the most …
That would make sense if Anthropic was rewarding Meta's engineers or if Meta was rewarding its users but this is Meta rewarding its own engineers; it's not making money out of it but losing it!
Perhaps in this case, pointing toward Goodhart’s Law is itself an example of Goodhart’s Law.
I think graphs like this are not representative of capability; the others took longer because the hardest part is to pave the way, not to follow it. They all converge to the same Intelligence Index level, so it does not imply innovation at this stage. But this can get interesting!
My pre-language intuition is that LeCun is right. That's not to say the LLM "thinking" approach won't produce world-changing impacts that we need to reckon with. But it definitely feels like brute force rather than elegance. Is the elegant approach possible outside of flesh and blood humans? Or is that, perhaps, what truly differentiates (some of) us? With LLMs have we merely reproduced that part of our intelligence that is out of touch with anything beyond neurotic thought loops? Many humans rely on that level of language-based thinking almost exclusively, and create powerful real life impacts as a result. Is it efficient? Healthy? Probably not. But it's no less real and impactful.
Totally agree with your first two sentences. Thinking he's fundamentally correct doesn't take away from the obvious value of LLMs
"Is the elegant approach possible outside of flesh and blood humans?"
That is the question, and I don't think we know the answer. The idea that a Large Language Model could think and reason in the abstract without using language seems wrong on its face. But clearly Yann LeCun thinks it can. I guess we'll see what happens.
I think LeCun is exploring routes other than LLM for that very reason.
Yet, for all his bluster, LLMs have *finally* produced something compelling in AI (we've silently crossed the Rubicon of the Turing test... and it's just another Tuesday). It's hard not to look at LeCun and see a dinosaur from the long AI winter who is bitter he was wrong about much. And maybe he is right, and this is a local maximum, but I can't understand people who are suggesting we shouldn't climb the best and only gradient we've found so far.