Inside the AI Industry's Most Expensive…

5 hrs ago

The absurdity of thinking in tokens

13 Comments

So much of this is so worthwhile! I wish I had a superduper amplifier. What most needs to be heard far and near is: "It might be a waste of compute that current AI models generate so many tokens when they think." Huang setting tokens as an ultimate goal? Doesn't sound right. It's in the category of "If all you have is a hammer...." Thanks for this piece of writing!

Alberto Romero

Yeah ... It makes sense conditional on the current nature of the technology. But that nature just sucks!

Holly Lazzaro

Thanks for an insightful read. This comes on the heels of the NM and CA jury decisions against Meta… one wonders what will change their course, if anything.

Alberto Romero

Losing will. And right now, I don't see them winning.

Francis

I found your essay very insightful, and I am struck by the parallels with Eugene Gendlin's Philosophy of the Implicit, particularly the notion of the "felt sense" (which embodies the implicit wisdom that one's body holds, and which is "more than words can express"). He also uses the term "direct referent", e.g., see his paper on "A Theory of Personality Change".

https://focusing.org/gendlin/docs/gol_2145.html

Quote:

For example, someone listens to you speak, and then says: "Pardon me, but I don't grasp what you mean." If you would like to restate what you meant in different words, you will notice that you must inwardly attend to your direct referent, your felt meaning. Only in this way can you arrive at different words with which to restate it.

In fact, we employ explicit symbols only for very small portions of what we think. We have most of it in the form of felt meanings.

For example, when we think about a problem, we must think about quite a number of considerations together. We cannot do so verbally. In fact, we could not think about the meaning of these considerations at all if we had to keep reviewing the verbal symbols over and over. We may review them verbally. However, to think upon the problem we must use the felt meanings—we must think of how "this" (which we previously verbalized) relates to "that" (which we also previously verbalized). To think "this" and "that," we employ their felt meanings.

Self-promo, if this is allowed, for my Focusing practice! https://implicitintricacy.net/

Alberto Romero

I wonder how common this is. I had a paragraph where I went on to explain that I think most people think this way, but I deleted it. Super interesting topic.

Francis

At least, according to the Focusing paradigm (see e.g. https://focusing.org/sixsteps ), this is a capability that is available to essentially everyone, and the key is learning how to become aware of it and harness it explicitly. Would love to have the opportunity to discuss it further if you like!

Ted

Perhaps in this case, pointing toward Goodhart’s Law is itself an example of Goodhart’s Law.

Apotheora

I think graphs like this are not representative of capability; the others took longer because the hardest part is to pave the way, not to follow it. They all converge to the same Intelligence Index level, so it does not imply innovation at this stage. But this can get interesting!

Jean-Paul Paoli

The whole thing is absurd… but when you sell tokens it’s so convenient to reward the people using the most…

Leon Liao

This trend is similar in China's AI industry. The current generation of AI architectures is using compute and long intermediate reasoning chains to compensate for the model’s underlying weaknesses in abstraction, compression, planning, and stable reasoning. But the more important question is why the industry cannot yet move away from this approach. The reality is that token-level chain-of-thought, or similar forms of step-by-step reasoning, remains dominant because it is currently trainable, verifiable, distillable, billable, and deployable. Companies do not prioritize the most elegant architecture. They prioritize the architecture that can be shipped, monetized, and benchmarked most effectively.

In that sense, today’s AI industry is not merely mistaking tokens for outcomes. Under the existing business model, it has to turn tokens into outcomes. Whether in API pricing, cloud compute consumption, inference procurement, or even internal engineering evaluation, tokens are the easiest unit to quantify, settle, and absorb organizationally. As long as the industry’s revenue model, cost model, and incentive structure are all built around tokens, even people who know this is not the optimal long-term path will continue to pour more resources into it. That is also why Meta ended up with internal token leaderboards, and why Nvidia has linked token consumption to engineer productivity.

Kenny Easwaran

This is a point I've made to my students in discussing different concepts of intelligence.

Cognitive scientists distinguish "system 1" reasoning (fast, associative, automatic, implicit) and "system 2" reasoning (slow, verbal, difficult, conscious). System 2 is what we traditionally thought of as intelligence (because system 1 was so automatic and hidden) and that's where the world of computing started. Good Old Fashioned AI was all system 2.

Traditional neural nets are all system 1, and in the 2010s, when they had access to sufficient computing power, they started solving lots of the things that GOFAI couldn't do, like image recognition.

The big breakthrough was the transformer, which gave neural nets access to language. At first, this was all system 1 thinking - GPT-3 and ChatGPT-3.5 just blurt out whatever first comes to mind after all their training. But the reasoning model is a way to use their facility with language to get them some sort of system 2 intelligence.

But the difficult part is what comes next. Hubert Dreyfus pointed out in his 1970s and 1980s critiques of GOFAI that (what we now call) system 2 reasoning isn't what experts do very often - it's what you use to teach someone a new skill, but after some practice, it becomes (what we now call) system 1.

Modern reasoning-based LLM systems get their pre-training for system 1, but can't learn to do anything that wasn't in their pre-training - they just have to think it through in words every time. I don't think people are unaware of this as a limitation (Dwarkesh Patel made a big deal of it last year talking about "continual learning": https://www.dwarkesh.com/p/timelines-june-2025 ). It's just that this really is fundamental to something built on a token-predictor.

Maybe this was always LeCun's reason to think LLMs were the wrong track. But it wasn't the reason people like Emily Bender and Gary Marcus had - they were usually criticizing the entire idea of basing it on neural nets.

Ricardo Acuña

I stand with Ludwig Wittgenstein from Tractatus Logico-Philosophicus: "The limits of my language mean the limits of my world."

The Algorithmic Bridge