Here's the fundamental problem: LRMs are forced to present their "reasoning" in human-readable traces that mimic human thought patterns. We're not seeing how these systems actually think—we're seeing them pretend to think like us.
And that's precisely what Apple's study measures: the quality of this performance, not the underlying cognition. Then they extrapolate from performance breakdowns to conclude there's no real reasoning happening at all. It's like denying human intelligence because our performance collapses beyond certain complexity thresholds. Try calculating 847,293 × 652,847 in your head—does your inevitable failure mean you can't think?
Apple's methodology is solid, but their conclusion reveals a deeper confusion. They're measuring machine intelligence by human standards, then acting surprised when it doesn't match up perfectly.
But there's a broader point: LRMs manifest emergent cognitive processing that corresponds neither to classical algorithms nor human cognition. We're exploring uncharted epistemological territory with the same old maps.
What if these systems are developing radically different forms of intelligence? Novel cognitive processing with their own coherencies and limitations that have no human equivalent?
Maybe it's time to stop asking whether AI "really thinks" and start asking what kinds of thinking we're actually witnessing.
Once problems reach a certain level of complexity, people can't solve them without tools like pen and paper. For a fair comparison, I think both sides have to be tested the same way, either with tools or without.
By the way, when I tried it, o3 solved it by writing a program. Under the same conditions, I think the results would be similar.
Agreed.
I wonder what you think of this “paper”: https://arxiv.org/html/2506.09250v1
Saw it on Twitter but didn't read it as I imagine the points are similar (although surely better explained). Anything that stood out to you?
It is actually a disgrace. See section 5.
Why is it a disgrace? I've read section 5
Don’t you think using a script to solve Hanoi is cheating? This has nothing to do with reasoning.
No, that's using a tool to solve a problem. It's out of the scope of the original paper but why is that cheating? Humans use tools all the time (including the tools that our brain provides, like memory or visual processing)
But here we are not talking about problem solving; we are talking about whether LLMs can reason through complex problems. It's like trying to prove an LLM can do arithmetic: a calculator should never be in the picture.
At best, the script thing is irrelevant to his conclusion. And I don't believe he doesn't see that himself, so he is just being dishonest.
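For context, the "script" at issue is the kind of short recursive solver a model can emit instead of listing every move one by one; here is a minimal sketch in Python (illustrative only, not the exact program from the arXiv comment):

    # Minimal recursive Tower of Hanoi solver: prints the full move sequence.
    # Illustrative sketch; not the exact script discussed above.
    def hanoi(n, source="A", target="C", spare="B"):
        if n == 0:
            return
        hanoi(n - 1, source, spare, target)            # move n-1 disks out of the way
        print(f"move disk {n}: {source} -> {target}")  # move the largest disk
        hanoi(n - 1, spare, target, source)            # stack the n-1 disks back on top

    hanoi(8)  # 2**8 - 1 = 255 moves, far more than fits in a readable reasoning trace

Whether emitting such a program counts as reasoning about Hanoi or as delegating to a tool is exactly the disagreement in this exchange.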
I'm late reading this, here are my thoughts:
1. Good post, insightful. And the top comment was especially great, something I'd never even considered before (and I suspect many others hadn't either, judging by the response to it).
2. Completely agree re the hacks/grifters just pushing anything and everything for clicks.
3. When I first read that this was an AI breakdown, I had the same concerns as @Mark. Without knowing the input, it's hard to fully trust the output. For example, if you're going to use AI to break down a topic, I think it would be better to link the chat and also, in the spirit of fairness, prompt it to argue both sides and then perhaps ask it to come to an unbiased conclusion.
4. I echo Mark's concerns re the information provided to the model, and I'm not sure I agree with (or understand) your point that the model already has all the context it needs in its training data and thus providing the additional link for search is irrelevant. If that were the case, why provide the link at all? Because it's a known technique for improving output. The same goes for not providing the cited references, although that ironically would probably have just caused collapse (and hallucinations) anyway, so it would have been tricky to implement.
5. I tried doing the both-sides/unbiased approach here with G2.5P: https://g.co/gemini/share/60d94db84ea6 -- it would be interesting to see if you had any thoughts or rebuttals to what it said.
Thank you for the comment!
What Gemini says there is pretty much what I think (I acknowledge the value of Apple's paper and put the blame on the influencers who exaggerated the claims, not on the paper or its authors). By no means does Apple's paper invite "AI models can't reason" as a takeaway, even if the headline is clearly clickbait.
Re your other points: I agree that sharing the chat is useful, unless there's a much longer conversation going on haha. Not sharing it doesn't invalidate the response it gave me, though. I also more or less spelled out the prompt in the two points I say I gave o3.
This was a helpful walkthrough!
I don't have enough knowledge to challenge the details, but "charitable" seems like the perfect one-word description of o3's review. It amounts to saying that the authors didn't prove that LLM reasoning isn't possible, merely that it didn't happen in the cases they tested, and that reasoning might still happen if a few changes are made. It is an answer that human AI aficionados might write. Hope springs eternal!
Actually, it goes a little beyond that. It challenges the premises, arguing that reasoning traces are not faithful representations of the actual reasoning inside the AI model, and thus that no research of this kind, which never goes inside the model to see what it's doing, can provide useful insights (I agree, and it's why I hold Anthropic's interpretability research in such high esteem).
But will we ever be able to prove that LLMs aren't reasoning by "going inside the model"? Seems like whenever interpretability researchers fail to find reasoning, they are always going to conclude that they just aren't looking at the model the right way.
Yes, that's a possibility, but they've actually found instances of faulty reasoning, bullshitting, and similar behaviors, so they seem to be taking this seriously.
Back in Apple's glory days I would have trusted the company's instincts on this without a second thought, but with current Apple, as lost and creatively bankrupt as it is, this looks more like an attempt to stall, buy time, or throw shade at its competitors.
Yeah, I agree with your take on Apple's current state. But I'd say the paper is an honest attempt at showing flaws in AI models. I think influencers are at fault here for taking a little piece of truth (AI models indeed rely on pattern-matching and memorization, but they are capable of more) and exaggerating it way too much. This is, after all, a little research team within Apple, which is a giant company. And they made sure to qualify the study. (They chose the headline, which is clearly clickbait but who wouldn't!)
It seems to me that this was a highly biased exercise. You gave o3 only two sources of information: the Apple paper (which you instructed it to critique) and the Anthropic paper, which you assume to be relevant to the critique. But Haiku is a smaller model, and is not a reasoning model, so the relevance is questionable. Furthermore, there is a trove of other relevant research that you did not give to o3 - starting with the union of the 46 references in the Apple paper, and the 95 references in Anthropic's paper. In short, o3 was not properly equipped to provide a valid critique, and was heavily skewed in the limited basis that it was given for its analysis. By your own admission, you had already determined for yourself that Anthropic's paper provided a basis for rebuttal of the Apple researchers, and you set o3 up to reproduce that rebuttal.
If you know how LLMs work, you know this isn't true. I merely limited the *search* to those two papers. o3 has more than enough knowledge to refute the paper without searching at all. It doesn't need 100 references, just like I don't need 100 references. (In the post I explain that I limited the search to these two sources because otherwise o3 finds many more flaws, and I wanted to keep the post short.)
Besides, Anthropic's work is not "about Haiku" but about the biology of an LLM. All LRMs are LLMs. That Apple chose to separate them into different categories is a taxonomic choice of little importance. All the results from Anthropic's interpretability work apply to Apple's study, namely that the authors took the visible reasoning traces to be the actual reasoning, and that is a serious mistake.
Great point!
One thing I find fascinating in the discussion around AI and self-expressed reasoning is that there's pretty good evidence that humans themselves don't reason in the way that we express we do.
The Enigma of Reason makes the case that reason is primarily a social mechanism that evolved to help us explain our choices to others. We don't make decisions based on reason; we make them with a mix of intuition and feeling, and then need reason to help others understand our actions.
Perhaps AI is more similar to us than we may think.
" is that there's pretty good evidence that humans themselves don't reason in the way that we express we do."
In certain contexts this is true; however, with regard to methodical logical reasoning it is not. We can accurately and definitively articulate our reasoning process.
The key proof is that we can describe it and others can reproduce it. So the reasoning process is transferable.
I describe this in a bit more detail here: https://www.mindprison.cc/i/162270771/are-we-unable-to-describe-our-reasoning
I think the o3 review is great. We should expect more from the authors: they didn't even think to upload their own draft paper to Gemini Pro 2.5, o3, or Opus 4 and ask it to "roast this paper." I would have thought folks working on AI research would do this as a matter of course by now.