Having tried o1-preview since its release and steadily added it to my arsenal of LLM solutions, I think the future looks bright for OpenAI.
o1 isn't a replacement for GPT-4; it's a complement. o1's ability to reason and solve problems outperforms even Claude's latest model. However, it is time- and resource-intensive, and it returns so much content that iterating on results with it is really hard. So my workflow is to start an idea with o1, pass elements of its answer to GPT-4 for iteration, assemble and fix the whole answer, then send the whole thing back to o1 for re-evaluation. The results you get are far, far superior to any other LLM I have tried. Unfortunately, it's not so simple to use. But learning how to use it better has improved my efficiency and output, particularly in coding, problem-solving, and content generation.
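To make the workflow concrete, here is a minimal sketch of that loop using the official OpenAI Python client (the model names, prompts, and the `ask` helper are illustrative assumptions, not a prescribed setup):

```python
# Sketch of the o1 -> GPT-4 -> o1 loop described above.
# Assumes the `openai` Python package (v1+) and OPENAI_API_KEY in the environment;
# model names and prompts are placeholders for illustration.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    """Send a single user prompt to a model and return its reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# 1. Start the idea with o1: slow and verbose, but strong reasoning.
draft = ask("o1-preview", "Propose an approach for <your problem>, with reasoning.")

# 2. Iterate on pieces of the answer with GPT-4: faster and cheaper per round trip.
revised = ask("gpt-4o", f"Tighten, correct, and restructure this draft:\n\n{draft}")

# 3. Send the assembled answer back to o1 for re-evaluation.
review = ask("o1-preview", f"Critique this solution and list any remaining flaws:\n\n{revised}")
print(review)
```

The point of splitting it this way is that o1's long, expensive responses bookend the process, while the cheaper, faster model handles the many small iterations in between.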
The fixation on benchmark improvements between model generations misses a crucial point: we've barely scratched the surface of what's possible with existing models. The shift toward test-time computation and reasoning described in the article points to a broader truth - perhaps the next breakthroughs won't come from raw model size, but from smarter deployment strategies, better interfaces, and more efficient architectures that prioritize real-world utility over benchmark scores.
Yeah, sounds about right
Gary Marcus has been short deep learning since at least 2013. He was wrong for more than a decade. Some day, he may be right. But he'll be right for the wrong reasons. The advances of deep learning have drained his pessimism of any authority.
Yep, this is a good way to frame it. (I believe his view that deep learning isn't enough is correct, but I also don't see anyone who denies that, and the degree to which it has worked cuts against his pessimism.)
Well, how reliable is The Information anyway?
Haha read the entire thing! (Also, quite reliable given its history of accurate scoops)
Like the time they said OpenAI was gonna go bankrupt by end of 2024?
Hmm I'm not sure they ever said that. They said OpenAI will have to raise more money if they want to keep going, which isn't wrong
Regarding the CoT used in o1, I wonder if it involves a lot of sequential workload and perhaps some branching as well. If so, what is the best hardware to run this test-time compute? Would it still be GPUs, given that these workloads might be hard to parallelize, especially if branching is involved?
Good question. As far as I know, GPUs have never been the most efficient hardware to do inference on. It just happens that Nvidia has a near monopoly on AI-suited chips. Companies can't risk having an inferior software stack, worse customer service, worse compatibility, less supply availability, or lower client preference for the next generation of chips, etc., just for a bit more optimization (also, there aren't many options that could really compete with Nvidia even on efficiency, but there are some, like Cerebras, Groq, or Tenstorrent). (Also, if you're really interested in hardware, you should subscribe to SemiAnalysis.)
Hi. I had already subscribed to SemiAnalysis. I feel that thinking from both the software and hardware angles gives me a better understanding of this.
On another note, can I ask whether a CoT implementation would also result in an increase in energy consumption? I mean, we are already talking about building nuclear reactors to power data centers.
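My rough intuition, just to frame the question (every number below is an assumption for illustration, not a measurement):

```python
# Back-of-envelope: chain-of-thought inflates per-query inference energy roughly in
# proportion to the extra tokens generated. All figures are assumed, not measured.
energy_per_token_joules = 0.3   # assumed energy per generated token on a GPU server
tokens_direct_answer = 500      # assumed length of a direct answer
tokens_with_cot = 5_000         # assumed length once hidden reasoning chains are counted

direct_j = tokens_direct_answer * energy_per_token_joules
cot_j = tokens_with_cot * energy_per_token_joules

print(f"direct: ~{direct_j:.0f} J/query, with CoT: ~{cot_j:.0f} J/query "
      f"(~{cot_j / direct_j:.0f}x)")
```

If that framing is roughly right, energy per query scales with how much longer the model "thinks", which seems hard to square with data centers that already need dedicated power plants.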
In reality, no exponential curve is truly exponential. Almost all of them end up becoming logistic curves.
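A quick way to see it, using the standard logistic form (here $L$ is the ceiling, $k$ the growth rate, and $t_0$ the midpoint, all arbitrary):

$$ f(t) = \frac{L}{1 + e^{-k(t - t_0)}} \approx L\, e^{k(t - t_0)} \quad \text{for } t \ll t_0, $$

so early on the curve is indistinguishable from an exponential with rate $k$, but it eventually saturates at $L$.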
100%. We'll see if the argument "put several S-curves together and you have an exponential" is true.
Stairway to Heaven. Or Hell. We'll see.
This is a good article for a reader to gauge their bias. Did they react when Alberto said Altman was wrong? What were they feeling and why?
I certainly reacted. I realised I was instinctively hoping Alberto would take the middle line, and was personally offended when he didn't. On reflection, I have a lot of hope in AI development to solve many of the problems that plague me and humanity, and sometimes when people critique its progress (particularly if I disagree with the take) my emotions light up.
Although not the direct intention of the article, it provided an interesting means of increasing awareness of my own beliefs and biases. Likely it will do the same for others. Thank you, Alberto.
Oh, but I did take the middle line, although perhaps in a subtle way. The last section is about that (also the third one: expensive and slow don't equate to worse or wrong).
Yes, but this is only clear in hindsight, once the full article is read. Emotional reactions are immediate. Your last section lowers the bar for people who had strong emotional reactions to later self-reflect and engage their frontal lobes.
Hope that's true! Wouldn't want anyone to react strongly and not get to the end (it happens but then I perhaps didn't really want those people to read my stuff in the first place)
To be honest, I respect you a lot, Alberto, but I'm sad to see you just accept "Gary is right." The answer is always in the middle. The next models don't solve the problems we expected, or the step isn't clear, so it's a much bigger product question to land these models in a way that keeps the narratives going. It's a problem for the companies, but it doesn't spell doom for scaling or for the approach. Decades from now, I don't think takes like this will age well.
Hey Nathan, if you read it in full, that can't possibly be your takeaway! Or perhaps I failed to convey what I wanted...
It's a mix. The o1 stuff was good. I think I also have a visceral reaction to people tipping their hat to Gary (most people building AI just completely block him). But even the GPT news is a bit overblown.
I did listen to the whole thing, but as I said, maybe I was triggered.
Anyways, thanks for the response. I'll try to be a more engaged follower; I've mostly just skimmed in the past.
It's well worth the full read, and Alberto's writing style will take you on a journey.
It looks like it is happening beyond OpenAI, too. Here is what Bloomberg is saying today:
OpenAI was on the cusp of a milestone. The startup finished an initial round of training in September for a massive new artificial intelligence model that it hoped would significantly surpass prior versions of the technology behind ChatGPT and move closer to its goal of powerful AI that outperforms humans. But the model, known internally as Orion, did not hit the company’s desired performance, according to two people familiar with the matter, who spoke on condition of anonymity to discuss company matters. As of late summer, for example, Orion fell short when trying to answer coding questions that it hadn’t been trained on, the people said. Overall, Orion is so far not considered to be as big a step up from OpenAI’s existing models as GPT-4 was from GPT-3.5, the system that originally powered the company’s flagship chatbot, the people said.
OpenAI isn’t alone in hitting stumbling blocks recently. After years of pushing out increasingly sophisticated AI products at a breakneck pace, three of the leading AI companies are now seeing diminishing returns from their costly efforts to build newer models. At Alphabet Inc.’s Google, an upcoming iteration of its Gemini software is not living up to internal expectations, according to three people with knowledge of the matter. Anthropic, meanwhile, has seen the timetable slip for the release of its long-awaited Claude model called 3.5 Opus.
As expected, the quality and quantity of data are the challenge, as they have already scraped what can be scraped or downloaded:
The companies are facing several challenges. It’s become increasingly difficult to find new, untapped sources of high-quality, human-made training data that can be used to build more advanced AI systems. Orion’s unsatisfactory coding performance was due in part to the lack of sufficient coding data to train on, two people said. At the same time, even modest improvements may not be enough to justify the tremendous costs associated with building and operating new models, or to live up to the expectations that come with branding a product as a major upgrade.
Gary is great at providing views at the extremes, but I do like to hear them; they give me perspectives to shape my own. I do think he's right, though: just being a model company doesn't work long term for me, as most users only scratch the surface of the power of foundation models, and we have a long runway before we find the depth and power of LLMs.
So where do they go? Reasoning sounds amazing, but reasoning about what? How can you take the power it provides and implement enterprise-scale solutions? How can reasoning rebuild the enterprise stack and help firms redo their business and operating models?
Being a computer scientist at heart and really loving new tech, I'm excited to see all these developments. But then I hit the reality and practicality of any tech. Sad to say, but foundation model power is far ahead of real production-scale applications. Hence, I'm trying to bridge the gap by bringing the power to the people!
Loved the ending!
Funny how fast things change in the AI world. I've been following Gary Marcus for a while. You did not mention others who are less obnoxious than him :-) Yann LeCun is in the same boat, I think, and also François Chollet. AGI is more than a souped-up LLM. It is tempting to think that real reasoning is an emergent property of scaling. That's what Ilya Sutskever seems to believe, and he is trustworthy for sure, but as you say, that's not a law of physics; it's an observation that we'd like to be true. Altman himself talks about "beliefs" and "bets". The end of your story was amazing though ;-)
I learned today that Anthropic named Claude after Claude Shannon, the father of information theory. He famously said, "I visualize a time when we will be to robots what dogs are to humans, and I'm rooting for the machines." He said this in ... 1984.
On the same side of California sits a very well-known AI engineer. He checks the results of running whatever the successor of o1 turns out to be on their updated benchmark and smiles: it would be a very, very long time, if ever, before LLMs get anywhere near human-level performance on ARC.
Sam Altman is a persistent liar by now.
Always has been a story for the investors. Plus some researchers have blind faith.
I don't think you've actually been paying attention...