Okay, a few things. First, I don't buy that there would be no economic value in releasing the larger Opus model and just serving it at a higher price. The only cost here is a potential risk of disappointment. But otherwise, this is always going to be profitable, as long as you don't think Opus 3 was served at a loss for Anthropic, for example.
I think you made a great point about expectations being potentially unmeetable for a non-reasoning LLM now that we have o3 to compare against: OpenAI would potentially have to release a model that is worse than an already released one on certain key benchmarks. Perhaps we are also not far from a merger between the reasoning-type models and the normal models, where the duration of reasoning can simply be dialed all the way down to normal inference.
Yeah, that's a possibility. But I believe Anthropic would be in bad shape if they were forced to raise prices too high for Opus 3.5 because the costs of inference are unaffordable. They could always find a compromise between demand, remaining competitive, and costs (i.e., running at the same kind of loss they are now), but I think the ROI is much greater if they simply don't release the model. That's why they're using Opus 3.5 as a teacher model: it's the better financial and business decision.
You nailed it. Even if it is hypothetical, it is the most likely explanation of what is happening. Not to mention that an AGI could start experimenting on some free services like social networks, not as chatbots but in more subtle ways (algorithm or architecture improvements, new insights, etc.). We might only be seeing the tip of the iceberg of something much bigger and incredible.
Thank you Nicolas. I believe we're in a new era completely
Indeed Alberto.
For what it's worth, I wrote an audio article largely inspired by your analysis. It is intended to take the reflection further.
https://youtu.be/sNWwiwvNCUs?si=6ZkA_Giqm065tl4t
Not sure how comparable it is, but this sort of thing happened with DNA sequencing machines (a super-dominant player slowing its release pace because there was no need to cannibalize its own market quickly). It worked for about 10 or 15 years.
Very interesting. Although what's different here is that the slower pace is only on the public side. The private side is probably going faster than ever!
I assume that you value Robert Wright and his guests. I heard the same inspiring thought from Tim B Lee in their conversation last week. Old school journalist focus. Good source, if true.
Really?? Nice. I didn't get that. I guess the evidence is there and you just have to look in the right direction!
You mentioned using a large model to "distill" a smaller model, but I would point out that even with existing technology you could give a model more test-time compute to train a model that uses less test-time compute.
The move, then, is to use enormous amounts of both test-time compute and model size, and distill the results into smaller, usable models.
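For what it's worth, here is a minimal sketch of what that kind of distillation looks like in practice, assuming a standard teacher-student setup; the hyperparameters and the loop structure are illustrative, not anything the labs have confirmed they use.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of classic knowledge distillation: a small "student" is
# trained to match the output distribution of a frozen "teacher" (a larger
# model) in addition to the ground-truth labels. The test-time-compute variant
# described above would instead fine-tune the student on answers the teacher
# produced with a long reasoning budget (sequence-level distillation), but the
# spirit is the same: expensive compute at data-creation time, cheap at inference.

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Training-loop sketch (teacher is frozen, used only for inference):
# for batch in dataloader:
#     with torch.no_grad():
#         teacher_logits = teacher(batch.input_ids)
#     student_logits = student(batch.input_ids)
#     loss = distillation_loss(student_logits, teacher_logits, batch.labels)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```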
Yep. That's definitely a possibility. They're using o1/o3 kind of models for the next generation for sure. I just didn't want to make the story too complicated!
Spooky!
I actually did kind of think that I’d get to be an early adopter of AI and I’d finally be ahead of the curve on a world changing invention. (Still kicking myself about that $20 of Bitcoin I misplaced decades ago.)
But yeah, makes more sense they’d use it to achieve escape velocity.
Good thing I’ve been saying “I for one welcome our robot overlords” on Reddit since like 2005.
Whether you're objectively right or wrong, this is compelling because we have to assume it was the natural progression. You can't defy the laws of economics (i.e., capitalism): if you had free energy, would you release it to the world, or distill it into derivative products that were better than the current ones but left a pathway for growth, selling it at what the market would bear? Brilliant work Alberto!
This is a great thought experiment, even just for lay people.
Wow. Most interesting piece thus far on here. Thanks Alberto. I guess AGI will be writing this soon LOL. Really enjoy your work as always.
Thanks John! Thankfully, I think I have a comparative advantage here: AGI is best used elsewhere and having an audience is one of the few edges that will remain in a post-AGI world
Frank Herbert would disagree.
This is such a fascinating deep dive into the behind-the-scenes of AI development! The idea that GPT-5 might already exist but is being kept internal for distillation and cost control makes a lot of sense, especially given the parallels with Anthropic’s Opus 3.5. It’s wild to think that the most advanced models might never see the light of day, instead serving as ‘teacher models’ to power the ones we actually use. If true, it changes how we think about the AI race—less about public releases and more about hidden, recursive self-improvement. Exciting and slightly unsettling at the same time!
Right! But we will see better models nevertheless! The reasoning models, distilled one after another, will see the light. It's the super big base models that won't make sense to deploy anymore
It’s almost like we’re entering an era where the most powerful AI systems are like black boxes within black boxes—training each other in ways we can’t fully see or comprehend. Exciting, but also a little mind-bending!
Very interesting. I thought I would have to put on my tin foil hat for this article, but it made a lot of sense. OR maybe I never really put down the tin foil hat...
Thanks Thomas. I would think the same with that headline haha but the story deserved it!
I'm into more charitable takes, so here are my two cents:
I think both Opus 3.5 and GPT-5 finished training well, and yes, they were used for distillation and not served directly. But I think the main reason for not serving them to users is that it would eat up too much of the compute that is better spent on generating more synthetic data. Synthetic data improves the overall quality of models beyond the models that generated it, thanks to techniques like rejection sampling (sketched below).
If they had more GPUs, these models would've seen the light of day by now.
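To make the rejection-sampling point above concrete, here is a toy sketch of what such a synthetic-data loop typically looks like. The `generate` and `verify` functions are placeholders I'm assuming for illustration, not anything from OpenAI's or Anthropic's actual pipelines.

```python
import random

# Toy rejection-sampling loop for synthetic data: sample several candidate
# answers from a strong model, keep only the ones a verifier accepts, and use
# the survivors as training data for the next model. `generate` and `verify`
# are stand-ins for a real model call and a real checker (unit tests, a grader
# model, exact-match against a known answer, etc.).

def generate(prompt: str) -> str:
    # Placeholder: in practice this would be a call to the teacher model.
    return f"candidate answer to: {prompt} ({random.random():.2f})"

def verify(prompt: str, answer: str) -> bool:
    # Placeholder: in practice a programmatic check or grader model.
    return random.random() > 0.7  # pretend ~30% of samples pass

def rejection_sample(prompt: str, k: int = 16) -> list[tuple[str, str]]:
    accepted = []
    for _ in range(k):
        answer = generate(prompt)
        if verify(prompt, answer):  # discard samples that fail the check
            accepted.append((prompt, answer))
    return accepted

synthetic_dataset = []
for prompt in ["prove that 17 is prime", "sort this list in O(n log n)"]:
    synthetic_dataset.extend(rejection_sample(prompt))
print(f"kept {len(synthetic_dataset)} accepted samples for fine-tuning")
```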
Right. But that's exactly what I said. That's why the "contingent on costs" part is so important. If OpenAI were 10x richer, they'd do this with GPT-6 instead of GPT-5. And so on. The thing is that they have finite money but infinite ambition to keep improving their AIs. Once the ambition surpasses the amount of cash you can spend on inference, you keep the models private.
I don't think it's just the money. Even with all the money in the world, nobody, not even OpenAI, can get all the GPUs they want or need.
I'm using money as a proxy for compute. If they had more GPUs, they'd deploy GPT-5, but then the same would happen with GPT-6. What I mean is that yes, compute is a constraint, but that's not the point: it would have happened sooner or later anyway, because inference is much more expensive than training.
"But training? A piece of cake"
I disagree with the notion that a 100 trillion parameter model would be a piece of cake to train.
I'm wondering if you misspoke there, or maybe I'm missing something? At least in the context of this quote, it seems like you're referring to the 100-trillion-parameter model mentioned by the EpochAI person.
Even if you just mean 100 trillion total parameters for an MoE with far fewer active parameters, a model with the same MoE activation sparsity as GPT-4 would still end up with around 15 trillion active parameters, and that would require roughly 80M H100s' worth of compute to train in a typical ~3-month run with Chinchilla-optimal scaling. It's generally agreed by the best analysts in the space that no cluster of that size exists yet; the newest clusters, only built in the past few months, are around 10-15x the scale of GPT-4.
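For readers who want to see where a figure like that comes from, here is a back-of-the-envelope version using the standard C ≈ 6·N·D approximation for training FLOPs. The 30% utilization figure and the choice to scale the token count off the total (rather than active) parameter count are my assumptions to roughly reproduce the ~80M number; different assumptions move the answer by an order of magnitude.

```python
# Back-of-the-envelope check of the "~80M H100s" figure, using the standard
# C ≈ 6 * N_active * D approximation for training FLOPs. The 30% utilization
# and the Chinchilla-style ~20 tokens per TOTAL parameter are assumptions made
# here for illustration; changing them shifts the result substantially.

total_params  = 100e12             # 100T total parameters (EpochAI's illustrative number)
active_params = 15e12              # ~15T active, assuming GPT-4-like MoE sparsity
tokens        = 20 * total_params  # ~2,000T training tokens

train_flops = 6 * active_params * tokens  # ~1.8e29 FLOPs

h100_peak = 989e12                 # H100 dense BF16 peak, FLOP/s
mfu       = 0.30                   # assumed model FLOPs utilization
seconds   = 90 * 24 * 3600         # ~3-month run

gpus_needed = train_flops / (h100_peak * mfu * seconds)
print(f"training FLOPs ~ {train_flops:.1e}")
print(f"H100s needed   ~ {gpus_needed / 1e6:.0f} million")  # on the order of ~80M
```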
Also this comment: "Aren’t they risking their reputation by repeatedly delaying the model?"
Maybe I'm missing something with this too, but I can't recall a time when OpenAI "delayed GPT-5", or even a time when an OpenAI employee said it was coming within a certain timeframe and it didn't. The gap from GPT-3 to GPT-4 was 33 months, and it hasn't even been 24 months since GPT-4 was released, so even on that timeline it seems unfounded to say GPT-5 is late or delayed. And if you extrapolate the longer-term trend across the release dates of each GPT generation, you get an even longer expected gap for GPT-5 of ~60 months, or about 5 years.
It was a bit of an exaggeration to emphasize the contrast between training and inference. Training a 100T model is anything but a piece of cake! It shouldn't be taken literally.
I think it's worth keeping in mind, however, that it's very unlikely OpenAI already has the scale of compute needed to train such a model. It was only around 6 months ago that Microsoft is said to have built the world's first cluster with a hypothetical GPT-4.5 scale of compute, multiple times smaller than what you'd need for the model scale you were describing.
To be generous to your main point, though, I think a GPT-4.5-scale model would still be fairly significant, just like GPT-3.5 to 4 was, or GPT-3 to 3.5. And I think it's likely that some distillation from such a recent GPT-4.5-scale model is being used in the process of distilling down to smaller models (for example, they might be working on releasing a significant GPT-4o update soon alongside a GPT-4.5 model, or maybe replacing GPT-4o with a GPT-4.5-mini), just like how GPT-4o-mini was developed and became a cheaper and better replacement for GPT-3.5-turbo. However, it seems like they still plan to release this newly developed internal GPT-4.5-scale model: multiple people on Twitter recently found that there are now several references to a "GPT-4.5-preview" in ChatGPT's code.
Also keep in mind that GPT-4o was released before these GPT-4.5-scale models likely even started training, and the current GPT-4o model doesn't seem much better, imo, than the GPT-4o model that came out back in May. So the odds of GPT-4o having benefited significantly from even a new, unreleased internal GPT-4.5-scale model seem low to me; the gains seem more likely to come from improved training techniques. Likely just general training tweaks and improvements that let the model squeeze more out of the same limited training data while keeping the parameter count relatively small, and/or distillation from various domain-specialist systems/models internally.
I agree. But the 100T number was just an illustrative example from EpochAI. I very much doubt GPT-5 is that big; perhaps it's not even a tenth of that (5-10T). They used the number to show that current hardware makes it possible.
Yes. Keeping all factors equal, historically it has been around a 100x compute scale difference between full GPT generation leaps. If MoE sparsity is kept the same and Chinchilla-optimal parameter scaling continues, then a GPT-4.5-scale model would be around 1T active params with 6T total params, and a GPT-5-scale model would have ~3T active params with 20T total params.
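A quick sketch of that arithmetic, in case it helps: under Chinchilla-optimal training (data scaled with parameters), compute grows roughly with the square of the parameter count, so every ~10x jump in compute is about a sqrt(10) ≈ 3.2x jump in parameters. The GPT-4 baseline figures used here (~280B active / ~1.8T total) are the commonly cited estimates, not confirmed numbers.

```python
import math

# Chinchilla-optimal scaling: C ~ 6 * N * D with D proportional to N, so
# C ~ N^2 and parameters grow with the square root of the compute multiplier.
# Baseline GPT-4 figures below are widely cited estimates, not official numbers.

gpt4_active, gpt4_total = 0.28e12, 1.8e12  # ~280B active / ~1.8T total params

def scaled_params(compute_multiplier):
    factor = math.sqrt(compute_multiplier)  # N ~ sqrt(C) under Chinchilla scaling
    return gpt4_active * factor, gpt4_total * factor

for label, mult in [("GPT-4.5 (~10x GPT-4 compute) ", 10),
                    ("GPT-5   (~100x GPT-4 compute)", 100)]:
    active, total = scaled_params(mult)
    print(f"{label}: ~{active / 1e12:.1f}T active / ~{total / 1e12:.0f}T total params")
# -> roughly 0.9T/6T and 2.8T/18T, i.e. the ~1T/6T and ~3T/20T figures above
```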
That GPT-5 scale still implies a cluster that isn't expected to be built until around mid-2025 at least, over 10x more compute than current runs. Of course, it's always possible they just name this currently training GPT-4.5-scale model "GPT-5" while an even smaller model is what actually gets called "GPT-4.5", haha, but I think that'd be a bit risky for PR unless they've really made some great breakthroughs in scaling efficiency, significantly beyond even DeepSeek V3. Like I mentioned earlier, there is recent evidence suggesting GPT-4.5-preview is already getting ready to be rolled out. It's always possible they're doing some weird naming alongside big algorithmic breakthroughs in scaling, who knows. I suppose we'll see soon enough what they plan to roll out.
Perhaps o3, which is following on so quickly after o1, is the reasoning model based on full GPT-5 as opposed to o1, which would be based on 4o or another distillation of GPT-5. Clearly the usage fee for o3 will be much higher, justifying the higher inference cost of running full GPT-5.
That's a possibility. I wonder if they will announce o4 soon, keeping the pace set between o1 and o3. In that case it can't be a different base model each time, because those take longer to train. I believe they're really stepping on the gas here.
I would imagine that if one had a large frontier model for training smaller models via RLAIF, tuning each for a specific kind of preferred output (reasoning, deduction, conceptualization, cross-domain work, etc.), then one would end up with many different models that could be stitched together into a comprehensive MoE. Why stop there? One could do the same with models for other modalities: speech, vision, physics and movement, video, music. Then do more cross-domain training. Once put together as a super-MoE, one would have their ASI.
Right. Not sure about ASI but AGI? Certainly. I don't think we will be able to keep pace even merely as observers.
Alberto, great post and thanks. What do you think this means for US government efforts, via the recently released AI Diffusion rule, to control global AI data center compute capacity by setting a threshold on frontier-model training compute? Does this make sense given the situation you describe at the leading AI labs? And what did you make of DeepSeek's claim that it used a much smaller number of GPUs to train its V3 model than the leading AI labs?
I'm not sure the AI Diffusion rule says anything about a threshold on model training? As far as I know, that was the executive order and then SB 1047, which Governor Newsom vetoed. But in any case, the US government will be focused on the geopolitics of AI this year and onward. China is catching up, and OpenAI has a close relationship with Washington (like big tech does). The US government will get its hands on anything OpenAI does before we do; I don't think that should surprise anyone. About DeepSeek, I believe it's possible. They're doing great, and we should stop pretending that China is behind the US. It is not. In strict innovation terms it might be (OpenAI and DeepMind are older than the rest), but perhaps not for long.
Is this what Google is doing too? It’s incredible how Gemini 2.0 Flash is even better than Claude 3 Opus was, even though it is way way smaller
I'm sure Google DeepMind has something like it as well, yes. They're not behind OpenAI by any meaningful metric