Okay, a few things. First, I don't buy that there would be no economic value in releasing the larger Opus model and just serving it at a higher price. The only cost here is a potential risk of disappointment. But otherwise, this is always going to be profitable, as long as you don't think Opus 3 was served at a loss for Anthropic, for example.
I think you made a great point about expectations being potentially unmeetable for a non-reasoning LLM now that we have o3 to compare against: OpenAI would potentially have to release a model that is worse than an already released one on certain key benchmarks. Perhaps we are also not far from a merger between the reasoning-type models and the normal models, where the duration of reasoning can simply be dialed all the way down to normal inference.
Yeah, that's a possibility. But I believe Anthropic would be in bad shape if they were forced to raise prices too high for Opus 3.5 because the costs of inference are unaffordable. They could always find a compromise between demand, remaining competitive, and costs (i.e., running at the same kind of loss they are now), but I think the ROI is much greater if they simply don't release the model. That's why they're using Opus 3.5 as a teacher model: it's the better financial and business decision.
You nailed it. Even if it is hypothetical, it is the most likely explanation of what is happening. Not to mention that an AGI could start experimenting on some free services like social networks, not as chatbots but in more subtle ways (algorithm or architecture improvements, new insights, etc.). We might only be seeing the tip of the iceberg of something much bigger and incredible.
Thank you Nicolas. I believe we're in a new era completely
Indeed Alberto.
For what it's worth, I wrote an audio article largely inspired by your analysis. It is intended to take the reflection further.
https://youtu.be/sNWwiwvNCUs?si=6ZkA_Giqm065tl4t
Not sure how comparable it is, but this sort of thing happened with DNA sequencing machines (a super-dominant player slowing its release pace because there was no need to cannibalize its own market quickly). It worked for about 10 or 15 years.
Very interesting. Although what's different here is that the slower pace is only on the public side. The private side is probably going faster than ever!
I assume that you value Robert Wright and his guests. I heard the same inspiring thought from Tim B Lee in their conversation last week. Old school journalist focus. Good source, if true.
Really?? Nice. I didn't get that. I guess the evidence is there and you just have to look in the right direction!
You mentioned using a large model to "distill" a smaller model, but I would point out that even with existing technology you could give a model more test-time compute to train a model that uses less test-time compute.
The move, then, is to use enormous amounts of both test-time compute and model size, and distill the results into smaller, usable models.
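For what it's worth, here is a minimal sketch of what that kind of distillation looks like in practice, assuming a standard teacher-student setup; the hyperparameters and the loop structure are illustrative, not anything the labs have confirmed they use.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of classic knowledge distillation: a small "student" is
# trained to match the output distribution of a frozen "teacher" (a larger
# model) in addition to the ground-truth labels. The test-time-compute variant
# described above would instead fine-tune the student on answers the teacher
# produced with a long reasoning budget (sequence-level distillation), but the
# spirit is the same: expensive compute at data-creation time, cheap at inference.

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Training-loop sketch (teacher is frozen, used only for inference):
# for batch in dataloader:
#     with torch.no_grad():
#         teacher_logits = teacher(batch.input_ids)
#     student_logits = student(batch.input_ids)
#     loss = distillation_loss(student_logits, teacher_logits, batch.labels)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```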
Yep. That's definitely a possibility. They're using o1/o3 kind of models for the next generation for sure. I just didn't want to make the story too complicated!
Spooky!
I actually did kind of think that I’d get to be an early adopter of AI and I’d finally be ahead of the curve on a world changing invention. (Still kicking myself about that $20 of Bitcoin I misplaced decades ago.)
But yeah, makes more sense they’d use it to achieve escape velocity.
Good thing I’ve been saying “I for one welcome our robot overlords” on Reddit since like 2005.
Whether you're objectively right or wrong, this is compelling because we have to assume it was the natural progression. You can't defy the laws of economics (i.e., capitalism): if you had free energy, would you release it to the world, or distill it into derivative products that were better than the current ones but left a pathway for growth, selling it at what the market would bear? Brilliant work Alberto!
This is a great thought experiment, even just for lay people.
Wow. Most interesting piece thus far on here. Thanks Alberto. I guess AGI will be writing this soon LOL. Really enjoy your work as always.
Thanks John! Thankfully, I think I have a comparative advantage here: AGI is best used elsewhere and having an audience is one of the few edges that will remain in a post-AGI world
Frank Herbert would disagree.
This is such a fascinating deep dive into the behind-the-scenes of AI development! The idea that GPT-5 might already exist but is being kept internal for distillation and cost control makes a lot of sense, especially given the parallels with Anthropic’s Opus 3.5. It’s wild to think that the most advanced models might never see the light of day, instead serving as ‘teacher models’ to power the ones we actually use. If true, it changes how we think about the AI race—less about public releases and more about hidden, recursive self-improvement. Exciting and slightly unsettling at the same time!
Right! But we will see better models nevertheless! The reasoning models, distilled one after another, will see the light. It's the super big base models that won't make sense to deploy anymore
It’s almost like we’re entering an era where the most powerful AI systems are like black boxes within black boxes—training each other in ways we can’t fully see or comprehend. Exciting, but also a little mind-bending!
Very interesting. I thought I would have to put on my tin foil hat for this article, but it made a lot of sense. OR maybe I never really put down the tin foil hat...
Thanks Thomas. I would think the same with that headline haha but the story deserved it!
I'm into more charitable takes, so here are my two cents:
I think both Opus 3.5 and GPT-5 finished training well, and yes, they were used for distillation and not served directly. But I think the main reason for not serving them to users is that it would eat up too much of the compute that is better spent on generating more synthetic data. Synthetic data improves the overall quality of models beyond the models that generated it, thanks to techniques like rejection sampling (sketched below).
If they had more GPUs, these models would've seen the light of day by now.
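To make the rejection-sampling point above concrete, here is a toy sketch of what such a synthetic-data loop typically looks like. The `generate` and `verify` functions are placeholders I'm assuming for illustration, not anything from OpenAI's or Anthropic's actual pipelines.

```python
import random

# Toy rejection-sampling loop for synthetic data: sample several candidate
# answers from a strong model, keep only the ones a verifier accepts, and use
# the survivors as training data for the next model. `generate` and `verify`
# are stand-ins for a real model call and a real checker (unit tests, a grader
# model, exact-match against a known answer, etc.).

def generate(prompt: str) -> str:
    # Placeholder: in practice this would be a call to the teacher model.
    return f"candidate answer to: {prompt} ({random.random():.2f})"

def verify(prompt: str, answer: str) -> bool:
    # Placeholder: in practice a programmatic check or grader model.
    return random.random() > 0.7  # pretend ~30% of samples pass

def rejection_sample(prompt: str, k: int = 16) -> list[tuple[str, str]]:
    accepted = []
    for _ in range(k):
        answer = generate(prompt)
        if verify(prompt, answer):  # discard samples that fail the check
            accepted.append((prompt, answer))
    return accepted

synthetic_dataset = []
for prompt in ["prove that 17 is prime", "sort this list in O(n log n)"]:
    synthetic_dataset.extend(rejection_sample(prompt))
print(f"kept {len(synthetic_dataset)} accepted samples for fine-tuning")
```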
Right. But that's exactly what I said. That's why the "contingent on costs" part is so important. If OpenAI were 10x richer, they'd do this with GPT-6 instead of GPT-5. And so on. The thing is that they have finite money but infinite ambition to keep improving their AIs. Once the ambition surpasses the amount of cash you can spend on inference, you keep the models private.
I don't think it's just the money. Even with all the money in the world, nobody, not even OpenAI, can get all the GPUs they want or need.
I'm using money as a proxy for compute. If they had more GPUs, they'd deploy GPT-5, but then the same would happen with GPT-6. What I mean is that yes, compute is a constraint, but that's not the point: it would have happened sooner or later anyway, because inference is much more expensive than training.
"But training? A piece of cake"
I disagree with the notion that a 100 trillion parameter model would be a piece of cake to train.
I'm wondering if you misspoke there, or maybe I'm missing something? At least in the context of this quote, it seems like you're referring to the 100-trillion-parameter model mentioned by the EpochAI person.
Even if you just mean 100 trillion total parameters for an MoE with far fewer active parameters, a model with the same MoE activation sparsity as GPT-4 would still end up with around 15 trillion active parameters, and that would require roughly 80M H100s' worth of compute to train in a typical ~3-month run with Chinchilla-optimal scaling. It's generally agreed by the best analysts in the space that no cluster of that size exists yet; the newest clusters, only built in the past few months, are around 10-15x the scale of GPT-4.
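For readers who want to see where a figure like that comes from, here is a back-of-the-envelope version using the standard C ≈ 6·N·D approximation for training FLOPs. The 30% utilization figure and the choice to scale the token count off the total (rather than active) parameter count are my assumptions to roughly reproduce the ~80M number; different assumptions move the answer by an order of magnitude.

```python
# Back-of-the-envelope check of the "~80M H100s" figure, using the standard
# C ≈ 6 * N_active * D approximation for training FLOPs. The 30% utilization
# and the Chinchilla-style ~20 tokens per TOTAL parameter are assumptions made
# here for illustration; changing them shifts the result substantially.

total_params  = 100e12             # 100T total parameters (EpochAI's illustrative number)
active_params = 15e12              # ~15T active, assuming GPT-4-like MoE sparsity
tokens        = 20 * total_params  # ~2,000T training tokens

train_flops = 6 * active_params * tokens  # ~1.8e29 FLOPs

h100_peak = 989e12                 # H100 dense BF16 peak, FLOP/s
mfu       = 0.30                   # assumed model FLOPs utilization
seconds   = 90 * 24 * 3600         # ~3-month run

gpus_needed = train_flops / (h100_peak * mfu * seconds)
print(f"training FLOPs ~ {train_flops:.1e}")
print(f"H100s needed   ~ {gpus_needed / 1e6:.0f} million")  # on the order of ~80M
```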
Also this comment: "Aren’t they risking their reputation by repeatedly delaying the model?"
Maybe I'm missing something with this too, but I can't recall a time when OpenAI "delayed GPT-5", or even a time when an OpenAI employee said it was coming within a certain timeframe and it didn't. The gap from GPT-3 to GPT-4 was 33 months, and it hasn't even been 24 months since GPT-4 was released, so even on that timeline it seems unfounded to say GPT-5 is late or delayed. And if you extrapolate the longer-term trend across the release dates of each GPT generation, you get an even longer expected gap for GPT-5 of ~60 months, or about 5 years.
It was a bit of an exaggeration to emphasize the contrast between training and inference. Training a 100T model is anything but a piece of cake! It shouldn't be taken literally.
I think it's worth keeping in mind, however, that it's very unlikely OpenAI already has the scale of compute needed to train such a model. It was only around 6 months ago that Microsoft is said to have built the world's first cluster with a hypothetical GPT-4.5 scale of compute, multiple times smaller than what you'd need for the model scale you were describing.
To be generous to your main point, though, I think a GPT-4.5-scale model would still be fairly significant, just like GPT-3.5 to 4 was, or GPT-3 to 3.5. And I think it's likely that some distillation from such a recent GPT-4.5-scale model is being used in the process of distilling down to smaller models (for example, they might be working on releasing a significant GPT-4o update soon alongside a GPT-4.5 model, or maybe replacing GPT-4o with a GPT-4.5-mini), just like how GPT-4o-mini was developed and became a cheaper and better replacement for GPT-3.5-turbo. However, it seems like they still plan to release this newly developed internal GPT-4.5-scale model: multiple people on Twitter recently found that there are now several references to a "GPT-4.5-preview" in ChatGPT's code.
Also keep in mind that GPT-4o was released before these GPT-4.5-scale models likely even started training, and the current GPT-4o model doesn't seem much better, imo, than the GPT-4o model that came out back in May. So the odds of GPT-4o having benefited significantly from even a new, unreleased internal GPT-4.5-scale model seem low to me; the gains seem more likely to come from improved training techniques. Likely just general training tweaks and improvements that let the model squeeze more out of the same limited training data while keeping the parameter count relatively small, and/or distillation from various domain-specialist systems/models internally.
I agree. But the 100T number was just an illustrative example from EpochAI. I very much doubt GPT-5 is that big; perhaps it's not even a tenth of that (5-10T). They used the number to show that current hardware makes it possible.
Yes. Keeping all factors equal, historically it has been around a 100x compute scale difference between full GPT generation leaps. If MoE sparsity is kept the same and Chinchilla-optimal parameter scaling continues, then a GPT-4.5-scale model would be around 1T active params with 6T total params, and a GPT-5-scale model would have ~3T active params with 20T total params.
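A quick sketch of that arithmetic, in case it helps: under Chinchilla-optimal training (data scaled with parameters), compute grows roughly with the square of the parameter count, so every ~10x jump in compute is about a sqrt(10) ≈ 3.2x jump in parameters. The GPT-4 baseline figures used here (~280B active / ~1.8T total) are the commonly cited estimates, not confirmed numbers.

```python
import math

# Chinchilla-optimal scaling: C ~ 6 * N * D with D proportional to N, so
# C ~ N^2 and parameters grow with the square root of the compute multiplier.
# Baseline GPT-4 figures below are widely cited estimates, not official numbers.

gpt4_active, gpt4_total = 0.28e12, 1.8e12  # ~280B active / ~1.8T total params

def scaled_params(compute_multiplier):
    factor = math.sqrt(compute_multiplier)  # N ~ sqrt(C) under Chinchilla scaling
    return gpt4_active * factor, gpt4_total * factor

for label, mult in [("GPT-4.5 (~10x GPT-4 compute) ", 10),
                    ("GPT-5   (~100x GPT-4 compute)", 100)]:
    active, total = scaled_params(mult)
    print(f"{label}: ~{active / 1e12:.1f}T active / ~{total / 1e12:.0f}T total params")
# -> roughly 0.9T/6T and 2.8T/18T, i.e. the ~1T/6T and ~3T/20T figures above
```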
That GPT-5 scale still implies a cluster that isn't expected to be built until around mid-2025 at least, over 10x more compute than current runs. Of course, it's always possible they just name this currently training GPT-4.5-scale model "GPT-5" while an even smaller model is what actually gets called "GPT-4.5", haha, but I think that'd be a bit risky for PR unless they've really made some great breakthroughs in scaling efficiency, significantly beyond even DeepSeek V3. Like I mentioned earlier, there is recent evidence suggesting GPT-4.5-preview is already getting ready to be rolled out. It's always possible they're doing some weird naming alongside big algorithmic breakthroughs in scaling, who knows. I suppose we'll see soon enough what they plan to roll out.
Perhaps o3, which is following on so quickly after o1, is the reasoning model based on full GPT-5 as opposed to o1, which would be based on 4o or another distillation of GPT-5. Clearly the usage fee for o3 will be much higher, justifying the higher inference cost of running full GPT-5.
That's a possibility. I wonder if they will announce o4 soon, keeping the pace set between o1 and o3. In that case it can't be a different base model each time, because those take longer to train. I believe they're really stepping on the gas here.
I would imagine that if one had a large frontier model for training smaller models via RLAIF, tuning each for a specific kind of preferred output (reasoning, deduction, conceptualization, cross-domain work, etc.), then one would end up with many different models that could be stitched together into a comprehensive MoE. Why stop there? One could do the same with models for other modalities: speech, vision, physics and movement, video, music. Then do more cross-domain training. Once put together as a super-MoE, one would have their ASI.
Right. Not sure about ASI but AGI? Certainly. I don't think we will be able to keep pace even merely as observers.
Alberto, great post and thanks. What do you think this means for US government efforts, via the recently released AI Diffusion rule, to control global AI data center compute capacity by setting a threshold on frontier-model training compute? Does this make sense given the situation you describe at the leading AI labs? And what did you make of DeepSeek's claim that it used a much smaller number of GPUs to train its V3 model than the leading AI labs?
I'm not sure the AI Diffusion rule says anything about a threshold on model training? As far as I know, that was the executive order and then SB 1047, which Governor Newsom vetoed. But in any case, the US government will be focused on the geopolitics of AI this year and onward. China is catching up, and OpenAI has a close relationship with Washington (like big tech does). The US government will get its hands on anything OpenAI does before we do; I don't think that should surprise anyone. About DeepSeek, I believe it's possible. They're doing great, and we should stop pretending that China is behind the US. It is not. In strict innovation terms it might be (OpenAI and DeepMind are older than the rest), but perhaps not for long.
Is this what Google is doing too? It’s incredible how Gemini 2.0 Flash is even better than Claude 3 Opus was, even though it is way way smaller
I'm sure Google DeepMind has something like it as well, yes. They're not behind OpenAI by any meaningful metric