What if I told you that GPT-5 is real? Not just real, but already shaping the world from somewhere you can’t see. Here’s the hypothesis: OpenAI built GPT-5 but is keeping it internal because the return on investment is far greater than it would be from releasing it to millions of ChatGPT users. What’s more, the ROI they’re getting isn’t money but something else. As you can see, the idea is simple enough; the challenge is connecting the breadcrumbs that lead to it. This article is a deep dive into why I believe it all adds up.
Let me be clear: this is pure speculation. The evidence is public, but there are no leaks or insider rumors that confirm I’m right. In fact, I am building the theory with this post, not just sharing it. I don’t have privileged information—if I did, I’d be under an NDA anyway. The hypothesis feels compelling because it makes sense. And honestly, what more do I need to give the rumor mill a spin?
It’s up to you to absolve me. Even if I’m wrong—which we’ll find out eventually—I think it’s a fun detective exercise. I invite you to speculate in the comments but keep it constructive and thoughtful. And please, read the whole post first. Beyond that, all debate is welcome.
I. The mysterious disappearance of Opus 3.5
Before going into GPT-5, we have to pay a visit to its distant cousin, also missing in action: Anthropic’s Claude Opus 3.5.
As you know, the top three AI labs—OpenAI, Google DeepMind, and Anthropic—offer a range of models designed to span the price/latency vs. performance spectrum. OpenAI provides options like GPT-4o and GPT-4o mini, as well as o1 and o1-mini. Google DeepMind offers Gemini Ultra, Pro, and Flash, while Anthropic has Claude Opus, Sonnet, and Haiku. The goal is clear: to cater to as many customer profiles as possible. Some prioritize top-tier performance, no matter the cost, while others seek affordable, good-enough solutions. So far, so good.
But something strange happened in October 2024. Everyone was expecting Anthropic to announce Claude Opus 3.5 as a response to GPT-4o (launched in May 2024). Instead, on October 22 they released an updated version of Claude Sonnet 3.5 (that people started to call Sonnet 3.6). Opus 3.5 was nowhere to be found, seemingly leaving Anthropic without a direct competitor to GPT-4o. Weird, right? Here’s a chronological breakdown of what people were saying and what actually happened with Opus 3.5:
On October 28, I wrote this in my weekly review post: “[There are] rumors that Sonnet 3.6 is . . . an intermediate checkpoint of a failed training run on the much-anticipated Opus 3.5.” Also on October 28, a post appeared on the r/ClaudeAI subreddit: “Claude 3.5 Opus has been scrapped,” with a link to the Anthropic models’ page where, as of today, there’s no mention of Opus 3.5. Some speculated that the removal was a strategic move to preserve investors' trust ahead of an upcoming funding round.
On November 11, Anthropic CEO Dario Amodei killed the rumors on the Lex Fridman podcast when he denied they had dropped Opus 3.5: “Not giving you an exact date, but as far as we know, the plan is still to have a Claude 3.5 Opus.” Cautious and ambiguous, yet valid.
On November 13, Bloomberg weighed in, confirming the earlier rumors: “After training it, Anthropic found 3.5 Opus performed better on evaluations than the older version but not by as much as it should, given the size of the model and how costly it was to build and run.” It seems Dario refrained from giving a date because, although the Opus 3.5 training run hadn’t failed, its results were underwhelming. Note that the emphasis is on cost relative to performance, not performance alone.
On December 11, semiconductor expert Dylan Patel and his Semianalysis team delivered the final plot twist, presenting an explanation that weaves all the data points into a coherent story: “Anthropic finished training Claude 3.5 Opus and it performed well, with it scaling appropriately . . . Yet Anthropic didn’t release it. This is because instead of releasing publicly, Anthropic used Claude 3.5 Opus to generate synthetic data and for reward modeling to improve Claude 3.5 Sonnet significantly, alongside user data.”
In short, Anthropic did train Claude Opus 3.5. They dropped the name because it wasn’t good enough. Dario, confident a different training run could improve results, avoided giving a date. Bloomberg confirmed the results were better than existing models but not enough to justify the inference costs (inference = people using the model). Dylan and his team uncovered the link between the mysterious Sonnet 3.6 and the missing Opus 3.5: the latter was being used internally to generate synthetic data to boost the former’s performance.
We have something like this: Opus 3.5, trained but kept internal, quietly distilling its strength into the Sonnet 3.6 that Anthropic actually shipped.
II. Better but also smaller and cheaper?
The process of using a powerful, expensive model to generate data that enhances the performance of a slightly less capable, cheaper model is known as distillation. It’s a common practice. This technique allows AI labs to improve their smaller models beyond what could be achieved through additional pre-training alone.
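To make the mechanics concrete, here’s a minimal sketch of one common flavor of distillation (soft-label knowledge distillation) in PyTorch. The teacher and student sizes, the temperature, and the random data are purely illustrative assumptions, not anything Anthropic or OpenAI has disclosed:

```python
import torch
import torch.nn.functional as F
from torch import nn

# Purely illustrative "teacher" (big) and "student" (small); real labs distill
# frontier-scale transformers, but the mechanics are the same.
teacher = nn.Sequential(nn.Linear(128, 1024), nn.ReLU(), nn.Linear(1024, 50_000))
student = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 50_000))
teacher.eval()

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
temperature = 2.0  # softens the teacher's distribution so the student gets richer signal

for step in range(100):
    x = torch.randn(32, 128)  # stand-in for a batch of input representations
    with torch.no_grad():
        teacher_logits = teacher(x)  # the "gold mine": the teacher's outputs
    student_logits = student(x)

    # The student is trained to match the teacher's softened output distribution.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key property is that the student learns by imitating the teacher’s output distribution rather than from extra human-labeled data, which is why a strong internal model becomes such a valuable data generator.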
There are various approaches to distillation (the sketch above shows only one), but we’re not getting into that. What you need to remember is that a strong model acting as a “teacher” turns “student” models from [small, cheap, fast] + weak into [small, cheap, fast] + powerful. Distillation turns a strong model into a gold mine. Dylan explains why it made sense for Anthropic to do this with the Opus 3.5-Sonnet 3.6 pair:
Inference costs [of the new Sonnet vs the old Sonnet] did not change drastically, but the model’s performance did. Why release 3.5 Opus when, on a cost basis, it does not make economic sense to do so, relative to releasing a 3.5 Sonnet with further post-training from said 3.5 Opus?
We’re back to cost: distillation keeps inference expenses low while boosting performance. This is an instant fix for the main issue Bloomberg reported. Anthropic chose not to release Opus 3.5 not because the results were poor but because the model is more valuable to them internally. (Dylan says that’s why the open-source community caught up to GPT-4 so quickly—they were taking gold straight from OpenAI’s mine.)
The most striking revelation? Sonnet 3.6 wasn’t just good—it was state-of-the-art good. Better than GPT-4o. Anthropic’s mid-tier model outperformed OpenAI’s flagship, thanks to distillation from Opus 3.5 (and probably other factors too; five months is a long time in AI). Suddenly, high cost stops being a reliable proxy for high performance.
Whatever happened to "bigger is better"? OpenAI’s CEO, Sam Altman, warned that was over. I wrote about it, too. Once top labs grew secretive, jealously guarding their prized knowledge, they stopped sharing the numbers. Parameter count ceased to be a reliable metric, and we wisely shifted our focus to benchmark performance. The last officially disclosed OpenAI model size was GPT-3 in 2020, with 175 billion parameters. By June 2023, rumors suggested GPT-4 was a mixture-of-experts model totaling around 1.8 trillion parameters. Semianalysis later corroborated this in a detailed assessment, concluding GPT-4 has 1.76 trillion parameters. This was July 2023.
It wasn’t until December 2024, a year and a half later, that Ege Erdil, a researcher at EpochAI, an organization focused on AI’s future impact, estimated that the current batch of leading AI models—among them, GPT-4o and Sonnet 3.6—is significantly smaller than GPT-4 (despite both outperforming GPT-4 across benchmarks):
. . . current frontier models such as the original GPT-4o and Claude 3.5 Sonnet are probably an order of magnitude smaller than GPT-4, with 4o having around 200 billion and 3.5 Sonnet around 400 billion parameters. . . . though this estimate could easily be off by a factor of 2 given the rough way I’ve arrived at it.
He explains in depth how he arrived at this number despite the labs not releasing any architectural details but, again, that’s unimportant to us. What matters is that the fog is clearing: both Anthropic and OpenAI seem to be following a similar trajectory. Their latest models are not only better but also smaller and cheaper than the previous generation. We know how Anthropic pulled it off: by distilling Opus 3.5 into Sonnet 3.6. But what did OpenAI do?
III. The forces pushing AI labs are universal
One might assume that Anthropic’s distillation approach was driven by unique circumstances—namely, an underwhelming training run for Opus 3.5. But the reality is that Anthropic’s situation is far from unique. Google DeepMind and OpenAI have both reported subpar results in their latest training runs. (Remember that subpar doesn’t equal a worse model.) The causes don’t matter to us: diminishing returns due to a lack of data, limitations inherent to the transformer architecture, a plateau in the pre-training scaling laws, etc. Whatever the case, Anthropic’s unique circumstances are actually quite universal.
But remember what Bloomberg reported: performance metrics are only judged good or bad relative to costs. Is this another shared factor? Yes, and Ege explains why: the surge in demand following the ChatGPT/GPT-4 boom. Generative AI’s popularity grew so quickly that the labs struggled to keep up, incurring mounting losses. This state of affairs urged all of them to cheapen inference (training runs are done once, but inference costs grow in proportion to the number of users and the amount of usage). If 300 million people use your AI product weekly, operational expenditures can suddenly kill you.
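A quick back-of-envelope shows the asymmetry. The only number carried over from above is the ~300 million weekly users; everything else is an assumption I’m making up for illustration:

```python
# Back-of-envelope (all numbers are illustrative assumptions, not OpenAI figures):
# inference spend scales with users x usage, while a training run is paid once.
weekly_users = 300_000_000          # the ~300M weekly users cited above
tokens_per_user_per_week = 20_000   # assumed average tokens generated per user
cost_per_million_tokens = 5.0       # assumed blended serving cost, in dollars

weekly_cost = weekly_users * tokens_per_user_per_week / 1e6 * cost_per_million_tokens
print(f"~${weekly_cost / 1e6:.0f}M per week, ~${weekly_cost * 52 / 1e9:.1f}B per year")
```

Halve the size of the model you serve and, roughly speaking, you halve that recurring bill, which is exactly the lever distillation pulls.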
Whatever drove Anthropic to distill Sonnet 3.6 from Opus 3.5 is affecting OpenAI several times over. Distillation works because it turns these two universal challenges into an advantage: you solve the inference-cost problem by serving people a smaller model, and you avoid the public backlash over underwhelming performance by not releasing the larger one.
Ege suggests OpenAI may have chosen an alternative approach: overtraining. The idea is to train a small model on more data than is compute-optimal: “when inference becomes a substantial or dominant part of your spending on a model, it’s better to . . . train smaller models on more tokens.” But overtraining is no longer feasible. AI labs have exhausted the high-quality data sources for pre-training. Elon Musk and Ilya Sutskever admitted as much in recent weeks.
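As a rough illustration of what overtraining means in practice, here’s a sketch built on the ~20-tokens-per-parameter rule of thumb from the Chinchilla scaling-law paper (Hoffmann et al., 2022). The parameter counts are stand-ins I made up for the example, not estimates of any real model:

```python
# Rough illustration of "overtraining" (numbers are stylized; the ~20 tokens/param
# heuristic comes from the Chinchilla scaling-law work, Hoffmann et al. 2022).
def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens  # standard approximation for dense transformers

# Compute-optimal baseline: a hypothetical 400B-parameter model on ~8T tokens.
budget = training_flops(400e9, 20 * 400e9)

# Spend the same compute on a model half the size by feeding it more tokens.
small_params = 200e9
small_tokens = budget / (6 * small_params)
print(f"{small_tokens / 1e12:.0f}T tokens")  # ~16T tokens for the 200B model

# Per-token inference cost scales roughly with parameter count, so the overtrained
# 200B model is about half as expensive to serve, if you can find the extra tokens.
```

Same training compute, roughly half the serving cost, provided you can actually find those extra trillions of high-quality tokens, which is precisely the wall the labs say they’ve hit.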
We’re back at distillation. Ege concludes: “I think both GPT-4o and Claude 3.5 Sonnet have likely been distilled down from larger models.”
Every piece of the puzzle so far suggests that OpenAI is doing what Anthropic did with Opus 3.5 (train and hide) in the same way (distillation) and for the same reasons (poor results/cost control). That’s a discovery. But wait, Opus 3.5 is still hidden. Where’s OpenAI’s analogous model? Is it hiding in the company’s basement? Care to venture a name...?
IV. He who blazes the trail must clear the path
I started this analysis by studying Anthropic’s Opus 3.5 story because it’s the one about which we have the most information. Then I traced a bridge to OpenAI through the concept of distillation and explained why the underlying forces pushing Anthropic are also pushing OpenAI. However, there’s a new wrinkle in our theory: because OpenAI is the pioneer, it may be facing obstacles that competitors like Anthropic have yet to encounter.
One such obstacle is the hardware required to train GPT-5. Sonnet 3.6 is comparable to GPT-4o, but it was released with a five-month lag. We should assume GPT-5 is on another level: more powerful and bigger, and more expensive not only to serve but also to train. We could be talking about a half-billion-dollar training run. Would such a thing even be possible with current hardware?
Ege to the rescue again: yes. Serving such a monster to 300 million people would be unaffordable. But training? A piece of cake:
In principle, even our current hardware is good enough to serve models much bigger than GPT-4: for example, a 50 times scaled up version of GPT-4, having around 100 trillion parameters, could probably be served at $3000 per million output tokens and 10-20 tokens per second of output speed. However, for this to be viable, those big models would have to unlock a lot of economic value for the customers using them.
Spending that kind of inference money, however, is not even justifiable for Microsoft, Google, or Amazon (the patrons of OpenAI, DeepMind, and Anthropic, respectively). So how do they solve this problem? Simple: They only need to “unlock a lot of economic value” if they plan to serve the several trillion-parameter model to the public. So they don’t.
They train it. They realize it “performs better than [their] current offerings.” But they have to accept it “hasn’t advanced enough to justify the enormous cost of keeping [it] running.” (Does that wording sound familiar? That’s The Wall Street Journal on GPT-5 a month ago. Uncannily similar to what Bloomberg said about Opus 3.5.)
They report underwhelming results (more or less accurately; they can always play with the narrative here). They keep the model internal as a large teacher that distills smaller student models. And then they release those. We get Sonnet 3.6 and GPT-4o and o1, and we’re more than happy that they’re cheap and quite good. Expectations for Opus 3.5 and GPT-5 remain intact even as our impatience grows. And their gold mine keeps on shining.
V. Surely, you’ve more reasons, Mr. Altman!
When I reached this point in my investigation, I was still unconvinced. Sure, all the evidence suggests this makes perfect sense for OpenAI, but there’s a gap between something being reasonable—even likely—and it being true. I won’t close that gap for you—this is, after all, just speculation. But I can further strengthen the case.
Is there any additional evidence that OpenAI operated this way? Do they have more reasons to withhold GPT-5 than subpar performance and mounting losses? What can we extract from public claims on GPT-5 by OpenAI executives? Aren’t they risking their reputation by repeatedly delaying the model? After all, OpenAI is the poster child of the AI revolution, while Anthropic operates in its shadow. Anthropic can afford to pull these moves, but OpenAI? Perhaps not for free.
Speaking of money, let’s dig up some relevant details about the OpenAI-Microsoft partnership. First, the fact that everyone knows: the AGI clause. OpenAI’s blog post on its structure lays out five governance provisions that delineate how the company functions and its relationship with the non-profit, the board, and Microsoft. The fifth clause defines AGI as “A highly autonomous system that outperforms humans at most economically valuable work” and determines that once the OpenAI board claims AGI has been attained, “Such a system is excluded from IP licenses and other commercial terms with Microsoft, which only apply to pre-AGI technology.”
Needless to say, neither company wants the partnership to break. OpenAI set this clause but will do whatever it takes to avoid having to abide by it. One way is to delay the release of any system that could be labeled AGI. “But GPT-5 is surely not AGI,” you will say. And I will say that here’s a second fact, one that pretty much nobody knows: OpenAI and Microsoft have a secret definition of AGI that, although irrelevant for scientific purposes, frames their partnership in legal terms: AGI is an AI system that “can generate at least $100 billion in profits.”
If OpenAI were hypothetically withholding GPT-5 under the pretext that it’s not ready, they would achieve one more thing besides cost control and preventing public backlash: they’d sidestep the need to declare whether it meets the threshold for being categorized as AGI. While $100 billion in profits is an extraordinary figure, nothing stops ambitious customers from making that much by building on top of it. On the other hand, let’s be clear: if OpenAI were forecasting $100 billion in annual recurring revenue from GPT-5, they wouldn’t mind triggering the AGI clause and parting ways with Microsoft.
Most public reactions to OpenAI not releasing GPT-5 rest on the hypothesis that they aren’t releasing it because it isn’t good enough. Even if that were true, no skeptic has stopped to think that OpenAI may have a better internal use for it than anything they’d get from releasing it. There’s a vast difference between creating an excellent model and creating an excellent model that can be served cheaply to 300 million people. If you can’t, you don’t. But also, if you don’t need to, you don’t. They were giving us access to their best models because they needed our data. Not so much anymore. They’re not chasing our money either. That’s Microsoft’s concern, not theirs. They want AGI and then ASI. They want a legacy.
VI. Why this changes everything
We’re nearing the end. I believe I’ve laid out enough arguments to make a solid case: OpenAI likely has GPT-5 working internally, just as Anthropic does with Opus 3.5. It’s even plausible that OpenAI never releases GPT-5 at all. The public now measures performance against o1/o3, not just GPT-4o or Claude Sonnet 3.6. With OpenAI exploring test-time scaling laws, the bar for GPT-5 to clear keeps rising. How could they ever release a GPT-5 that truly outshines o1, o3, and the incoming o-series models at the pace they’re producing them? Besides, they don’t need our money or our data anymore.
Training new base models—GPT-5, GPT-6, and beyond—will always make sense for OpenAI internally, but not necessarily as products. That might be over. The only goal that matters to them now is to keep generating better data for the next generation of models. From here on, base models may operate in the background, empowering other models to achieve feats they couldn’t on their own—like an old hermit passing down wisdom from a secret mountain cave, except the cave is a massive datacenter. And whether we meet him or not, we’ll all experience the consequences of his intelligence.
Even if GPT-5 is eventually released, that fact suddenly seems barely relevant. If OpenAI and Anthropic have truly set in motion a recursive self-improvement operation (albeit with a human still in the loop), then it won’t matter what they give us publicly. They’ll be pulling further and further ahead—like the universe expanding so fast that the light from distant galaxies can no longer reach us.
Perhaps that’s how OpenAI jumped from o1 to o3 in barely three months. And how they’ll jump to o4 and o5. It’s probably why they’re so excited on social media lately: they’ve implemented a new and improved modus operandi.
Did you really think approaching AGI would mean gaining access to increasingly powerful AIs at your fingertips? That they’d release every advancement for us to use? Surely, you don’t believe that. They meant it when they said their models would push them too far ahead for anyone else to catch up. Each new generation model is an engine of escape velocity. From the stratosphere, they’re already waving goodbye.
It remains to be seen whether they’ll return.