GPT-5: OpenAI’s Flagship Model Faces Great Expectations

People don't like an AI that's dumb at times

Alberto Romero
Aug 04, 2025

[Image: Made with a mysterious new AI model]

GPT-5, OpenAI’s new flagship model, is coming out anytime now. Before it does, I want to share how I’m feeling about it and what I think we should expect. My intention with this article is to contextualize a product release that’s poised to be mistreated in all directions, through over-hype, under-hype, misinformation, etc. I have one important advantage—GPT-5 is not out yet. I can’t possibly disguise my opinion and pre-announcement impressions as fact! Read accordingly.

First: I’m excited. Perhaps unjustifiably so because I’ve been waiting for GPT-5 for years (GPT-4 came out in February 2023 with Microsoft’s Bing update and then officially in March 2023). But I want it—and expect it—to be good, and I want OpenAI to have a successful announcement. That’s what I genuinely feel, even though I’ve been pretty unforgiving toward the AI industry in my last few articles.

However, I expect people will react to it with unfair disappointment. I say “unfair” because GPT-5 is likely a good model (I haven’t tested it, but there have been unofficial leaks, and otherwise OpenAI would have taken more time if they didn’t think it was ready). And “disappointment” for two reasons:

  1. GPT-5 faces impossible expectations, which is the sole fault of the constant over-hype coming from OpenAI itself, in particular, and the AI community in general.

  2. GPT-5 will raise the ceiling of capabilities, which is hard to check firsthand for most people (unless you’re a world-class mathematician or something), but will show an embarrassing lack of updates at the floor level: hallucinations will still be too high, agentic capabilities will break down at the scope frontier and on edge cases, ARC-AGI 2 and 3 will go unsolved, benchmark results won’t materialize in real-world equivalent cases, unreliability will kill most attempts at integrating the technology in work processes, etc.

I’ll dig now into why GPT-5 can be both good and disappointing at the same time (which I think is happening with all AI models, not just GPT-5), but first, I want to clear up a few misconceptions and give credit where it’s due.




Important context before GPT-5 comes out

I’ll address here three ideas that are broadly misunderstood yet often taken for granted: 1) GPT-4.5 was a failure, 2) the scaling laws are over, and 3) OpenAI is falling behind. All three are at least partly false. Correcting them is key to better framing GPT-5’s release.

  1. GPT-4.5 was a failure

In a way, yes: GPT-4.5—which I reviewed during launch week—was expensive to train and is expensive to serve. It didn’t top the benchmarks. It’s slow and heavy and has the same pain points as the models that preceded it (hallucinations, unreliability, etc.). Being a base model, it was too raw for the public, but by far the hardest challenge OpenAI engineers faced with it—which The Information revealed for the first time last week—was imbuing the chat version of the model with strong reasoning capabilities.

Quick recap: In December 2024, OpenAI teased o3 with an incredible result for ARC-AGI. But the o3 version they released in April 2025 for ChatGPT was a worse model. It didn’t perform as well on ARC-AGI. Why? Transforming the original o3 (itself distilled from a larger unnamed model) into a version that could chat and people could interact with drastically reduced its capabilities (more than the standard downgrade caused by the distillation process itself!). Apparently, you kill AI’s genius when you force it to speak in human language because it’s inefficient for an AI that’s been taught to think in the latent space.

So, when OpenAI tried to create a big reasoning model (o3 is small-ish) that could chat and also had breadth of knowledge (that is, a better base model pre-trained on a lot of data + o3-style reasoning capabilities + usable on ChatGPT as a chatbot), they failed. This big model was intended to be GPT-5, but after it fell short of expectations, they had to settle for GPT-4.5, a non-reasoning model.

This combination of factors made GPT-4.5 a commercial failure (to the point that OpenAI preemptively warned they’d eventually take it down from the API, which they just did). But by no means was it a technical failure. (Learning empirically that models work better when you let them think in their own gibberish language is an amazing discovery! At most, GPT-4.5 was the by-product of a technical failure, not one itself.)

I keep saying that companies are better off keeping base models like GPT-4 and GPT-4.5 internal, as scaffolding to pre-train larger base models or to post-train them into reasoning models (or perhaps for B2B plans, so other companies can fine-tune them as they please). In January, I argued in favor of this idea in “This Rumor About GPT-5 Changes Everything,” where I wrote that OpenAI had already trained GPT-5 but wouldn’t release it because they could make better use of it internally. I was right—except the teacher model they use to distill smaller ones was not GPT-5, but an unnamed model they keep in the shadows. (Of course, OpenAI, like all other frontier AI labs, has better monsters in the basement than those they let us play with.)
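For readers unfamiliar with distillation: the standard recipe (Hinton et al., 2015) trains a smaller “student” model to imitate a larger “teacher” by matching its softened output distribution. The sketch below is a generic, minimal version of that loss in PyTorch; it’s illustrative only and says nothing about OpenAI’s actual teacher model or pipeline.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Generic soft-target distillation loss (Hinton et al., 2015).

    The student is pushed to match the teacher's softened next-token
    distribution; the T^2 factor keeps gradient magnitudes comparable
    across temperatures.
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 4 positions over a 10-token vocabulary.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
print(distillation_loss(student, teacher))
```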

Anyway, that’s the point I wanted to address: GPT-4.5’s status as a “failure” is not an indicator of how well or badly OpenAI is doing internally with GPT-5. So their excitement may be warranted. We just don’t know yet.

The Information also reported—take this with a grain of salt until you can try it yourself—that GPT-5 is better at math and coding, has better agentic capabilities, and has learned to adapt how much compute it spends to the task at hand (like a flexible o3/o4 reasoning model). OpenAI said they’d soon fix the model naming mess they have going on, and GPT-5’s unified design is probably how they plan to do it.
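How that “adapt compute to the task” behavior works mechanically isn’t public. Purely as a hypothetical illustration of the idea (none of these names or heuristics are OpenAI’s), you can picture a router picking a reasoning budget per prompt:

```python
def pick_reasoning_effort(prompt: str) -> str:
    """Hypothetical router: spend more test-time compute on prompts that
    look like multi-step problems, and answer casual queries directly."""
    hard_signals = ("prove", "debug", "optimize", "step by step", "integral")
    if any(signal in prompt.lower() for signal in hard_signals):
        return "high"    # long chain of thought, many sampled tokens
    if len(prompt.split()) > 100:
        return "medium"  # some deliberation
    return "low"         # direct answer, cheap and fast

print(pick_reasoning_effort("What's a good name for a black cat?"))  # low
print(pick_reasoning_effort("Prove that sqrt(2) is irrational."))    # high
```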

  2. The scaling laws are over

Last year, AI companies realized that making base models larger (e.g., GPT-4 to GPT-4.5) was yielding diminishing returns, and the press was quick to misreport this as “scale is over.” (I urge you to watch this talk by Ilya Sutskever at NeurIPS 2024 in December.) Instead, the focus shifted to scaling test-time compute (i.e., allowing models to think answers through), spearheaded by OpenAI with o1 and now adopted by every top lab. They proved that reinforcement learning combined with supervised fine-tuning was highly effective—especially in structured domains like math and coding, where well-defined, verifiable reward functions exist.
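To make the “verifiable reward” point concrete: in math or coding, a reward can be computed automatically by checking the model’s output against a known answer or a test suite, which is exactly what reinforcement learning needs. A toy, non-authoritative sketch:

```python
def verifiable_math_reward(model_answer: str, reference_answer: str) -> float:
    """Binary reward: 1.0 if the model's final answer matches the reference.
    Because the check is automatic and unambiguous, it scales to millions
    of RL training episodes without human graders."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

# Toy usage: score a batch of sampled answers, as an RL loop would.
samples = ["42", "41", "42"]
rewards = [verifiable_math_reward(s, "42") for s in samples]
print(rewards)  # [1.0, 0.0, 1.0]
```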

But scaling never stopped working—not even in the pre-training phase! I acknowledge that jumps like the ones from GPT-2 to GPT-3 to GPT-4 may never happen again. One reason is that companies have exhausted the low-hanging internet data. Another is the natural limits of the underlying architecture of large language models (the transformer). Another is the chip and power shortage (that’s why the industry is obsessed with building infrastructure). But it “doesn’t matter” (quotation marks needed) because this was all factored in from the beginning!
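For intuition about “diminishing returns”: pre-training scaling laws are power laws, so each order of magnitude of scale buys a smaller absolute improvement, but the curve never flattens to zero. A toy illustration with made-up constants (not fitted to any real GPT model):

```python
def pretraining_loss(params_billion: float,
                     a: float = 10.0, alpha: float = 0.3,
                     irreducible: float = 1.7) -> float:
    """Toy Chinchilla-style power law: L(N) = E + A / N^alpha.
    The constants here are illustrative, not real measurements."""
    return irreducible + a / (params_billion ** alpha)

for n in (1, 10, 100, 1000):
    print(f"{n:>5}B params -> loss {pretraining_loss(n):.2f}")
# Each 10x in parameters buys a smaller absolute loss drop:
# returns diminish, but they never reach zero.
```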

No serious researcher ever said these laws would hold forever or never deviate from their predicted returns. And when the more pessimistic predictions came true and the pre-training scaling laws slowed down, the labs started scaling something else (the post-training phase). Reasoning AI models are a novel research avenue—who would have guessed that baking a chat model out of a reasoning one would degrade performance because human language is an inefficient medium for its thoughts!—and we’re yet to see how far it can take us.

In line with this, The Information also reports that another technical hurdle OpenAI engineers encountered when training GPT-5 was the lack of a “universal verifier,” something that could help engineers reinforce AI models in both objective (coding, math) and subjective areas (writing, general knowledge). Pursuing this idea allowed them to improve GPT-5’s overall capability range without losing the gains.
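The Information gives no technical detail about this “universal verifier,” so the sketch below is only a guess at the shape of the interface: one scoring function that can be backed by an exact check in objective domains and by a learned (or LLM-based) grader in subjective ones. The names and structure are hypothetical.

```python
from typing import Callable, Protocol

class Verifier(Protocol):
    """Anything that can score a model response for reinforcement learning."""
    def score(self, prompt: str, response: str) -> float: ...

class ExactMatchVerifier:
    """Objective domains (math, code): compare against a known reference."""
    def __init__(self, reference: str):
        self.reference = reference
    def score(self, prompt: str, response: str) -> float:
        return 1.0 if response.strip() == self.reference.strip() else 0.0

class GradedVerifier:
    """Subjective domains (writing, general knowledge): a learned grader,
    e.g. another model prompted to rate the answer, stands in for ground truth."""
    def __init__(self, grader: Callable[[str, str], float]):
        self.grader = grader
    def score(self, prompt: str, response: str) -> float:
        return self.grader(prompt, response)  # assumed to return a value in [0, 1]

# Toy usage with a stub grader that rewards longer answers.
stub_grader = lambda prompt, response: min(len(response) / 100, 1.0)
print(ExactMatchVerifier("42").score("6 x 7?", "42"))                 # 1.0
print(GradedVerifier(stub_grader).score("Write a haiku.", "x" * 50))  # 0.5
```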

In any case, critiques of the scaling laws in the form “The transformer is the wrong architecture,” or “The AI industry has exhausted human data and synthetic data may not suffice,” or “They should integrate LLMs/deep learning with other AI paradigms because models are only learning superficially” hold water now and will keep holding water after GPT-5.

  3. OpenAI is falling behind

I published “OpenAI Is in Trouble (For Real This Time)” on July 1st. The basic idea was that the talent churn and poaching by competitors (mostly Mark Zuckerberg at Meta for his new superintelligence team) would eventually bleed OpenAI, and that the perpetual feud with its preferred partner, Microsoft, wasn’t helping. I still stand by that, and in the month since I published that post, things have only gotten worse: OpenAI’s two main rivals—Google DeepMind and Anthropic—have each delivered another blow.

First, the IMO gold story: Unreleased AI models from both Google DeepMind and OpenAI achieved a gold medal with the same score (35/42 points, both models failing to solve the 6th problem) at the IMO (the International Mathematical Olympiad, a math competition for high schoolers) earlier this year. The competition’s board asked AI companies to withhold their announcements until after the results had been verified and the ceremony was over; Google DeepMind obliged, but OpenAI, which wasn’t officially in touch with the IMO board, didn’t.

Is it too cynical to think that this is the result of putting marketing motivations before etiquette and good manners? I guess one trait of the modern business world in Silicon Valley is that anything goes (although Google DeepMind’s CEO, Demis Hassabis, is always a class act).

However—this is the actual bad news—despite OpenAI stealing the spotlight for a short time (people are growing tired of Sam Altman’s shenanigans), Google DeepMind has been quicker to roll out its IMO gold-winning model (an IMO bronze-level version is already available in the Gemini app for Ultra subscribers, and a group of mathematicians is testing the full model). Altman, by contrast, made it clear they “don't plan to release a model with IMO gold level of capability for many months.” He’s implicitly saying that they are “too responsible to release such a powerful model,” emphasizing both OpenAI’s commitment to safety and its technical prowess. But no one buys that OpenAI cares first and foremost about safety; that’s why many of its former employees have ended up at Anthropic, the real leader on safety. And Anthropic, as it happens, is behind the other recent piece of bad news.

I was surprised to learn that “Anthropic has revoked OpenAI's access to Claude,” as Wired reported last week, on August 1st. It seems OpenAI developers were using Anthropic’s Claude Code instead of OpenAI’s own models to write code. Anthropic’s move may not even count as anti-competitive behavior if OpenAI was also using Claude’s outputs to post-train its models, which would violate the terms of service (funny thing: OpenAI accused DeepSeek of doing exactly this as a warning against the Chinese company; they all do it!). OpenAI staff, of course, are complaining.

And still! Is OpenAI falling behind? It doesn’t seem so: “OpenAI Hits $12 Billion in Annualized Revenue, Breaks 700 Million ChatGPT Weekly Active Users.” And, given the leaks and pre-release coverage, GPT-5 appears to be a great model. My guess is that OpenAI will focus on growth, revenue, and infrastructure for now and will downplay the persistent talent bleed it’s suffering and the lack of recent technical achievements (that is, unless the universal verifier—and other secret things—work out).


Now, equipped with better context about GPT-5’s predecessor, about the state of the paradigm on which GPT-5 is being conceived (the scaling laws), and OpenAI’s general situation as a leading business in a cutthroat industry, we’re better prepared to understand why GPT-5 can be good but also disappointing.

If GPT-5 is good, why would it be disappointing?

The explanation is a four-part story that illustrates well why the general public’s perception of AI is so different from that of the people in the industry, like AI researchers, tech executives, and insiders.

This post is for paid subscribers
