GPT-4.5 Feels Like a Letdown, But It’s OpenAI’s Biggest Bet Yet
It's not a setback but the setup
I. GPT-4.5 is the step back before a big jump
OpenAI has launched GPT-4.5 (blog post, system card, demo), their latest and largest AI model. They've been hinting at it under the name Orion for more than a year, at times confused with GPT-5. It's finally here and it is… underwhelming. Or at least it looks underwhelming. This post is about why this nuance between “is” and “looks” is fundamental to understanding what's going on.
You probably have a lot of questions: Why did GPT-4.5 get worse benchmark scores than models launched months ago? Why did OpenAI wait more than a year to release a model that’s not state-of-the-art? Why is it much more expensive than previous OpenAI models and rival offerings? Why have they made it so large if the pre-training scaling laws have plateaued? If they're so obsessed with reasoning—getting good performance on math and code—why are they suddenly focusing on creativity, intuition, and emotional intelligence?
Thankfully for you, I have the answers. Thankfully for me, OpenAI’s lack of transparency—and its missed opportunity to market this interesting model effectively—gives this newsletter a purpose.
Anyway, we will go over GPT-4.5’s specifications and benchmark evaluations. I will report what OpenAI has shared in the demo and the system card (also what they’ve quietly changed after public feedback, updating the document) and what early testers have reported from their anecdotal experience (GPT-4.5 is only accessible to Pro users but will be rolled out to Plus users next week).
I will also revisit those questions above so we can make sense of this seemingly underwhelming model—and explain why I think OpenAI still has the mandate of heaven despite Sonnet 3.7, DeepSeek-R1, and Grok 3.
To spare you the wait, I’ll give you a hint: Today, GPT-4.5 is a nuisance. Tomorrow, it's OpenAI's edge.
This isn’t the kind of release a company celebrates. OpenAI is less excited about launching it than about getting it out of the way so they can move forward with the main course, which will arrive in due time (weeks or months).
GPT-4.5 is, in short, the step back you take to gain momentum before a big jump.
Then comes the jump.
II. Disappointment is warranted: expensive, slow, and outdated
The system card starts like this (the blog post says more or less the same thing):
We’re releasing a research preview of OpenAI GPT-4.5, our largest and most knowledgeable model yet. Building on GPT-4o, GPT-4.5 scales pre-training further and is designed to be more general-purpose than our powerful STEM-focused reasoning models.
And then adds:
GPT-4.5 is not a frontier model, but it is OpenAI’s largest LLM, improving on GPT-4’s computational efficiency by more than 10x. While GPT-4.5 demonstrates increased world knowledge, improved writing ability, and refined personality over previous models, it does not introduce net-new frontier capabilities . . .
This pretty much sets the tone.
GPT-4.5 is super large and thus super expensive to run, even with the 10x computational efficiency gains. Yet it's not frontier in capabilities, as reflected in the benchmark scores I’ll share soon. This doesn’t look good.
(“10x efficiency improvement” and “not a frontier model” are the two bits they took out of the original system card PDF. I believe the reason is that they realized too late this was a PR disaster: how is it 10x more efficient yet 10x-25x more expensive? And why are you giving us a worse model, with an October 2023 knowledge cutoff no less?)
I have to stop here to comment on the rumors of GPT-4.5’s size. 10x efficiency yet 10x cost is a sign that this model is really big. It’s probably a Mixture of Experts, in line with earlier models by OpenAI and DeepSeek. It’s been hypothesized that it has 1 trillion active parameters and is in the ballpark of 10-15 trillion total parameters, which would vindicate my prediction last year about GPT-5’s size (ignoring OpenAI’s arbitrary numbering).
Anyway, this is pure speculation, and that’s what makes it fun. We’ll never know for sure, but keep an eye out for a fine-grained analysis from Semianalysis, EpochAI, or others.
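If you want intuition for how “10x more efficient” and “10x-25x more expensive” can coexist, here’s some back-of-the-envelope arithmetic. Every figure in it is a rumor or an assumption of mine, not a confirmed spec:

```python
# Back-of-the-envelope MoE arithmetic. Every figure below is a rumor or an
# assumption, not a confirmed spec; only the shape of the trade-off matters.

def flops_per_token(active_params: float) -> float:
    # A transformer forward pass costs roughly 2 FLOPs per active parameter
    # per generated token.
    return 2 * active_params

def weight_memory_gb(total_params: float, bytes_per_param: int = 2) -> float:
    # Weights alone at bf16 (2 bytes/param); KV cache and activations add more.
    return total_params * bytes_per_param / 1e9

# Rumored GPT-4.5: ~1T active / ~12T total parameters. Rumored GPT-4-class
# MoE for comparison: ~280B active / ~1.8T total.
gpt45_active, gpt45_total = 1e12, 12e12
gpt4_active, gpt4_total = 280e9, 1.8e12

print(f"Per-token compute: {flops_per_token(gpt45_active) / flops_per_token(gpt4_active):.1f}x")
print(f"Weight memory:     {weight_memory_gb(gpt45_total) / weight_memory_gb(gpt4_total):.1f}x")
```

Under these rumored numbers, GPT-4.5 would need several times the per-token compute and memory of a GPT-4-class model, so even large efficiency gains per FLOP wouldn’t stop the serving cost per token from ballooning.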
Besides being large and expensive, GPT-4.5 is not a reasoning model (not focused on STEM and reasoning) but a general base model intended for soft-skill applications (e.g., a standard chatbot). Comparing GPT-4.5 to a reasoning model (e.g., o1/o3) on the typical benchmarks everyone publishes nowadays (GPQA, math and code competitions, agentic tests like SWE-bench, or even the already-saturated MMLU) is a bad idea.
To me, a writer, this is good news. I’ve been waiting for years for a model with a better sense of aesthetics and taste, as well as a better writing style. For most ChatGPT users—non-scientists—it is good news as well. At the same cost—you’ll still pay either $200/month or $20/month—we'll get a model that writes and communicates slightly better, more naturally, and more humanely.
But for those who have been benefitting from the recent streak of cheap STEM-focused reasoning models (o1/o3, Sonnet 3.7, DeepSeek-R1, Grok 3) through the API, it will feel like a huge step back. OpenAI is not hiding it but I'm not sure their honesty will appease anyone. At $150 per million output tokens, GPT-4.5 is 150-300x more expensive than DeepSeek V3.
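To make the multiplier concrete (GPT-4.5’s rate is OpenAI’s published price; the V3 rates are rough assumptions of mine, since DeepSeek’s pricing has changed over time):

```python
# GPT-4.5's output rate is OpenAI's published price. The DeepSeek V3 rates
# are rough assumptions; V3's pricing has changed over time.
gpt45_output = 150.00    # $ per 1M output tokens
v3_rates = (0.50, 1.10)  # $ per 1M output tokens, assumed low/high

for rate in v3_rates:
    print(f"At ${rate:.2f}/M for V3, GPT-4.5 is ~{gpt45_output / rate:.0f}x pricier")
```

Depending on which V3 rate you assume, the multiplier lands roughly between ~140x and ~300x.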
That’s why OpenAI says at the end of the blog post that they’re “evaluating whether to continue serving it in the API long-term.” Serving GPT-4.5 takes resources away from training the next models. Besides, its sheer size means it’s slow as well. Who would pay 300 times the price to use this sloth when competitors have already commoditized its value?
I don’t really understand why they’re serving GPT-4.5 at all (not just in the API but even in ChatGPT). Retiring the model would be in line with what I argued in my post on GPT-5: OpenAI should keep these super large base models internal and use them to distill better ones that are as capable but cheap to serve. Your 400 million weekly active users—who will always choose speed over size—would thank you, and so would your pockets.
Anyway, GPT-4.5 going away in the API is not a big deal. Among the many millions of users that OpenAI serves, only a tiny fraction watches the demos, comments on Sam Altman's posts, or cares at all about how the product changes. Among those who care, only a tiny fraction pays for the API at all. (Also, they tend to prefer Anthropic for that.)
Sam Altman knows OpenAI can momentarily disappoint scientists, programmers, and researchers because ChatGPT subscriptions keep revenue secure. GPT-4.5 is a temporary nuisance he can afford.
III. Sometimes you don’t want to raise the ceiling but the floor
This is the most important sentence in the system card PDF:
GPT-4.5 is our next step in scaling the unsupervised learning paradigm.
This will probably go over the heads of most people. AI-savvy grifters eager to misinterpret the implications will do so freely, with little pushback from the timeline. OpenAI tried hard to get this message across, but I think it will fall on deaf ears. If you take away just one thing from this post, let it be this:
They've not trained GPT-4.5 to raise the ceiling but to raise the floor.
Scaling unsupervised learning, which is another way of saying “scaling pre-training”—the first stage, when tons of internet data is dumped into the model so that it acquires rudimentary linguistic skills and general knowledge about the world—means they're scaling the baseline potential of the thing.
Since OpenAI introduced o1-preview in September 2024, the focus has shifted to reasoning scaling laws (or test-time compute scaling). Everyone followed. What mattered was no longer how good the base model was but how well the post-training stage taught it to think.
GPT-4.5 marks a much-needed revisit of the pre-training scaling laws. Everyone agrees they’re yielding diminishing returns (which, regardless of what the press and the skeptic crowd say, doesn’t mean zero returns), but it was always inevitable that AI companies would return to them from time to time to do large-scale pre-training, even if, at first glance, stepping-stone models like GPT-4.5 seem like a step back in capabilities.
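To see what “diminishing but nonzero returns” looks like, here’s a toy parameter-only scaling curve. The constants are borrowed from the Chinchilla paper for shape alone; I’m ignoring the data term, and nothing here is fitted to GPT-4.5 or any OpenAI model:

```python
# Toy Chinchilla-style scaling law (parameters only, data term ignored):
# loss(N) = E + A / N**alpha. Constants borrowed from the Chinchilla paper
# for shape; they are not fitted to GPT-4.5 or any OpenAI model.
E, A, alpha = 1.69, 406.4, 0.34

def loss(n_params: float) -> float:
    return E + A / n_params**alpha

prev = None
for n in (1e11, 1e12, 1e13):  # 100B -> 1T -> 10T parameters
    current = loss(n)
    delta = "" if prev is None else f"  (improvement: {prev - current:.3f})"
    print(f"{n:.0e} params: loss ~ {current:.3f}{delta}")
    prev = current
```

Each 10x in scale still lowers the loss, just by less each time. That shrinking-but-positive delta is the entire bet behind GPT-4.5.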
I drew this graph in my first article on DeepSeek to illustrate why companies need better base models to keep improving their technology:

[Graph: better pre-trained base models raise the starting point from which post-training can push capabilities further]
This is one of the biggest mistakes most people make when assessing the value of AI models.
They look at the data point (GPT-4.5’s benchmarks are meh), not the trajectory (what does OpenAI want GPT-4.5 for?). They think pre-training and post-training exist at odds with each other (GPT-4.5 is a black pill for the scaling laws) instead of seeing that they complement one another (if I have a better pre-trained base model, I can train better reasoning models through distillation + post-training).
(One common belief in AI circles is that language reasoning models only started to work last year because base models had to reach a certain size, trained on a certain amount of data, to even have a chance of attaining decent reasoning capabilities. The approach had been tried before but failed because the base models weren’t good enough. GPT-4.5, trained with 10x more compute than GPT-4, could be the same kind of launchpad for the next-gen models distilled and post-trained from it.)
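Mechanically, “distill better models from it” means something like the sketch below: a generic knowledge-distillation step in PyTorch with hypothetical shapes, not OpenAI’s actual pipeline:

```python
import torch
import torch.nn.functional as F

# Generic knowledge-distillation step: a small student learns to match the
# frozen teacher's token distribution. Shapes here are hypothetical.

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions; a higher temperature exposes the teacher's
    # "dark knowledge" about which wrong tokens are almost right.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence pulls the student toward the teacher's distribution;
    # the temperature**2 factor keeps gradient magnitudes comparable.
    return F.kl_div(student_logprobs, teacher_probs,
                    reduction="batchmean") * temperature**2

# Toy usage: 4 token positions over a 50k-token vocabulary.
teacher_logits = torch.randn(4, 50_000)                      # from the big frozen model
student_logits = torch.randn(4, 50_000, requires_grad=True)  # from the small model
loss = distill_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow into the student only
```

The teacher’s full token distribution carries far more signal per example than the single “correct” token does, which is why a strong base model like GPT-4.5 is such a valuable thing to distill from.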
In case it’s not clear how to think about GPT-4.5 compared to other models, let me use a human analogy: GPT-4.5 is like a toddler with brilliant parents and impressive genetics versus, say, o1/R1, a decent-but-not-genius adult who already took math classes in college. I imagine you'd choose the latter to work for you. However, that toddler will eventually grow up: GPT-4.5 will serve as the basis for GPT-5, o4, and so on.
People will keep saying that GPT-4.5 is a disappointment. That pre-training is so dead. But dead for what? For getting a higher score on Codeforces or AIME? Yes—for getting gold in the Informatics Olympiad, it is very much dead. Who cares? You have models that are crazily good at math and code yet can't solve tricky permutations of the simplest riddles. So you gotta raise the floor. No true AGI is dumb at times. If GPT-4.5 is less smart but also less dumb, and also a means to make new models less dumb as well, I’m all for it.
That’s why training GPT-4.5 is a critical move for OpenAI, even if the model itself is underwhelming and the product is a disaster in terms of latency and pricing.
IV. Why did you severely undersell GPT-4.5’s best skills?
But still, OpenAI could have done better. Because even if GPT-4.5 doesn’t like to think much, it has other virtues. Let's look at the graphs, which will reveal why I'm warning you that GPT-4.5 will be less impressive than you expect once you try it:
[Charts from OpenAI: GPT-4.5 vs. previous OpenAI models on SimpleQA accuracy and hallucination rate, plus GPQA, AIME 2024, SWE-Lancer Diamond, and SWE-bench Verified]
Results on SimpleQA reveal that GPT-4.5 knows more about the world and hallucinates less than other OpenAI models (note that OpenAI didn’t test against any models from other companies). The jump from GPT-4o on GPQA, AIME 2024, SWE-Lancer Diamond, and SWE-bench Verified (STEM and agentic benchmarks) is also notable. So far so good.
However, it’s interesting that GPT-4.5 is still worse than DeepSeek V3 on every benchmark where both were tested (GPQA, AIME 2024, and SWE-bench Verified):

[Chart: GPT-4.5 vs. DeepSeek V3 on GPQA, AIME 2024, and SWE-bench Verified]
So, yes, a nice improvement over GPT-4o but not an improvement over the best non-reasoning models out there—including a model that’s 2-3 orders of magnitude cheaper. This smells like trouble for OpenAI.
(Let me note that V3, although not a full reasoning model, was post-trained with some reinforcement learning. That post-training phase seems to be absent in GPT-4.5, which was fine-tuned and RLHFed in the standard way to turn it into a friendly and obedient chatbot.)
OpenAI presented other benchmark scores in the system card that only reiterate this story: it’s not a good model compared to what exists out there.
But why did they use the standard STEM/agent benchmarks everyone else is using (benchmarks OpenAI itself pioneered)? Why didn’t they instead focus on benchmarks that better reflect GPT-4.5’s idiosyncrasies, such as creative writing, intuition, emotionality, etc.? Professor Ethan Mollick says that, contrary to what one might think, these things are also measurable.
Perhaps they don’t think those things can be measured except through vibes and taste. Perhaps they held back on making GPT-4.5 feel too impressive, ensuring it wouldn’t overshadow the leap to GPT-5. The plan might have been to weather the storm until GPT-5 arrived, but they overdid the sincerity and realized they should patch the giveaways in the system card.
In any case, I don’t think GPT-4.5 will remain OpenAI’s latest model for long. To repeat myself: I think they see this model—and the attention it has inevitably drawn to its bad benchmarks, high price, high latency, etc.—as a temporary nuisance.
V. The tyranny of the low-taste tester: People prefer GPT-4
I want to make a special mention of Andrej Karpathy’s posts on GPT-4.5.
He wrote a long tweet explaining how he feels about the progression of the GPT family since GPT-1 and, in just one sentence, summarized why it’s important to separate GPT-4.5 from the reasoning models out there:
. . . this release offers a qualitative measurement of the slope of improvement you get out of scaling pretraining compute (i.e. simply training a bigger model).
Yet he also notes that, if anything, the improvement over GPT-4 is slight, subtle, and almost unnoticeable with standard prompts.
He tried anyway. He ran a poll asking us to choose which output we preferred, model A or model B, in a blind test with five different prompt-completion pairs (checking for vibes, creativity, humor, etc., i.e., the areas where GPT-4.5 should beat GPT-4).
I did the test and found it rather hard to tell them apart. I noticed that one felt closer to the current GPT-4o, so I went for that one (it felt fresher, less formulaic).
Other people, however, thought differently. Karpathy has shared the results of the polls and, overwhelmingly, people prefer GPT-4.
Question 1: GPT4.5 is A; 56% of people prefer it.
Question 2: GPT4.5 is B; 43% of people prefer it.
Question 3: GPT4.5 is A; 35% of people prefer it.
Question 4: GPT4.5 is A; 35% of people prefer it.
Question 5: GPT4.5 is B; 36% of people prefer it.
As he said, this is awkward.
Here’s how Karpathy concludes:
Either the high-taste testers are noticing the new and unique structure but the low-taste ones are overwhelming the poll. Or we're just hallucinating things. Or these examples are just not that great. Or it's actually pretty close and this is way too small sample size. Or all of the above. So we'll just wait for the larger, more thorough LM Arena results. But at least from my last 2 days of playing around, 4.5 has a new, deeper charm, it's more creative and inventive at writing, and I find myself laughing more at its jokes, standups and roasts.
Actually, I wouldn’t be surprised if people generally preferred GPT-4/GPT-4o to GPT-4.5. It’s what you should expect if GPT-4.5 were better. After all, people prefer AI slop poetry to human poetry. We can’t all be high-taste testers.
VI. No surprise for the writer, no surprise for the reader
I wonder how much better AI can get at writing.
How can we even measure it? Why do chatbots still feel stupid when you evaluate them on the only benchmark that should matter—the one we’ve forgotten: being more humane? It’s Moravec’s Paradox all over again. It’s not so easy to write well.
Companies moved too soon to STEM benchmarks, especially code and math, without having allowed models to master the art of writing, chatting, humor, and creativity. In a way, GPT-4.5 is a step in this direction. (I actually don't think GPT-4.5 is that good at writing; it’s a bit better than its predecessors, but that’s not saying much.)
Let me close this post with a reflection on why I believe that’s the case and why, against all odds, writing remains one of the hardest skills for a language model to master.
AI-generated writing feels flat without a human hand guiding it because language models are intrinsically incapable of surprise. They’re designed to do the opposite of upending your expectations, yet the best writing thrives on exactly that.
Poet Robert Frost said, “No surprise for the writer, no surprise for the reader.”
If you can't surprise your readers, you’re not doing a good job.
What a pity that LLMs are designed to favor the most likely token; even when sampling settings let them pick a less likely one now and then, the choice is random rather than deliberate.
And I don’t mean a nonsensical token—that’s the brilliance and art of human writing: finding those unlikely words that land somewhere between the ordinary and the absurd, like tree branches that twist in unexpected directions and shapes. There’s a sweet spot in there. Hard to find. Harder to master.
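To be fair, this is what the random kind of surprise looks like mechanically. Here’s a minimal temperature-sampling sketch with a made-up vocabulary and made-up logits; the mechanism is standard, the numbers are not from any real model:

```python
import numpy as np

# Toy next-token distribution; vocabulary and logits are made up.
rng = np.random.default_rng(0)
tokens = ["branch", "road", "river", "fractal", "kazoo"]
logits = np.array([3.0, 2.5, 1.8, 0.6, -1.0])

def sample(temperature: float) -> str:
    # Lower temperature sharpens the distribution toward the likeliest token;
    # higher temperature flattens it, letting unlikely tokens through.
    p = np.exp(logits / temperature)
    p /= p.sum()
    return tokens[rng.choice(len(tokens), p=p)]

for t in (0.2, 1.0, 1.5):
    print(f"T={t}: {[sample(t) for _ in range(8)]}")
```

Higher temperature buys variety, but the deviation is noise, not the deliberate, meaningful surprise Frost was talking about.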
AI doesn’t yet inhabit this land. And as things stand today, it may never.