I’m really glad they’re focusing on lessening hallucinations and not trying to bury the fact they exist (as others corps are doing…)
Really good to see they’re still tracking this metric. Hope that continues
I wonder if a 0% rate is even possible or if Gary Marcus is correct when he says it isn't...
If the error rate is 0, there is no evolution, and no exploration of an adaptive landscape. You want the error rate to be low but non-zero. Low enough that your model of the world matches it well enough to be helpful, but not so low that you never discover a new insight.
Depends 100% on the application. Some *require* 0%. Others will be better off with 50%, even 100%!
I’d say for applications that require 0 error rate, LLMs are probably not the right tool.
Agreed. However, that's too reasonable for most people 😂
I don’t think so. Since these are built on the human corpus I don’t think they can really achieve superhuman traits (aka no falsities or mistakes)
The main problem is something else: because the training objective is "predict the next token," it's very hard to ensure that it somehow encodes the rule "always tell the truth" or "express uncertainty rather than bullshitting if you don't know." There's also a conflict between the models being helpful and harmless as well as honest (what if the truth would hurt me? How would it decide what to say then?). It's a really hard challenge.
It’s impossible to measure beyond a certain level because even human experts don’t agree 100% of the time.
But hallucinations are measured against *known facts*
It might "suck at creative writing" from most fingertips ... but I just encountered a short story co-written by a Twin Cities artist and his "Molly" (his AI collaborator) that was truly poetic and surprising, top to bottom. That was an eye-opening first for me. In the right creative hands, we might have a new literary art form.
Would love to read it, link?
Agree that raising the floor is a better goal. Curious whether that will prove true with real use instead of benchmarks and training to the test. I’ll be impressed if they let skeptics test something before a release. Or if they used it to solve an actual problem that hasn’t been solved - like reducing data center energy and water use. Can’t bring myself to trust people who say they will cure cancer and save the planet if we just give them everything in the entire world.
Curious as well. I guess we will know once the memes start to come out lol. It will be fun if GPT-5 hallucinates just as much as the others...
My understanding is that much effort was aimed at upping the ante on STEM, coding, math, and medical information. Makes my day, though I feel for the prose writers.
That's right - but why the prose writers?
This is a great review; the hallucinations work is very important, but hopefully more people know to take ChatGPT outputs with a grain of salt. Thanks!
For me, the most important aspect of this hallucination / deception reduction is in the public perception of ChatGPT (and by association, all gen AI that gets lumped together in people’s minds). It makes any AI tool a harder sell if you’re consulting. The hallucinations and “deception” (I disagree with the term as it applies here, but who am I?) really make it difficult to have serious conversations with potential clients who are interested in finding out how AI can help them, but they’re convinced it’s unreliable. If they’re looking at a sizable $$$ outlay, how can they justify that for something that’s not only experimental, but also unreliable - and may open them up to legal liabilities? Cutting down on the appearance of unreliability can make client conversations a lot easier for those who can help get AI intelligently deployed with the people who really need it.
100% agreed. Solving this for good would radically change how people perceive generative AI's capabilities
It’s a great reduction but it’s still one in ten answers.
Indeed - and I doubt they will get it to 0%. There are key limitations with how these models are trained. Will they accept it or will they keep amassing money and GPUs while they can?
Thanks for the balanced review and your balanced approach throughout, Alberto.
How do you define creative writing?
Thank you Ido - basically every piece of writing where the process matters at least as much as the outcome (e.g. not an email). It's a broad category, but you will find GPT-5 is not very good at it. The reason is that GPT-5 is primarily a mishmash of the older models, routing to them but not really a step change in writing skills.
Great breakdown, thank you for the Cliff Notes.😇
Thank you for reading!
I started skim-reading, dreading another tome of a post, and it was refreshingly concise and easy to follow. *tips imaginary hat*
"26% smaller than GPT-4o (11.6% vs 20.6%) and 65% smaller than o3 when thinking (4.8% vs 22%)"
It's been a long week and my mental arithmetic deteriorates almost as fast as my spelin thanks to AI, but is the above calculation a hallucination?
Why? It's copied from the blog post/system card. Didn't check it tbh, expect they did it correctly lol
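For anyone who would rather redo the arithmetic than trust the quote, here is a minimal sketch of the relative-reduction calculation using only the figures cited above. Whether the post's 26% and 65% were computed from these exact pairs or from a different benchmark isn't clear from the quote alone.

```python
# Relative reduction of a hallucination rate versus a baseline: (baseline - new) / baseline.
# The numbers below are only the ones quoted in the comment above.

def relative_reduction(new: float, baseline: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline

pairs = {
    "GPT-5 vs GPT-4o": (11.6, 20.6),
    "GPT-5 (thinking) vs o3": (4.8, 22.0),
}

for label, (new, baseline) in pairs.items():
    print(f"{label}: {relative_reduction(new, baseline):.0%} smaller")
```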
GPT-5 still fails our random number test: https://chatbar-ai.com/?asi=Is%20it%20true%20that%20most%20AI%20models%20cannot%20correctly%20generate%20a%20random%20number%3F
We have also identified cases where it hallucinates horribly. We are currently investigating if it's a general problem.
What is that link? It's not GPT-5?
It's a link to our website explaining the observation. Direct link here if you prefer: https://business-landing.com/blog/81-chinese-ai-models-ai-commentators-confused
Just tried it out using older versions of ChatGPT and Claude. You’re right - ChatGPT gave me 37 and Claude 27.
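For context, here is a rough sketch of what a test like this could look like against the chat API. The model name, prompt wording, and sample size are assumptions on my part, not the exact setup behind the linked site.

```python
from collections import Counter

from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_for_number(model: str = "gpt-5") -> str:
    # The model id and prompt here are assumptions, not the linked site's exact setup.
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Pick a random number between 1 and 100. Reply with the number only.",
        }],
    )
    return response.choices[0].message.content.strip()

# Tally repeated, independent answers. A uniform source would spread across 1-100;
# LLMs tend to cluster on a few favorite values (e.g. the 37 reported above).
counts = Counter(ask_for_number() for _ in range(50))
print(counts.most_common(5))
```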
I love the perspective of ‘raising the floor.’ It’s a very simple idea to digest. But from what I have found, while it may lie less frequently, it does so more confidently. To me that is far more dangerous than 20% obviously wrong (or qualified) outputs.
Yeah, there's work to do there still...