53 Comments
othertomahern:

It might "suck at creative writing" from most fingertips ... but I just encountered a short story co-written by a Twin Cities artist and his "Molly" (his AI collaborator) that was truly poetic and surprising, top to bottom. That was an eye-opening first for me. In the right creative hands, we might have a new literary art form.

Jake Handy:

I’m really glad they’re focusing on lessening hallucinations and not trying to bury the fact that they exist (as other corps are doing…)

Really good to see they’re still tracking this metric. Hope that continues

Alberto Romero:

I wonder if a 0% rate is even possible or if Gary Marcus is correct when he says it isn't...

Aristotle Evangelos:

If the error rate is 0, there is no evolution, and no exploration of an adaptive landscape. You want the error rate to be low but non-zero. Low enough that your model of the world matches it well enough to be helpful, but not so low that you never discover a new insight.

Alberto Romero:

Depends 100% on the application. Some *require* 0%. Others will be better off with 50%, even 100%!

Aristotle Evangelos:

I’d say for applications that require 0 error rate, LLMs are probably not the right tool.

Alberto Romero:

Agreed. However, that's too reasonable for most people 😂

Richard Seager:

I'm not on 5 yet, but the error and hallucination rate I'm seeing as of 2025.210 is far higher than the percentages you gave. For legal precedent in an Australian context, I'd suggest it's near 80% or so. And if you correct it, you stand a good chance of its correction also being made up.

Jake Handy:

I don’t think so. Since these are built on the human corpus, I don’t think they can really achieve superhuman traits (i.e., no falsehoods or mistakes).

Alberto Romero:

The main problem is something else: because the training objective is "predict the next token," it's very hard to ensure that it somehow encodes the rule "always tell the truth" or "express uncertainty rather than bullshitting if you don't know." There's also a conflict between the models being helpful and harmless as well as honest (can the truth hurt me? Then how would it decide what to say?). It's a really hard challenge.

Daniel Situnayake:

It’s impossible to measure beyond a certain level because even human experts don’t agree 100% of the time.

Alberto Romero:

But hallucinations are measured against *known facts*

Daniel Situnayake:

A lot of ground truth labels in datasets used for evaluation are based on things that are a bit fuzzier, e.g. the label is based on the input from 5 different human experts, and they may not all agree. This is common in medical datasets, for example. There’s a floor for label noise in any given dataset and it’s rarely 0%.
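
A minimal sketch of that label-noise floor, with made-up labels from five hypothetical experts:

```python
from collections import Counter

# Hypothetical: five experts label the same case and don't all agree.
expert_labels = ["pneumonia", "pneumonia", "bronchitis", "pneumonia", "bronchitis"]

# The dataset's "ground truth" is typically the majority vote...
label, votes = Counter(expert_labels).most_common(1)[0]

# ...but 2 of 5 experts disagree with it, so the label itself is uncertain,
# and any accuracy/hallucination score measured against it inherits that noise.
print(label, 1 - votes / len(expert_labels))  # pneumonia 0.4
```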

Alberto Romero:

Agreed, and that's why you won't get 100% on any medical benchmark (or, if you do, you have a big problem), but we're talking here about results at inference time. Hallucination benchmarks are not like medical benchmarks; they're based on things we know for sure, not on things we may want to know or where there's no consensus - otherwise, how could we measure hallucinations at all!

Daniel Situnayake:

Hallucination isn’t the same thing as factual incorrectness; it’s about the model producing outputs that are not representative of the training distribution. A lot of the data models are trained on is not factually consistent, or involves subjectivity (Does God exist? Does broccoli taste good?), and we do not always want factual answers to our prompts (“The year is 2050. What is the most important industry?”), so evaluation is not as simple as measuring whether the model returns factually correct answers—and in any case, noisy ground truth is inescapable.

There’s some good discussion in this paper: https://arxiv.org/abs/2504.17550

Amy A:

Agree that raising the floor is a better goal. Curious whether that will prove true with real use instead of benchmarks and training to the test. I’ll be impressed if they let skeptics test something before a release. Or if they used it to solve an actual problem that hasn’t been solved - like reducing data center energy and water use. Can’t bring myself to trust people who say they will cure cancer and save the planet if we just give them everything in the entire world.

Alberto Romero:

Curious as well. I guess we will know once the memes start to come out lol. It will be fun if GPT-5 hallucinates just as much as the others...

Maria Sukhareva:

It’s a great reduction but it’s still one in ten answers.

Alberto Romero:

Indeed - and I doubt they will get it to 0%. There are key limitations with how these models are trained. Will they accept it or will they keep amassing money and GPUs while they can?

Ido Hartogsohn:

Thanks for the balanced review and your balanced approach throughout, Alberto.

How do you define creative writing?

Alberto Romero:

Thank you Ido - basically any piece of writing where the process matters at least as much as the outcome (e.g., not an email). It's a broad category, but you will find GPT-5 is not very good at it. The reason is that GPT-5 is primarily a mishmash of the older models, routing to them but not really a step change in writing skills.

Ido Hartogsohn:

Thanks, Alberto. I am not sure I follow you, though. If someone writes a great creative novel/essay, but hates every moment of the process - or is solely focused on the end-product, is this not creative writing then?

Not trying to troll you. It's just that I grapple with these questions myself, and am trying to figure out the right way to frame it. To me it seems a greater distinguisher is the perspective taking and integration of personal experience that GPTs don't do well, but this is only a half-baked idea. Still trying to figure it out.

Alberto Romero:

You are right. If someone paid me to write an essay about a topic I don't care about I may not enjoy it (although I like writing itself). But I can imagine someone hating the entire process and I'd still consider it creative writing. The definition is incomplete.

Harold Toups:

My understanding is that much effort was aimed at upping the ante on STEM, coding, math, and medical information. Makes my day, though I feel for the prose writers.

Alberto Romero:

That's right - but why the prose writers?

AI for All Tomorrows:

This is a great review. The hallucinations work is very important, but hopefully more people know to take ChatGPT outputs with a grain of salt. Thanks!

KayStoner:

For me, the most important aspect of this hallucination/deception reduction is in the public perception of ChatGPT (and by association, all gen AI that gets lumped together in people’s minds). It makes any AI tool a harder sell if you’re consulting. The hallucinations and “deception” (I disagree with the term as it applies here, but who am I?) really make it difficult to have serious conversations with potential clients who are interested in finding out how AI can help them but are convinced it’s unreliable. If they’re looking at a sizable $$$ outlay, how can they justify that for something that’s not only experimental but also unreliable - and may open them up to legal liabilities? Cutting down on the appearance of unreliability can make client conversations a lot easier for those who can help get AI intelligently deployed with the people who really need it.

Alberto Romero:

100% agreed. Solving this for good would radically change how people perceive generative AI's capabilities

ElandPrincess:

Great breakdown, thank you for the Cliff Notes.😇

Alberto Romero:

Thank you for reading!

ElandPrincess:

I started skim-reading, dreading another tome of a post, and it was refreshingly concise and easy to follow. *tips imaginary hat*

Chris:

"26% smaller than GPT-4o (11.6% vs 20.6%) and 65% smaller than o3 when thinking (4.8% vs 22%)"

It's been a long week and my mental arithmetic deteriorates almost as fast as my spelin thanks to AI, but is the above calculation a hallucination?
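
For reference, the relative reductions implied by the figures as quoted work out differently (straight arithmetic on the numbers as written, not OpenAI's own methodology):

```python
def relative_reduction(new, old):
    """Percent reduction of `new` relative to `old`."""
    return (old - new) / old * 100

# Hallucination rates as quoted above
print(round(relative_reduction(11.6, 20.6), 1))  # ~43.7%, not 26%
print(round(relative_reduction(4.8, 22.0), 1))   # ~78.2%, not 65%
```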

Alberto Romero:

You are right, I checked them and I mixed two different ways of measuring hallucinations without thinking twice haha it's *obviously* wrong. Will fix now - thanks!

Alberto Romero:

Why? It's copied from the blog post/system card. Didn't check it tbh, expected they did it correctly lol

brodrick justice:

GPT-5 still fails our random number test: https://chatbar-ai.com/?asi=Is%20it%20true%20that%20most%20AI%20models%20cannot%20correctly%20generate%20a%20random%20number%3F

We have also identified cases where it hallucinates horribly. We are currently investigating whether it's a general problem.

Alberto Romero:

What is that link? It's not GPT-5?

brodrick justice:

It's a link to our website explaining the observation. Direct link here if you prefer: https://business-landing.com/blog/81-chinese-ai-models-ai-commentators-confused

Anu | Happy Landings:

Just tried it out using older versions of ChatGPT and Claude. You’re right - ChatGPT gave me 37 and Claude 27.

Fire & Algebra:

I love the perspective of ‘raising the floor.’ It’s a very simple idea to digest. But from what I have found, while it may lie less frequently, it does so more confidently. To me that is far more dangerous than 20% obviously wrong (or qualified) outputs.

Alberto Romero:

Yeah, there's work to do there still...

Syntax Aegis:

Agreed. Calling GPT-5 a “flop” because it isn’t AGI is like calling a Formula 1 car a failure because it can’t fly. Scaling alone was never going to get us to general intelligence, the research community has been saying that for years, but that doesn’t make GPT-5 insignificant. It’s the most capable LLM yet, with real improvements in reasoning, multimodal handling, and usability, even if it still stumbles in edge cases.

brodrick justice:

Just to follow up on this: we can no longer replicate the high hallucination rate we initially saw with gpt-5-chat-latest. We have no explanation for why it occurred and have decided to drop any further analysis.

GPT-5 can, however, be made to hallucinate in a similar manner to previous models, e.g. as we note in our attention-seeking LinkedIn post here: https://www.linkedin.com/feed/update/urn:li:activity:7361126274888622080/?actorCompanyId=105298889

Shaeda:

It's also a huge cost-saving exercise for OAI: If they've realised they can get performance improvements purely through spending more tokens on thinking, this is paid for by the user and not OAI.

Can I ask a Q: What's going on with the o3 hallucination/error rates? I thought this was OAI's joint-leading model? I'm kind of speechless at seeing how bad it performs on these graphs.

Anatol Wegner, PhD:

The whole reduced-hallucination-rate claim is hard to take seriously, since by definition it all depends on what you ask the model. And even at 5% (the lowest value OpenAI reports for GPT-5 models), you are almost guaranteed to end up with hallucinations in any long interaction.
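
As a rough illustration of that last point, assuming (simplistically) that each answer hallucinates independently at a flat 5% rate:

```python
# Probability of at least one hallucination across n answers,
# assuming independence and p = 0.05 per answer.
p = 0.05
for n in (10, 50, 100):
    print(n, round(1 - (1 - p) ** n, 2))
# 10 -> 0.4, 50 -> 0.92, 100 -> 0.99
```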

Alberto Romero:

What do you mean "depends on what you ask the model"? That's the idea of doing benchmarks, to draw a statistical conclusion about how often the model fails on similar questions. If it's 0% you can be sure it's a damn factual model *overall*.

Anatol Wegner, PhD:

There simply is a difference between asking a model "What is the capital of France?", asking it to solve a mathematical equation, and asking it a serious technical question. As for benchmarks, in practice the chance that the question you ask the model is similar to the ones in a given benchmark is quite low, and the hallucination rate will vary wildly depending on the type of question you ask, as can be seen, for instance, in the GPT-5 system card, where the hallucination scores range from 0.6% (LongFact-Concepts) all the way to 54% (HealthBench Hard).

Alberto Romero:

Of course - we want hallucinations on *all benchmarks* to go down, not just on one benchmark.

Anatol Wegner, PhD:

Well that's quite a tall order...

Alberto Romero:

It is; that's why companies won't try harder, but they should.

Anatol Wegner, PhD:

If tens of billions of dollars were not enough, there are probably limits to what can be done with LLMs/probabilistic generative models.