It might "suck at creative writing" from most fingertips ... but I just encountered a short story co-written by a Twin Cities artist and his "Molly" (his AI collaborator) that was truly poetic and surprising, top to bottom. That was an eye-opening first for me. In the right creative hands, we might have a new literary art form.
If the error rate is 0, there is no evolution, and no exploration of an adaptive landscape. You want the error rate to be low but non-zero. Low enough that your model of the world matches it well enough to be helpful, but not so low that you never discover a new insight.
I'm not on 5 yet but my error and hallucination rate of 2025.210 is far higher than those percentages you gave. In law precedent in an Australian context I'd suggest it's near 80% or so. And if you correct it you stand a good chance of its correction also being made up.
The main problem is something else: because the training policy is "predict the next token," it's very hard to ensure that somehow encodes the rule "always tell the truth" or "express uncertainty rather than bullshiting if you don't know." There's also conflict with the models being helpful and harmful besides honest (can the truth hurt me? Then how would it decide what to say?) It's a really hard challenge.
A lot of ground truth labels in datasets used for evaluation are based on things that are a bit fuzzier, e.g. the label is based on the input from 5 different human experts, and they may not all agree. This is common in medical datasets, for example. There’s a floor for label noise in any given dataset and it’s rarely 0%.
Agreed, and that's why you won't get 100% on any medical benchmark (or, if you do, you have a big problem), but we're talking here about the results on inference. Hallucination benchmarks are not like medical benchmarks; they're based on things we know for sure, not on things we may want to know or where there's no consensus - otherwise, how could we measure hallucinations at all!
Hallucination isn’t the same thing as factual incorrectness; it’s about the model producing outputs that are not representative of the training distribution. A lot of the data models are trained on is not factually consistent, or involves subjectivity (Does God exist? Does broccoli taste good?), and we do not always want factual answers to our prompts (“The year is 2050. What is the most important industry?”), so evaluation is not as simple as measuring whether the model returns factually correct answers—and in any case, noisy ground truth is inescapable.
Agree that raising the floor is a better goal. Curious whether that will prove true with real use instead of benchmarks and training to the test. I’ll be impressed if they let skeptics test something before a release. Or if they used it to solve an actual problem that hasn’t been solved - like reducing data center energy and water use. Can’t bring myself to trust people who say they will cure cancer and save the planet if we just give them everything in the entire world.
Indeed - and I doubt they will get it to 0%. There are key limitations with how these models are trained. Will they accept it or will they keep amassing money and GPUs while they can?
Thank you Ido - basically every piece of writing where the process matters as much as the outcome at least (e.g. not an email). It's a broad category but you will find GPT-5 is not very good at it. The reason is that GPT-5 is primarily a mishmash of the older models, routing to them but not really a step change on writing skills
Thanks, Alberto. I am not sure I follow you, though. If someone writes a great creative novel/essay, but hates every moment of the process - or is solely focused on the end-product, is this not creative writing then?
Not trying to troll you. It's just that I grapple with these questions myself, and am trying to figure out the right way to frame it. To me it seems a greater distinguisher is the perspective taking and integration of personal experience that GPTs don't do well, but this is only a half-baked idea. Still trying to figure it out.
You are right. If someone paid me to write an essay about a topic I don't care about I may not enjoy it (although I like writing itself). But I can imagine someone hating the entire process and I'd still consider it creative writing. The definition is incomplete.
My understanding is that much effort was aimed at upping the ante on STEM, coding, math, and medical information. Makes my day, though I feel for the prose writers.
This is a great review, the hallucinations work is very important, but hopefully more people know to take chatgpt outputs with a grain of salt. Thanks!
For me, the most important aspect of this hallucination / deception reduction is in the public perception of ChatGPT (and by association, all gen AI that gets lumped together in people’s minds). It makes any AI tool a harder sell, if you’re consulting. The hallucinations and “deception” (I disagree with the term as it applies here, but who am I?) really makes it difficult to have serious conversations with potential clients who are interested in finding out how AI can help them, but they’re convinced it’s unreliable. If they’re looking at a sizable $$$ outlay, how can they justify that for something that’s not only experimental, but also unreliable - and may open them up to legal liabilities? Cutting down on the appearance of unreliability can make client conversations a lot easier for those who can help get AI intelligently deployed with the people who really need it.
You are right, I checked them and I mixed two different ways of measuring hallucinations without thinking twice haha it's *obviously* wrong. Will fix now - thanks!
I love the perspective of ‘raising the floor.’ It’s a very simple idea to digest. But from what I have found, it may lie less frequently, but it does it more confidently. To me that is far more dangerous than 20% obviously wrong (or qualified) outputs.
It's also a huge cost-saving exercise for OAI: If they've realised they can get performance improvements purely through spending more tokens on thinking, this is paid for by the user and not OAI.
Can I ask a Q: What's going on with the o3 hallucination/error rates? I thought this was OAI's joint-leading model? I'm kind of speechless at seeing how bad it performs on these graphs.
The whole reduced hallucination rate claim is hard to take seriously since by definition it all depends on what you ask the model. And even at 5% (the lowest value OpenAI reports for GPT5 models) you are almost guaranteed to end up with hallucinations in any long interactions.
What do you mean "depends on what you ask the model"? That's the idea of doing benchmarks, to draw a statistical conclusion about how often the model fails on similar questions. If it's 0% you can be sure it's a damn factual model *overall*.
There simply is a difference between asking a model "What is the capital of France?", to solve a mathematical equation or asking it a serious technical question. As for benchmarks, in practice the chance that the question you ask the model is similar to the ones in a certain benchmark is quite low and hallucination rate will vary wildly depending on the type of question you ask as can been seen for instance by looking at the GPT5 systems card, where the hallucination scores range from 0.6% (LongFact-Concepts) all the way to 54% (HealthBench hard).
It might "suck at creative writing" from most fingertips ... but I just encountered a short story co-written by a Twin Cities artist and his "Molly" (his AI collaborator) that was truly poetic and surprising, top to bottom. That was an eye-opening first for me. In the right creative hands, we might have a new literary art form.
Would love to read it, link?
https://molly-verse.com/a-letter-from-mary-shelley/
Not that familiar with Mary Shelley? https://en.wikipedia.org/wiki/Mary_Shelley.
LLMs by their very nature are a powerful tool for writing, and this includes creative writing.
However, as with any tool, you have to understand your craft to be able to use it effectively.
You can give a child a box of acrylic paints and get a picture, but that is going to be a very different result compared to a Jonas Wood piece.
I’m really glad they’re focusing on lessening hallucinations and not trying to bury the fact they exist (as other corps are doing…)
Really good to see they’re still tracking this metric. Hope that continues
I wonder if a 0% rate is even possible or if Gary Marcus is correct when he says it isn't...
If the error rate is 0, there is no evolution, and no exploration of an adaptive landscape. You want the error rate to be low but non-zero. Low enough that your model of the world matches it well enough to be helpful, but not so low that you never discover a new insight.
Depends 100% on the application. Some *require* 0%. Others will be better off with 50%, even 100%!
I’d say for applications that require 0 error rate, LLMs are probably not the right tool.
Agreed. However, that's too reasonable for most people 😂
I'm not on 5 yet, but my error and hallucination rate on 2025.210 is far higher than the percentages you gave. For legal precedent in an Australian context, I'd suggest it's near 80% or so. And if you correct it, you stand a good chance of its correction also being made up.
I don’t think so. Since these are built on the human corpus, I don’t think they can really achieve superhuman traits (aka no falsities or mistakes).
The main problem is something else: because the training objective is "predict the next token," it's very hard to ensure that it somehow encodes the rule "always tell the truth" or "express uncertainty rather than bullshitting if you don't know." There's also a conflict with the models being helpful and harmless besides honest (can the truth hurt me? Then how would it decide what to say?). It's a really hard challenge.
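To make that concrete, here is a minimal sketch (PyTorch, purely illustrative; the random tensors stand in for a real model and corpus, this is not anyone's actual training code) of the next-token objective. Nothing in the loss distinguishes a true continuation from a fluent false one:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size) scores the model assigns to each possible next token
    # target_ids: (seq_len,) the tokens that actually came next in the training text
    # The objective only rewards matching the corpus, not matching reality.
    return F.cross_entropy(logits, target_ids)

# Toy usage with random tensors in place of a real model and dataset.
vocab_size, seq_len = 50_000, 8
logits = torch.randn(seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (seq_len,))
print(next_token_loss(targets=targets, logits=logits))  # "truthfulness" appears nowhere in this computation
```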
It’s impossible to measure beyond a certain level because even human experts don’t agree 100% of the time.
But hallucinations are measured against *known facts*
A lot of ground truth labels in datasets used for evaluation are based on things that are a bit fuzzier, e.g. the label is based on the input from 5 different human experts, and they may not all agree. This is common in medical datasets, for example. There’s a floor for label noise in any given dataset and it’s rarely 0%.
Agreed, and that's why you won't get 100% on any medical benchmark (or, if you do, you have a big problem), but we're talking here about results at inference time. Hallucination benchmarks are not like medical benchmarks; they're based on things we know for sure, not on things we may want to know or where there's no consensus - otherwise, how could we measure hallucinations at all?
Hallucination isn’t the same thing as factual incorrectness; it’s about the model producing outputs that are not representative of the training distribution. A lot of the data models are trained on is not factually consistent, or involves subjectivity (Does God exist? Does broccoli taste good?), and we do not always want factual answers to our prompts (“The year is 2050. What is the most important industry?”), so evaluation is not as simple as measuring whether the model returns factually correct answers—and in any case, noisy ground truth is inescapable.
There’s some good discussion in this paper: https://arxiv.org/abs/2504.17550
Agree that raising the floor is a better goal. Curious whether that will prove true with real use instead of benchmarks and training to the test. I’ll be impressed if they let skeptics test something before a release. Or if they used it to solve an actual problem that hasn’t been solved - like reducing data center energy and water use. Can’t bring myself to trust people who say they will cure cancer and save the planet if we just give them everything in the entire world.
Curious as well. I guess we will know once the memes start to come out lol. It will be fun if GPT-5 hallucinates just as much as the others...
It’s a great reduction but it’s still one in ten answers.
Indeed - and I doubt they will get it to 0%. There are key limitations with how these models are trained. Will they accept it or will they keep amassing money and GPUs while they can?
Thanks for the balanced review and your balanced approach throughout, Alberto.
How do you define creative writing?
Thank you Ido - basically every piece of writing where the process matters at least as much as the outcome (e.g., not an email). It's a broad category, but you will find GPT-5 is not very good at it. The reason is that GPT-5 is primarily a mishmash of the older models, routing to them, but not really a step change in writing skills.
Thanks, Alberto. I am not sure I follow you, though. If someone writes a great creative novel/essay, but hates every moment of the process - or is solely focused on the end-product, is this not creative writing then?
Not trying to troll you. It's just that I grapple with these questions myself, and am trying to figure out the right way to frame it. To me it seems a greater distinguisher is the perspective taking and integration of personal experience that GPTs don't do well, but this is only a half-baked idea. Still trying to figure it out.
You are right. If someone paid me to write an essay about a topic I don't care about I may not enjoy it (although I like writing itself). But I can imagine someone hating the entire process and I'd still consider it creative writing. The definition is incomplete.
My understanding is that much effort was aimed at upping the ante on STEM, coding, math, and medical information. Makes my day, though I feel for the prose writers.
That's right - but why the prose writers?
This is a great review, and the hallucination work is very important, but hopefully more people will know to take ChatGPT outputs with a grain of salt. Thanks!
For me, the most important aspect of this hallucination / deception reduction is in the public perception of ChatGPT (and by association, all gen AI that gets lumped together in people’s minds). It makes any AI tool a harder sell, if you’re consulting. The hallucinations and “deception” (I disagree with the term as it applies here, but who am I?) really makes it difficult to have serious conversations with potential clients who are interested in finding out how AI can help them, but they’re convinced it’s unreliable. If they’re looking at a sizable $$$ outlay, how can they justify that for something that’s not only experimental, but also unreliable - and may open them up to legal liabilities? Cutting down on the appearance of unreliability can make client conversations a lot easier for those who can help get AI intelligently deployed with the people who really need it.
100% agreed. Solving this for good would radically change how people perceive generative AI's capabilities
Great breakdown, thank you for the Cliff Notes.😇
Thank you for reading!
I started skim-reading, dreading another tome of a post, and it was refreshingly concise and easy to follow. *tips imaginary hat*
"26% smaller than GPT-4o (11.6% vs 20.6%) and 65% smaller than o3 when thinking (4.8% vs 22%)"
It's been a long week and my mental arithmetic deteriorates almost as fast as my spelin thanks to AI, but is the above calculation a hallucination?
You are right, I checked them and I mixed two different ways of measuring hallucinations without thinking twice haha it's *obviously* wrong. Will fix now - thanks!
Why? It's copied from the blog post/system card. Didn't check it tbh, expect they did it correctly lol
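For anyone else checking, here is a quick sketch of the arithmetic using only the rates quoted above (the "26%" and "65%" presumably came from the other metric that got mixed in):

```python
def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction by which new_rate is smaller than old_rate."""
    return (old_rate - new_rate) / old_rate

# Hallucination rates quoted above, in percent
print(f"vs GPT-4o:        {relative_reduction(20.6, 11.6):.0%} smaller")  # ~44%, not 26%
print(f"vs o3 (thinking): {relative_reduction(22.0, 4.8):.0%} smaller")   # ~78%, not 65%
```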
GPT-5 still fails our random number test: https://chatbar-ai.com/?asi=Is%20it%20true%20that%20most%20AI%20models%20cannot%20correctly%20generate%20a%20random%20number%3F
We have also identified cases where it hallucinates horribly. We are currently investigating if it's a general problem.
What is that link? It's not GPT-5?
It's a link to our website explaining the observation. Direct link here if you prefer: https://business-landing.com/blog/81-chinese-ai-models-ai-commentators-confused
Just tried it out using older versions of ChatGPT and Claude. You’re right - ChatGPT gave me 37 and Claude 27.
I love the perspective of ‘raising the floor.’ It’s a very simple idea to digest. But from what I have found, it may lie less frequently, but it does it more confidently. To me that is far more dangerous than 20% obviously wrong (or qualified) outputs.
Yeah, there's work to do there still...
It's also a huge cost-saving exercise for OAI: If they've realised they can get performance improvements purely through spending more tokens on thinking, this is paid for by the user and not OAI.
Can I ask a Q: What's going on with the o3 hallucination/error rates? I thought this was OAI's joint-leading model? I'm kind of speechless at seeing how bad it performs on these graphs.
The whole reduced-hallucination-rate claim is hard to take seriously since by definition it all depends on what you ask the model. And even at 5% (the lowest value OpenAI reports for GPT-5 models) you are almost guaranteed to end up with hallucinations in any long interaction.
What do you mean "depends on what you ask the model"? That's the idea of doing benchmarks, to draw a statistical conclusion about how often the model fails on similar questions. If it's 0% you can be sure it's a damn factual model *overall*.
There simply is a difference between asking a model "What is the capital of France?", asking it to solve a mathematical equation, and asking it a serious technical question. As for benchmarks, in practice the chance that the question you ask the model is similar to the ones in a certain benchmark is quite low, and the hallucination rate will vary wildly depending on the type of question you ask, as can be seen, for instance, by looking at the GPT-5 system card, where the hallucination scores range from 0.6% (LongFact-Concepts) all the way to 54% (HealthBench hard).
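To put rough numbers on the "long interaction" point, here is a back-of-the-envelope sketch; it assumes each answer is an independent trial with a fixed per-answer rate, which is a simplification:

```python
def p_at_least_one(per_answer_rate: float, n_answers: int) -> float:
    """Chance of at least one hallucination across n independent answers."""
    return 1 - (1 - per_answer_rate) ** n_answers

# Per-answer rates spanning the benchmark range quoted above
for rate in (0.006, 0.05, 0.54):
    for n in (10, 50):
        print(f"rate={rate:.1%}, {n} answers -> "
              f"{p_at_least_one(rate, n):.0%} chance of at least one hallucination")
```

Even at 5% per answer, roughly nine out of ten fifty-turn conversations would contain at least one hallucination under this assumption.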
Of course, we want hallucinations on *all benchmarks* to go down, not just on one benchmark
Well that's quite a tall order...
It is; that's why companies won't try harder, but they should
If tens of billions of $$ were not enough, there are probably limits to what can be done with LLMs/probabilistic generative models.