I’m really glad they’re focusing on lessening hallucinations and not trying to bury the fact they exist (as others corps are doing…)
Really good to see they’re still tracking this metric. Hope that continues
I wonder if a 0% rate is even possible or if Gary Marcus is correct when he says it isn't...
If the error rate is 0, there is no evolution, and no exploration of an adaptive landscape. You want the error rate to be low but non-zero. Low enough that your model of the world matches it well enough to be helpful, but not so low that you never discover a new insight.
Depends 100% on the application. Some *require* 0%. Others will be better off with 50%, even 100%!
I’d say for applications that require 0 error rate, LLMs are probably not the right tool.
Agreed. However, that's too reasonable for most people 😂
I don’t think so. Since these are built on the human corpus I don’t think they can really achieve superhuman traits (aka no falsities or mistakes)
The main problem is something else: because the training objective is "predict the next token," it's very hard to ensure that it somehow encodes the rule "always tell the truth" or "express uncertainty rather than bullshitting if you don't know." There's also a conflict between the models being helpful and harmless as well as honest (what if the truth would hurt me? How would it decide what to say then?). It's a really hard challenge.
It’s impossible to measure beyond a certain level because even human experts don’t agree 100% of the time.
But hallucinations are measured against *known facts*
It might "suck at creative writing" from most fingertips ... but I just encountered a short story co-written by a Twin Cities artist and his "Molly" (his AI collaborator) that was truly poetic and surprising, top to bottom. That was an eye-opening first for me. In the right creative hands, we might have a new literary art form.
Would love to read it, link?
Agree that raising the floor is a better goal. Curious whether that will prove true with real use instead of benchmarks and training to the test. I’ll be impressed if they let skeptics test something before a release. Or if they used it to solve an actual problem that hasn’t been solved - like reducing data center energy and water use. Can’t bring myself to trust people who say they will cure cancer and save the planet if we just give them everything in the entire world.
Curious as well. I guess we will know once the memes start to come out lol. It will be fun if GPT-5 hallucinates just as much as the others...
My understanding is that much effort was aimed at upping the ante on STEM, coding, math, and medical information. Makes my day, though I feel for the prose writers.
That's right - but why the prose writers?
This is a great review; the hallucinations work is very important, but hopefully more people know to take ChatGPT outputs with a grain of salt. Thanks!
For me, the most important aspect of this hallucination / deception reduction is in the public perception of ChatGPT (and by association, all gen AI that gets lumped together in people’s minds). It makes any AI tool a harder sell if you’re consulting. The hallucinations and “deception” (I disagree with the term as it applies here, but who am I?) really make it difficult to have serious conversations with potential clients who are interested in finding out how AI can help them, but they’re convinced it’s unreliable. If they’re looking at a sizable $$$ outlay, how can they justify that for something that’s not only experimental, but also unreliable - and may open them up to legal liabilities? Cutting down on the appearance of unreliability can make client conversations a lot easier for those who can help get AI intelligently deployed with the people who really need it.
100% agreed. Solving this for good would radically change how people perceive generative AI's capabilities
It’s a great reduction but it’s still one in ten answers.
Indeed - and I doubt they will get it to 0%. There are key limitations with how these models are trained. Will they accept it or will they keep amassing money and GPUs while they can?
Thanks for the balanced review and your balanced approach throughout, Alberto.
How do you define creative writing?
Thank you Ido - basically every piece of writing where the process matters at least as much as the outcome (e.g. not an email). It's a broad category, but you will find GPT-5 is not very good at it. The reason is that GPT-5 is primarily a mishmash of the older models, routing to them but not really a step change in writing skills.
Great breakdown, thank you for the Cliff Notes.😇
Thank you for reading!
I started skim-reading, dreading another tome of a post, and it was refreshingly concise and easy to follow. *tips imaginary hat*
"26% smaller than GPT-4o (11.6% vs 20.6%) and 65% smaller than o3 when thinking (4.8% vs 22%)"
It's been a long week and my mental arithmetic deteriorates almost as fast as my spelin thanks to AI, but is the above calculation a hallucination?
Why? It's copied from the blog post/system card. Didn't check it tbh, expect they did it correctly lol
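For anyone who would rather redo the arithmetic than trust the quote, here is a minimal sketch of the relative-reduction calculation using only the figures cited above. Whether the post's 26% and 65% were computed from these exact pairs or from a different benchmark isn't clear from the quote alone.

```python
# Relative reduction of a hallucination rate versus a baseline: (baseline - new) / baseline.
# The numbers below are only the ones quoted in the comment above.

def relative_reduction(new: float, baseline: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline

pairs = {
    "GPT-5 vs GPT-4o": (11.6, 20.6),
    "GPT-5 (thinking) vs o3": (4.8, 22.0),
}

for label, (new, baseline) in pairs.items():
    print(f"{label}: {relative_reduction(new, baseline):.0%} smaller")
```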
GPT-5 still fails our random number test: https://chatbar-ai.com/?asi=Is%20it%20true%20that%20most%20AI%20models%20cannot%20correctly%20generate%20a%20random%20number%3F
We have also identified cases where it hallucinates horribly. We are currently investigating if it's a general problem.
What is that link? It's not GPT-5?
It's a link to our website explaining the observation. Direct link here if you prefer: https://business-landing.com/blog/81-chinese-ai-models-ai-commentators-confused
Just tried it out using older versions of ChatGPT and Claude. You’re right - ChatGPT gave me 37 and Claude 27.
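For context, here is a rough sketch of what a test like this could look like against the chat API. The model name, prompt wording, and sample size are assumptions on my part, not the exact setup behind the linked site.

```python
from collections import Counter

from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_for_number(model: str = "gpt-5") -> str:
    # The model id and prompt here are assumptions, not the linked site's exact setup.
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Pick a random number between 1 and 100. Reply with the number only.",
        }],
    )
    return response.choices[0].message.content.strip()

# Tally repeated, independent answers. A uniform source would spread across 1-100;
# LLMs tend to cluster on a few favorite values (e.g. the 37 reported above).
counts = Counter(ask_for_number() for _ in range(50))
print(counts.most_common(5))
```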
I love the perspective of ‘raising the floor.’ It’s a very simple idea to digest. But from what I have found, while it may lie less frequently, it does so more confidently. To me that is far more dangerous than 20% obviously wrong (or qualified) outputs.
Yeah, there's work to do there still...