34 Comments
Jake Handy:

I’m really glad they’re focusing on lessening hallucinations and not trying to bury the fact that they exist (as other corps are doing…)

Really good to see they’re still tracking this metric. Hope that continues

Alberto Romero:

I wonder if a 0% rate is even possible or if Gary Marcus is correct when he says it isn't...

Aristotle Evangelos:

If the error rate is 0, there is no evolution, and no exploration of an adaptive landscape. You want the error rate to be low but non-zero. Low enough that your model of the world matches it well enough to be helpful, but not so low that you never discover a new insight.

Alberto Romero:

Depends 100% on the application. Some *require* 0%. Others will be better off with 50%, even 100%!

Aristotle Evangelos:

I’d say for applications that require 0 error rate, LLMs are probably not the right tool.

Alberto Romero:

Agreed. However, that's too reasonable for most people 😂

Jake Handy:

I don’t think so. Since these are built on the human corpus, I don’t think they can really achieve superhuman traits (aka no falsehoods or mistakes).

Alberto Romero:

The main problem is something else: because the training objective is "predict the next token," it's very hard to ensure that it somehow encodes the rule "always tell the truth" or "express uncertainty rather than bullshitting if you don't know." There's also a conflict between the models being helpful and harmless besides honest (can the truth hurt me? Then how would it decide what to say?). It's a really hard challenge.
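For concreteness, here is a minimal sketch of that pre-training objective (a generic illustration, not OpenAI's actual training code): the loss only rewards predicting whatever token came next in the training text, and nothing in it scores truthfulness or calibrated uncertainty.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Standard next-token cross-entropy.

    logits: (batch, seq_len, vocab_size) model outputs
    tokens: (batch, seq_len) token ids of the training text
    The loss rewards assigning high probability to whatever token
    actually came next in the corpus, whether or not it was true.
    """
    preds = logits[:, :-1, :].reshape(-1, logits.size(-1))  # logits at position i predict token i+1
    targets = tokens[:, 1:].reshape(-1)                     # the tokens that actually followed
    return F.cross_entropy(preds, targets)
```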

Daniel Situnayake:

It’s impossible to measure beyond a certain level because even human experts don’t agree 100% of the time.

Alberto Romero:

But hallucinations are measured against *known facts*

othertomahern:

It might "suck at creative writing" from most fingertips ... but I just encountered a short story co-written by a Twin Cities artist and his "Molly" (his AI collaborator) that was truly poetic and surprising, top to bottom. That was an eye-opening first for me. In the right creative hands, we might have a new literary art form.

Alberto Romero:

Would love to read it, link?

Amy A:

Agree that raising the floor is a better goal. Curious whether that will prove true with real use instead of benchmarks and training to the test. I’ll be impressed if they let skeptics test something before a release. Or if they used it to solve an actual problem that hasn’t been solved - like reducing data center energy and water use. Can’t bring myself to trust people who say they will cure cancer and save the planet if we just give them everything in the entire world.

Alberto Romero:

Curious as well. I guess we will know once the memes start to come out lol. It will be fun if GPT-5 hallucinates just as much as the others...

Harold Toups:

My understanding is that much effort was aimed at upping the ante on STEM, coding, math, and medical information. Makes my day, though I feel for the prose writers.

Alberto Romero:

That's right - but why the prose writers?

AI for All Tomorrows:

This is a great review. The hallucinations work is very important, but hopefully more people know to take ChatGPT outputs with a grain of salt. Thanks!

KayStoner:

For me, the most important aspect of this hallucination / deception reduction is in the public perception of ChatGPT (and, by association, all gen AI that gets lumped together in people’s minds). It makes any AI tool a harder sell if you’re consulting. The hallucinations and “deception” (I disagree with the term as it applies here, but who am I?) really make it difficult to have serious conversations with potential clients who are interested in finding out how AI can help them but are convinced it’s unreliable. If they’re looking at a sizable $$$ outlay, how can they justify that for something that’s not only experimental but also unreliable, and may open them up to legal liabilities? Cutting down on the appearance of unreliability can make client conversations a lot easier for those who can help get AI intelligently deployed with the people who really need it.

Alberto Romero:

100% agreed. Solving this for good would radically change how people perceive generative AI's capabilities.

Maria Sukhareva:

It’s a great reduction but it’s still one in ten answers.

Alberto Romero:

Indeed - and I doubt they will get it to 0%. There are key limitations with how these models are trained. Will they accept it or will they keep amassing money and GPUs while they can?

Ido Hartogsohn:

Thanks for the balanced review and your balanced approach throughout, Alberto.

How do you define creative writing?

Alberto Romero:

Thank you Ido - basically every piece of writing where the process matters at least as much as the outcome (e.g. not an email). It's a broad category, but you will find GPT-5 is not very good at it. The reason is that GPT-5 is primarily a mishmash of the older models, routing between them rather than being a real step change in writing skills.
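To make "routing" concrete, here's a toy sketch of the idea (hypothetical names and heuristics throughout; not OpenAI's implementation): each request is dispatched to a fast model or a deeper reasoning model rather than to a single new monolith.

```python
def looks_hard(query: str) -> bool:
    """Hypothetical difficulty heuristic, purely for illustration."""
    return len(query) > 300 or any(w in query.lower() for w in ("prove", "derive"))

def route(query: str, thinking_requested: bool) -> str:
    """Hypothetical router: send each request to a fast model or a
    reasoning model, in the spirit of the description above."""
    if thinking_requested or looks_hard(query):
        return "reasoning-model"
    return "fast-model"

print(route("Prove that sqrt(2) is irrational.", thinking_requested=False))  # reasoning-model
```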

ElandPrincess:

Great breakdown, thank you for the CliffsNotes. 😇

Alberto Romero:

Thank you for reading!

ElandPrincess:

I started skim-reading, dreading another tome of a post, and it was refreshingly concise and easy to follow. *tips imaginary hat*

Chris:

"26% smaller than GPT-4o (11.6% vs 20.6%) and 65% smaller than o3 when thinking (4.8% vs 22%)"

It's been a long week and my mental arithmetic deteriorates almost as fast as my spelin thanks to AI, but is the above calculation a hallucination?

Alberto Romero:

Why? It's copied from the blog post/system card. Didn't check it tbh, I expect they did it correctly lol
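For what it's worth, the figures are easy to check. Reading "X% smaller" as a relative reduction in hallucination rate, the arithmetic gives different numbers than the quoted 26% and 65%, so those presumably come from a different benchmark or a different way of slicing the data:

```python
# Sanity check, reading "X% smaller" as a relative reduction in hallucination rate
gpt4o, gpt5 = 20.6, 11.6        # hallucination rates, %
o3, gpt5_thinking = 22.0, 4.8

print(f"GPT-5 vs GPT-4o: {(gpt4o - gpt5) / gpt4o:.1%}")            # -> 43.7%, not 26%
print(f"GPT-5 (thinking) vs o3: {(o3 - gpt5_thinking) / o3:.1%}")  # -> 78.2%, not 65%
```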

brodrick justice:

GPT-5 still fails our random number test: https://chatbar-ai.com/?asi=Is%20it%20true%20that%20most%20AI%20models%20cannot%20correctly%20generate%20a%20random%20number%3F

We have also identified cases where it hallucinates horribly. We are currently investigating whether it's a general problem.

Alberto Romero:

What is that link? It's not GPT-5?

brodrick justice:

It's a link to our website explaining the observation. Direct link here if you prefer: https://business-landing.com/blog/81-chinese-ai-models-ai-commentators-confused

Anu:

Just tried it out using older versions of ChatGPT and Claude. You’re right - ChatGPT gave me 37 and Claude 27.
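For anyone who wants to reproduce this, here is a rough sketch of such a test (the `ask_model` helper is hypothetical, standing in for whatever model API you query; the actual test behind the links above may differ):

```python
from collections import Counter

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to an actual chat-model API."""
    raise NotImplementedError

# Ask for a "random" number many times and tally the answers.
# A uniform sampler gives a roughly flat histogram; LLMs instead tend to
# cluster on favorites (37 and 42 come up disproportionately often).
prompt = "Pick a random integer between 1 and 100. Reply with only the number."
tally = Counter(ask_model(prompt).strip() for _ in range(100))
print(tally.most_common(5))
```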

Fire & Algebra:

I love the perspective of ‘raising the floor.’ It’s a very simple idea to digest. From what I have found, though, it may lie less frequently, but it does so more confidently. To me that is far more dangerous than 20% obviously wrong (or qualified) outputs.

Alberto Romero:

Yeah, there's work to do there still...
