31 Comments
Deborah Carver:

You are very kind to call the authors of this paper "the authors" and not "OpenAI corporate research." Otherwise it would read "the training and reinforcement learning processes that we, OpenAI, developed were not appropriate tests for the wide release of the product OpenAI launched."

Alberto Romero:

Haha I know. I just think any org is a complex entity with many branches, and I can't be sure whether there are internal differences (OpenAI has ~3,000 employees), so I chose not to be snarky and to comment on the content of the paper instead.

Deborah Carver:

That's fair and generous! I'm of the U.S. legal view that corporations, technically, are treated as a single legal entity equivalent to a person. In a case like this, it seems almost an admission of guilt that OpenAI didn't rigorously test its products. Yes, there are individuals within corporations, but the corporation itself is liable for the failings of its products.

If this were a case of a product being sold with, say, toxic plastics, and it poisoned a bunch of people, and it turns out the company knew it could poison people because the poisons were baked into the design, it doesn't matter how the product was developed at the company or who was on the team.

If I were a lawyer on that horrid suicide case reported in the NYT, I would absolutely be bringing this paper to court. I definitely think it shows negligence on OpenAI's side, particularly since there are plenty of folks who already have theorized how/why transformers hallucinate and how it can be prevented.

But I am a layman, not a lawyer. It's just strange to me that tech companies find ways to dodge regulation with "research" and "iteration." Corporate research papers are as representative of the company's product/culture as press releases.

Alberto Romero:

Oh, I understand your point better now, and it makes so much sense. I don't have any law expertise, much less about the US (I'm from Spain), but I guess those who can do something about it are already taking care of this. Good point indeed.

Deborah Carver:

Heh, I have no idea whether the lawyers will take care of it, but I like to think they're pretty smart. Do laws even matter anymore in the U.S., especially since the economy is hanging on the promise of AI? Who knows, and it's beyond my pay grade.

This is a great analysis either way. Thanks!

Jurgen Gravestein:

To me the whole paper feels like a giant cope. "Look, it's not the technology that's broken, it's the evaluations."

Alberto Romero:

Could be. They're not solving OOD with this approach, that's for sure.

Amy A:

And yet 90% of people using these things don’t know how often they are wrong. Sigh.

Alberto Romero:

Yes, surprisingly, this is not as widely known as I thought.

Christian:

So if LLMs hallucinate because today’s training + evaluation pipelines reward guessing over admitting uncertainty, then does that mean that the labs are optimising models for benchmarks and/or training on them? I would guess so. Hence, benchmarks may not be very useful indicators of actual real-world capabilities. Which was already obvious, but this paper kind of indirectly confirms it, and it's from an AI lab itself.
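
To make that concrete, here's a minimal sketch (mine, not the paper's; the probabilities are illustrative) of why a benchmark graded 0/1 on exact answers rewards guessing: any nonzero chance of being right beats the guaranteed zero for abstaining, so a model tuned against such benchmarks learns never to say "I don't know."

```python
# Sketch: expected benchmark score of guessing vs. abstaining under 0/1 grading.
# Assumption (not from the paper): 1 point for a correct answer, 0 otherwise.

def expected_score_guess(p_correct: float) -> float:
    """Expected score when the model guesses and is right with probability p_correct."""
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0

def expected_score_abstain() -> float:
    """Expected score when the model answers 'I don't know'."""
    return 0.0

for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p(correct)={p:.2f}: guess={expected_score_guess(p):.2f}, "
          f"abstain={expected_score_abstain():.2f} -> guessing always wins")
```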

Alberto Romero:

Yes, indeed. They're gaming the benchmarks for better scores and have been for years.

Ted:

"The authors suggest including direct penalties for incorrect answers"

Sensible and intuitive, but antithetical to the business model upon which deployment is predicated.
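
For concreteness, extending the binary-grading sketch above: with a direct penalty for wrong answers (my illustration of the general idea, not the paper's exact scoring rule), abstaining becomes the higher-expected-score choice whenever the model's confidence falls below a threshold set by the penalty.

```python
# Sketch (illustrative numbers, not the paper's scheme): +1 for a correct answer,
# -wrong_penalty for an incorrect one, 0 for abstaining ("I don't know").

def expected_score_answer(p_correct: float, wrong_penalty: float) -> float:
    """Expected score when the model commits to an answer."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

ABSTAIN_SCORE = 0.0
PENALTY = 1.0  # lose one point per wrong answer

# Answering beats abstaining only when p - (1 - p) * PENALTY > 0,
# i.e. when p > PENALTY / (1 + PENALTY); with PENALTY = 1 the threshold is 0.5.
threshold = PENALTY / (1.0 + PENALTY)

for p in (0.9, 0.6, 0.4, 0.1):
    better = "answer" if expected_score_answer(p, PENALTY) > ABSTAIN_SCORE else "abstain"
    print(f"confidence {p:.1f} (threshold {threshold:.2f}) -> {better}")
```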

Alberto Romero:

Exactly right

Ljubomir Josifovski:

Thanks for that! Excellent write-up! Enjoyed reading it. I think you hit the nail on the head with the observation that once abstention is not penalised, then 1) the model will hallucinate less, but 2) we have to accept slightly lower accuracy; we pay a small price, for a few previously lucky guesses will be no more! It's an excellent trade-off and it should be standard practice. Afaics this is 'problem solved' - to the extent realistically possible.

I get that the OOD idea is forever appealing to ML scientists. But imo it's not a great idea for application, and in application it may lead me astray. I mean the following.

In application, by definition all my knowledge is in the joint p.d.f. (X,Y). Then I ask question X=x. In an auto-regressive model, I work out the conditional p.d.f. (Y|X). It now depends on Y only, which is where I will find my answer y. (The simplest case: 1-D X and 1-D Y; the joint p.d.f. (X,Y) is a 3-D shape, X=x is a cut with a plane, and the outline (X=x,Y), divided by a constant so it normalises to sum to 1, is my conditional p.d.f. (Y|X).) The answer y I will sample from that conditional (Y|X).

What does OOD mean in this context? That this particular x was unseen in the training data. Then I expect (Y|X) to be flat, for I'm ignorant. (I may get unlucky and get a peaky (Y|X), but probably not.) I can detect the flatness (= high entropy) and say "I can't sample a good y from (Y|X); it's too flat, nothing is particularly probable, everything is about equally probable" (i.e. equally improbable), so I return y = 'I don't know'. Good; that's the correct course of action imo.

How does that compare to a non-OOD question, when I'm in-distribution (IND) but still uncertain? Let's say my knowledge (X,Y) is "I threw a die, and a number came up." Let's say the user question is "what number came up - answer with a single word only". What happens? I'm IND. The conditional (Y|X) has 6 peaks, at the 6 IND answers "one"-"two"-...-"six". What should my answer be? I think I should answer y = "I don't know" again (given the "answer with one word only" instruction). Again I can use a similar "level of peaky-ness" criterion to detect that I may not have a good choice for an answer.

To a practitioner, this IND case and the previous OOD case are very similar.

To my mind OOD is epistemic uncertainty, where I don't know the p.d.f. But for practical purposes that's close enough to having a very large number of outcomes (in lieu of an "infinite number") instead of the 6 above. Where I do know the p.d.f., I have aleatoric uncertainty only. So while OOD may feel very different from IND, in practice for me it's more like "IND, but with very, very many outcomes - so not knowing anyway."
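
A minimal sketch of the "level of peaky-ness" criterion above, assuming we can read off the model's conditional distribution over candidate answers; the entropy threshold is an arbitrary illustration, not a recommended value.

```python
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def answer_or_abstain(candidates, probs, max_entropy_bits=1.5):
    """Return the most probable candidate, or abstain if p(Y|X=x) is too flat."""
    if entropy_bits(probs) > max_entropy_bits:
        return "I don't know"
    return max(zip(candidates, probs), key=lambda cp: cp[1])[0]

faces = ["one", "two", "three", "four", "five", "six"]

# The die example: 6 equal peaks, entropy ~2.58 bits -> too flat -> abstain.
print(answer_or_abstain(faces, [1/6] * 6))

# A peaked conditional (entropy ~0.70 bits) -> confident enough to answer.
print(answer_or_abstain(faces, [0.90, 0.02, 0.02, 0.02, 0.02, 0.02]))
```

Both the OOD case (flat because unseen) and the flat-but-IND die case trigger the same abstention, which is exactly the point above.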

Thanks again for writing your post - I enjoyed reading it.

Mike X Cohen, PhD:

Thanks for the clear write-up, Alberto. Hallucinations are tricky, because the stochasticity of an LLM's output should itself be context-dependent: If you ask a model to generate an imaginative children's story, you want it to be more "creative" by making shit up; whereas if you ask the model for historical facts, you don't want anything resembling hallucinations.
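
A small, generic sketch (plain Python, not tied to any particular model or API) of the knob that usually carries this context-dependence: a sampling temperature that sharpens the next-token distribution for factual queries and flattens it for creative ones.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into sampling probabilities; lower temperature -> sharper distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [round(e / total, 3) for e in exps]

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for four candidate tokens

print(softmax_with_temperature(logits, temperature=0.3))  # factual mode: nearly all mass on the top token
print(softmax_with_temperature(logits, temperature=1.5))  # creative mode: mass spreads across candidates
```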

Alberto Romero:

They mention this in the paper, and I agree, it should be context-dependent, which makes it really tricky. I guess the best approach is to just solve the "making things up" problem and then go from there to let them make things up *in specific circumstances*.

Krista Johanson:

Even "historical facts" are context-dependent. Is it "correct" to answer "Who chopped down the cherry tree" with "George Washington" when this incident likely never occurred? It depends on why I'm asking. The LLM doesn't know when it is "correct;" it is always "guessing". In order for the human to evaluate the correctness of the answer the human needs to know where the AI found the information that fed into the response (and what other potentially relevant information the AI overlooked).

William Meller:

We would never call a student ‘knowledgeable’ just because they guessed their way to a passing grade. Yet that’s effectively what we’ve been rewarding in language model training.

Stefano:

Thanks for the write-up and for making it available!

At the end I couldn't help but shudder at the assumption that solving hallucinations won't give rise to other, bigger problems. It's like they're trying to answer a philosophical question with maths and time/money constraints.

The premise of frictionless training seems completely misplaced (i.e. hallucinations are a symptom of a bigger problem, and sunk costs and competitive pressure mean those involved can't start over or take risks). I'm shocked at how simplistic the architecture and training behind LMs actually are, reduced to a minute set of actions/formulas. As humans we are still unable to comprehensively answer, using science, basic questions about what thinking is, the mind, consciousness, self, ego, the subconscious, sleep and dreaming, etc. And we have years as children at our disposal to learn, with friction, guessing, intuition, penalties, etc., to figure out how to learn, language, thoughts, and so on.

Surely the OOD issue is a symptom of a preceding issue connected with choices in the approach and design, and with the concept of what LMs actually are? Even with some re-alignment of training methods (penalties for wrong answers... I suspected, but can't believe you're confirming, that those working on LMs are so arrogant as to not consider the evolution of teaching methods/science/pedagogy, probably because it costs too much time and requires grappling with unmeasurable variables, as you wrote), by way of analogy, it's like constantly adding sugar, then salt, then sugar, then salt to a soup to try to get it to taste right... Perhaps if they hadn't been hyped as the AI of AGI/ASI sci-fi fantasy fame, we wouldn't have an issue with LMs returning an "I don't know" answer. LMs are undoubtedly beautiful, but they risk being applied with devastating consequences because they're not fit for purpose. That many don't know better or notice when using them makes it even worse. Imagine what happens to the body eating a soup laden with industrial-scale proportions of salt and sugar hidden behind its taste! (Just today I had an exchange with an author who now regularly uses LMs to frame, research and correct essays: the MIT study comes to mind!)

Alberto Romero:

Because it costs too much time and because they probably consider it a dead end; they're not pursuing LLMs out of conviction so much as out of a lack of choices!

Michael Spencer:

But if you're promoting reasoning models that hallucinate more, as OpenAI has, how can I trust the AI persuasion embedded in your models? To be honest, I think they're trying to game the demand for compute to keep the cash cow of the generative AI hype cycle flowing.

More data centers mean more inference-time compute, more AI agents, and more demand for their, yes, hallucinating products! Not just hallucinating, but gamed to retain users rather than be truthful to them.

Alberto Romero:

Yes, OpenAI has an important dilemma here: if they teach them not to hallucinate, that means sycophancy is off the table in many cases. You can't make them truthful and agreeable at the same time.

Mike Bauer:

OK, so LLMs tend to mansplain, making things up because they “should” know the answers.

Alberto Romero:

I don't think it's a good analogy: mansplaining is a cultural bias whereas hallucinations are a technical bias (at least that's the paper's argument; I think there's more to it)

Mike Bauer:

I know, it was a tongue-in-cheek comment.

I do often think we demand perfection from technology - autonomous vehicles, AI, etc. - but not from ourselves or other humans.

Antonio Eleuteri:

Ludicrous paper. They have essentially discovered that their ridiculous regression model has badly overfitted the crappy noisy data it has been fed. Another Hinton-level Nobel prize on the horizon.

Jim Amos:

So they go back to square one for ChatGPT 6 and pay those hundreds of thousands of worker slaves in the Global South to try a different tactic with RLHF. What if the end result is: A) a model that admits it doesn't know shit and can't answer _any_ reasonably complex question, and customers are duly disgusted? B) a model that believes it is an oracle of truth, and acts as such, with biases on steroids, a reflection of the views and beliefs of the maker (a la Musk's failed MechaNazi version of Grok)? Or C) something in the middle that also doesn't know how to be creative anymore, because its parameters are more rigid and it somehow has to do the dance between admitting that it doesn't know things and still trying to answer enough questions correctly that users don't throw it in the trash?

prMahmoudi Bachir:

Thank you for the deep and precise evaluation.

Reality Drift:

What I find interesting here is how much this reflects a broader pattern: when you reward systems for optimization alone, you get confident outputs that lose touch with context. That’s basically the recipe for drift. Hallucinations aren’t just a technical quirk. They’re a symptom of a deeper misalignment where coherence gets maximized while meaning gets hollowed out.

If benchmarks start valuing “I don’t know” the way humans eventually learn to, that’s not just an engineering tweak. It’s a shift in how we define fidelity itself. Until then, we’ll keep mistaking polished guesses for grounded knowledge.

Michael Wood:

The paper is fundamentally flawed. If the cause of hallucinations were what they say, then RAG-based implementations would have zero hallucinations. RAG shows that hallucinations can occur even when the LLM has the correct information to draw upon. Thus, it’s not merely an issue of rewarding the LLM for saying “I don’t know.”
