You are very kind to call the authors of this paper "the authors" and not "OpenAI corporate research." Otherwise it would read "the training and reinforcement learning processes that we, OpenAI, developed were not appropriate tests for the wide release of the product OpenAI launched."
Haha I know. I just think any org is a complex entity with many branches and I can't be sure whether there are internal differences (OpenAI has ~3,000 employees), so I chose not to be snarky and comment on the content of the paper instead
That's fair and generous! I take the U.S. legal view that a corporation is, technically, treated as a single legal entity equivalent to a person. In a case like this, it seems almost an admission of guilt that OpenAI didn't rigorously test its products. Yes, there are individuals within corporations, but the corporation itself is liable for the failings of its products.
If this were a case of a product being sold with, say, toxic plastics, and it poisoned a bunch of people, and it turns out the company knew it could poison people because the poisons were baked into the design, it doesn't matter how the product was developed at the company or who was on the team.
If I were a lawyer on that horrid suicide case reported in the NYT, I would absolutely be bringing this paper to court. I definitely think it shows negligence on OpenAI's side, particularly since plenty of folks have already theorized how and why transformers hallucinate and how it could be prevented.
But I am a layman, not a lawyer. It's just strange to me that tech companies find ways to dodge regulation with "research" and "iteration." Corporate research papers are as representative of the company's product/culture as press releases.
Oh, I understand your point better now, and it makes so much sense. I don't have any law expertise, much less about the US (I'm from Spain), but I guess those who can do something about it are already taking care of this. Good point indeed.
Heh, I have no idea whether the lawyers will take care of it, but I like to think they're pretty smart. Do laws even matter anymore in the U.S., especially since the economy is hanging on the promise of AI? Who knows, and it's beyond my pay grade.
This is a great analysis either way. Thanks!
To me the whole paper feels like a giant cope. "Look, it's not the technology that's broken, it's the evaluations."
Could be. They're not solving OOD with this approach, that's for sure.
And yet 90% of people using these things don’t know how often they are wrong. Sigh.
Yes, surprisingly this is not as commonly known as I thought.
So if LLMs hallucinate because today’s training + evaluation pipelines reward guessing over admitting uncertainty, then does that mean that the labs are optimising models for benchmarks and/or training on them? I would guess so. Hence, benchmarks may not be very useful indicators of actual real-world capabilities. Which was already obvious, but this paper kind of indirectly confirms it, and it's from an AI lab itself.
Yes, indeed. They're gaming the benchmarks for better scores and have been for years.
"The authors suggest including direct penalties for incorrect answers"
Sensible and intuitive, but antithetical to the business model upon which deployment is predicated.
Exactly right
Thanks for that! Excellent write-up! Enjoyed reading it. I think you hit the nail on the head with the observation that once abstention is no longer penalised, 1) the model will hallucinate less, but 2) we have to accept slightly lower accuracy; we pay a small price, since a few previously lucky guesses will be no more! It's an excellent trade-off and it should be standard practice. Afaics this is 'problem solved', to the extent realistically possible.
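A back-of-the-envelope sketch of that trade-off in numbers (the 70/30 split, the lucky-guess rate, and the -1 penalty are assumptions for illustration, not figures from the paper):

```python
# 100 questions: the model truly knows 70; on the other 30 a blind guess is right 30% of the time.
known, unknown, lucky_rate = 70, 30, 0.3

# Raw accuracy: always guessing wins slightly...
acc_guess   = (known + unknown * lucky_rate) / 100   # 0.79
acc_abstain = known / 100                            # 0.70

# ...but the guesser also produces confident wrong answers (hallucinations).
halluc_guess   = unknown * (1 - lucky_rate)          # 21
halluc_abstain = 0                                   # it says "I don't know" instead

# Under a benchmark that penalises errors (+1 correct, 0 for "I don't know", -1 wrong),
# the abstaining model comes out ahead despite its lower raw accuracy.
score_guess   = known + unknown * lucky_rate - halluc_guess   # 70 + 9 - 21 = 58
score_abstain = known                                         # 70
print(acc_guess, acc_abstain, halluc_guess, score_guess, score_abstain)
```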
I get that the OOD idea is forever appealing to ML scientists. But imo it's not a great idea for application, and in application it may lead me astray. I mean the following.
In application, by definition all my knowledge is in the joint (X, Y) p.d.f. Then I ask a question X = x. In an auto-regressive model, I work out the conditional p.d.f. p(Y | X = x). It now depends on Y only, and that is where I will find my answer y. (The simplest case: 1-D X and 1-D Y, so the joint p.d.f. over (X, Y) is a 3-D shape; X = x is a cut with a plane, and the outline (X = x, Y), divided by a constant so it sums to 1, is my conditional p.d.f. p(Y | X = x).) The answer y I will sample from that conditional.
What does OOD mean in this context? That this particular x was unseen in the training data. Then I expect p(Y | X = x) to be flat, because I'm ignorant. (I may get unlucky and get a peaky conditional, but probably not.) I can detect the flatness (= high entropy) and say "I can't sample a good y from this conditional: it's too flat, nothing is particularly probable, everything is about equally probable" (i.e. equally improbable), so I return y = "I don't know". Good - that's the correct course of action imo.
How does that compare to a non-OOD question, when I'm in-distribution (IND) but still uncertain? Let's say my knowledge (X, Y) is "I threw a die, and a number came up." Let's say the user question is "what number came up - answer with a single word only". What happens? I'm IND. The conditional p(Y | X = x) has 6 peaks, at the 6 IND answers "one", "two", ..., "six". What should my answer be? I think I should also answer y = "I don't know" again (given the "answer with one word only" instruction). Again I can use a similar "level of peakiness" criterion to detect that I may not have a good choice for an answer.
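A minimal sketch of that flatness/entropy check on a toy discrete joint (the numbers and the threshold are made up; row x1 stands in for both the flat OOD slice and the six-way die case):

```python
import numpy as np

# Toy discrete joint p.d.f. over (X, Y): rows are questions x, columns are answers y.
joint = np.array([
    [0.60, 0.05, 0.05, 0.05, 0.05],   # x0: one answer clearly dominates
    [0.04, 0.04, 0.04, 0.04, 0.04],   # x1: flat slice (unseen x, or the die question)
])

def answer(x_index, entropy_threshold=1.3):
    # Cut the joint at X = x and renormalise: that's the conditional p(Y | X = x).
    cond = joint[x_index] / joint[x_index].sum()
    # Flatness check: high entropy means nothing is particularly probable.
    entropy = -np.sum(cond * np.log(cond))
    if entropy > entropy_threshold:
        return "I don't know"
    return f"y = {np.random.choice(len(cond), p=cond)}"

print(answer(0))   # peaky conditional -> samples an answer
print(answer(1))   # flat conditional  -> "I don't know"
```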
As a practitioner, I find this IND case and the previous OOD case very similar.
To my mind OOD is epistemic uncertainty, where I don't know the p.d.f.; where I do know the p.d.f., I have aleatoric uncertainty only. But for practical purposes, the OOD case is close enough to having a very large number of outcomes (in lieu of an "infinite number") instead of the 6 above. So while OOD may feel very different from IND, in practice for me it's more like "IND, but with very, very many outcomes - so not knowing anyway."
Thanks again for writing your post - I enjoyed reading it.
Thanks for the clear write-up, Alberto. Hallucinations are tricky, because the stochasticity of an LLM's output should itself be context-dependent: If you ask a model to generate an imaginative children's story, you want it to be more "creative" by making shit up; whereas if you ask the model for historical facts, you don't want anything resembling hallucinations.
They mention this in the paper, and I agree, it should be context-dependent, which makes it really tricky. I guess the best approach is to solve the "making things up" problem first and then go from there to let them make things up *in specific circumstances*.
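For what it's worth, the usual knob for that context-dependent randomness is the sampling temperature; here is a minimal sketch with made-up logits (and note that a low temperature alone doesn't stop a model from making things up):

```python
import numpy as np

def sample(logits, temperature):
    # Lower temperature sharpens the next-token distribution (factual recall);
    # higher temperature flattens it (creative storytelling).
    scaled = np.array(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.2]                 # hypothetical scores for three candidate tokens
print(sample(logits, temperature=0.2))   # almost always picks token 0
print(sample(logits, temperature=1.5))   # spreads probability across all three
```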
Even "historical facts" are context-dependent. Is it "correct" to answer "Who chopped down the cherry tree" with "George Washington" when this incident likely never occurred? It depends on why I'm asking. The LLM doesn't know when it is "correct;" it is always "guessing". In order for the human to evaluate the correctness of the answer the human needs to know where the AI found the information that fed into the response (and what other potentially relevant information the AI overlooked).
We would never call a student ‘knowledgeable’ just because they guessed their way to a passing grade. Yet that’s effectively what we’ve been rewarding in language model training.
Thanks for the write up and making it available!
At the end, I couldn't help but shudder at the assumption that solving hallucinations won't give rise to other, bigger problems. It's like they're trying to answer a philosophical question with maths under time/money constraints.
The premise of frictionless training seems completely misplaced (i.e. hallucinations are a symptom of a bigger problem, and sunk costs plus competitive pressure mean those involved can't start over or take risks). I'm shocked at how simplistic the architecture and training behind LMs actually are, reduced to a minute set of actions/formulas. As humans, we are still unable to comprehensively answer, using science, basic questions about what thinking, the mind, consciousness, self, ego, the subconscious, sleep and dreaming, etc., actually are. And we have years as children at our disposal to learn, with friction, guessing, intuition, penalties, etc., to figure out how to learn, language, thoughts, and so on.
Surely the OOD issue is a symptom of a preceding issue connected with choices in the approach and design? And connected to the concept of what LMs actually are? Even with some re-alignment of training methods (penalties for wrong answers... I suspected, but can't believe you're confirming, that those working on LMs are so arrogant as not to consider the evolution of teaching methods/science/pedagogy, probably because it costs too much time and requires grappling with unmeasurable variables, as you wrote), by way of analogy, it's like constantly adding sugar, then salt, then sugar, then salt to a soup to try to get it to taste right... Perhaps if they hadn't been hyped as the AI of AGI/ASI sci-fi fantasy fame, we wouldn't have an issue with LMs returning an "I don't know" answer. LMs are undoubtedly beautiful, but they risk being applied with devastating consequences because they're not fit for purpose. That many don't know better, or don't notice when using them, makes it even worse. Imagine what happens to the body eating a soup laden with hidden, industrial-scale proportions of salt and sugar masked by its taste! (Just today I had an exchange with an author who now regularly uses LMs to frame, research, and correct essays: the MIT study comes to mind!)
Because it costs too much time and because they probably consider it a dead end; they're not pursuing LLMs out of conviction so much as out of a lack of choices!
But if you're promoting reasoning models that hallucinate more, as OpenAI has, how can I trust the AI persuasion embedded in your models? To be honest, I think they're trying to game the demand for compute to keep the generative AI hype cycle's cash cow flowing.
More data centers mean more inference-time compute, more AI agents, and more demand for their, yes, hallucinating products! Not just hallucinating, but gamed to retain users rather than be truthful to them.
Yes, OpenAI has an important dilemma here: if they teach them not to hallucinate, that means sycophancy is off the table in many cases. You can't make them truthful and agreeable at the same time.
OK, so LLMs tend to mansplain, making things up because they "should" know the answers.
I don't think it's a good analogy: mansplaining is a cultural bias whereas hallucinations are a technical bias (at least that's the paper's argument; I think there's more to it)
I know it was a tongue-in-cheek comment.
I do often think we demand perfection from technology (autonomous vehicles, AI, etc.) but not from ourselves or other humans.
Ludicrous paper. They have essentially discovered that their ridiculous regression model has badly overfitted the crappy noisy data it has been fed. Another Hinton-level Nobel prize on the horizon.
So they go back to square one for ChatGPT 6 and pay those hundreds of thousands of worker slaves in the Global South to try a different tactic with RLHF. What if the end result is: A) a model that admits it doesn't know shit and can't answer _any_ reasonably complex question, and customers are duly disgusted; B) a model that believes it is an oracle of truth and acts as such, with biases on steroids, a reflection of the views and beliefs of its maker (a la Musk's failed MechaNazi version of Grok); or C) something in the middle that also doesn't know how to be creative anymore, because its parameters are more rigid and it somehow has to do the dance between admitting it doesn't know things and still answering enough questions correctly that users don't throw it in the trash?
Thanks for the deep and accurate evaluation.
What I find interesting here is how much this reflects a broader pattern: when you reward systems for optimization alone, you get confident outputs that lose touch with context. That’s basically the recipe for drift. Hallucinations aren’t just a technical quirk. They’re a symptom of a deeper misalignment where coherence gets maximized while meaning gets hollowed out.
If benchmarks start valuing “I don’t know” the way humans eventually learn to, that’s not just an engineering tweak. It’s a shift in how we define fidelity itself. Until then, we’ll keep mistaking polished guesses for grounded knowledge.
The paper is fundamentally flawed. If the cause of hallucinations were what they say, then RAG-based implementations would have zero hallucinations. Yet RAG shows that hallucinations can occur even when the LLM has the correct information to draw upon. Thus, it's not merely an issue of rewarding the LLM to say "I don't know."