14 Comments
Jurgen Gravestein

To me the whole paper feels like a giant cope. "Look, it's not the technology that's broken, it's the evaluations."

Alberto Romero

Could be. They're not solving OOD (out-of-distribution) generalization with this approach, that's for sure.

Deborah Carver

You are very kind to call the authors of this paper "the authors" and not "OpenAI corporate research." Otherwise it would read "the training and reinforcement learning processes that we, OpenAI, developed were not appropriate tests for the wide release of the product OpenAI launched."

Alberto Romero

Haha, I know. I just think any org is a complex entity with many branches, and I can't be sure whether there are internal differences (OpenAI has ~3,000 employees), so I chose not to be snarky and to comment on the content of the paper instead.

Christian

So if LLMs hallucinate because today’s training + evaluation pipelines reward guessing over admitting uncertainty, then does that mean that the labs are optimising models for benchmarks and/or training on them? I would guess so. Hence, benchmarks may not be very useful indicators of actual real-world capabilities. Which was already obvious, but this paper kind of indirectly confirms it, and it's from an AI lab itself.

Alberto Romero

Yes, indeed. They've been gaming the benchmarks for better scores for years.

Mike X Cohen

Thanks for the clear write-up, Alberto. Hallucinations are tricky, because the stochasticity of an LLM's output should itself be context-dependent: If you ask a model to generate an imaginative children's story, you want it to be more "creative" by making shit up; whereas if you ask the model for historical facts, you don't want anything resembling hallucinations.

Alberto Romero

They mention this in the paper, and I agree: it should be context-dependent, which makes it really tricky. I guess the best approach is to first solve the "making things up" problem and then go from there, letting them make things up *in specific circumstances*.

Amy A

And yet 90% of people using these things don’t know how often they are wrong. Sigh.

Alberto Romero

Yes, surprisingly, this is not as widely known as I thought.

Mike Bauer

OK, so LLMs tend to mansplain, making things up because they "should" know the answers.

Alberto Romero

I don't think it's a good analogy: mansplaining is a cultural bias, whereas hallucinations are a technical bias (at least that's the paper's argument; I think there's more to it).

prMahmoudi Bachir

Based on deep and careful evaluation.

prMahmoudi Bachir

AI hallucination may largely stem from the users' own "hallucination": the AI hallucinates when we want it to, through the prompts, commands, or even strange instructions that users give AI models. The essential question at this point is: what exactly does the user want? Are they seeking an intelligence that surpasses human intelligence? That seems unlikely, or even impossible, because evaluating the models' outputs and responses is not a matter of right and wrong, but of processing and questioning those outputs, whatever their inputs and results, with human intervention (human intelligence), in order to obtain outputs generated from the AI's suggestions that interact and harmonize with the nuances of human processing, based on deep and careful evaluation.
