To me the whole paper feels like a giant cope. "Look, it's not the technology that's broken, it's the evaluations."
Could be. They're not solving OOD with this approach, that's for sure
You are very kind to call the authors of this paper "the authors" and not "OpenAI corporate research." Otherwise it would read "the training and reinforcement learning processes that we, OpenAI, developed were not appropriate tests for the wide release of the product OpenAI launched."
Haha I know. I just think any org is a complex entity with many branches and I can't be sure whether there are internal differences (OpenAI has ~3,000 employees), so I chose not to be snarky and comment on the content of the paper instead
So if LLMs hallucinate because today’s training + evaluation pipelines reward guessing over admitting uncertainty, then does that mean that the labs are optimising models for benchmarks and/or training on them? I would guess so. Hence, benchmarks may not be very useful indicators of actual real-world capabilities. Which was already obvious, but this paper kind of indirectly confirms it, and it's from an AI lab itself.
Yes, indeed. They're gaming the benchmarks for better scores and have been for years
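A minimal sketch of the incentive discussed above, assuming a benchmark that scores 1 for a correct answer and 0 for both a wrong answer and an "I don't know"; the function and the probabilities are illustrative, not taken from the paper:

```python
# Illustrative sketch (not from the paper): why accuracy-only grading
# rewards guessing over abstaining. Assumes a binary rubric that scores
# 1 for a correct answer, 0 for a wrong answer, and 0 for "I don't know".

def expected_score(p_correct: float, abstain: bool) -> float:
    """Expected benchmark score for one question.

    p_correct: the model's probability of guessing the right answer.
    abstain:   whether the model says "I don't know" instead of guessing.
    """
    if abstain:
        return 0.0          # admitting uncertainty earns nothing here
    return p_correct * 1.0  # guessing earns p_correct in expectation

for p in (0.1, 0.3, 0.5):
    print(f"p={p}: guess -> {expected_score(p, abstain=False):.2f}, "
          f"abstain -> {expected_score(p, abstain=True):.2f}")
# Guessing beats abstaining for any p > 0, so a model tuned for this kind
# of leaderboard is nudged toward confident fabrication.
```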
Thanks for the clear write-up, Alberto. Hallucinations are tricky, because the stochasticity of an LLM's output should itself be context-dependent: If you ask a model to generate an imaginative children's story, you want it to be more "creative" by making shit up; whereas if you ask the model for historical facts, you don't want anything resembling hallucinations.
They mention this in the paper, and I agree, it should be context-dependent, which makes it really tricky. I guess the best approach is to just solve the "making things up" problem and then go from there to let them make things up *in specific circumstances*
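One conventional knob for making output randomness context-dependent is sampling temperature: lower for factual queries, higher for creative ones. A minimal sketch of temperature-scaled softmax sampling over a toy next-token distribution (the logits and temperature values are made up for illustration, and this is not what the paper proposes):

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng=None) -> int:
    """Sample a token index from temperature-scaled softmax probabilities.

    temperature < 1.0 sharpens the distribution (more deterministic, better
    suited to factual answers); temperature > 1.0 flattens it (more varied
    output, e.g. for creative writing).
    """
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-6)   # guard against division by zero
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

# Toy example: three candidate continuations with made-up logits.
logits = np.array([2.0, 1.0, 0.1])
print("factual-ish (T=0.2): ", [sample_token(logits, 0.2) for _ in range(5)])
print("creative-ish (T=1.5):", [sample_token(logits, 1.5) for _ in range(5)])
```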
And yet 90% of people using these things don’t know how often they are wrong. Sigh.
Yes, surprisingly, this is not as widely known as I thought
OK, so LLMs tend to mansplain, making things up because they "should" know the answers
I don't think it's a good analogy: mansplaining is a cultural bias whereas hallucinations are a technical bias (at least that's the paper's argument; I think there's more to it)
AI hallucination may largely be a product of the users' own hallucinations: the AI hallucinates when we want it to hallucinate, through the prompts, commands, or even strange instructions that users give to AI models. The essential question that arises here is: what exactly does the user want? Are they seeking intelligence that surpasses human intelligence? That seems unlikely, even impossible, because model outputs and responses aren't evaluated as simply right or wrong; they are processed and interrogated, whatever their inputs and results, through human intervention (human intelligence), so that the AI-generated suggestions interact and harmonize with the cues of human interpretation, relying on deep and careful evaluation.