PhDs Fail This 5th-Grade Riddle! Can You Solve It?
Sorry for the clickbait title, except it's not
I. The simple puzzle that eludes the best AIs
Studying the skill level of artificial intelligence systems is revealing. A stark contrast emerges when they perform at their best vs. their worst. Or when they succeed at the most difficult challenges but fail at the easiest.
The two best models in the world, Anthropic’s Claude 3.5 Sonnet and OpenAI’s GPT-4o, surpass the 50% mark on the hardest reasoning benchmark, the GPQA (graduate-level “Google-Proof Q&A”). Here are a couple of examples from the GPQA paper:
I’ve looked at a dozen questions from the benchmark (just for reference, I studied Aerospace Engineering and I’m a science enthusiast). My undergrad knowledge and expertise, combined with a passion for learning, would get me a score of ∼0%.
But while AI juggles quantum mechanics and organic chemistry, it struggles with this:
I’ve done a few IQ tests, so I’m familiar with this kind of puzzle. I’m also not a kid anymore, so I’m not really proud to say this, but I saw the solution right away. It took me literally a few seconds. It’s so easy that people who know how IQ tests work may at first look for a harder solution than the actual one.
I’d bet an average fifth-grader can solve this. Smart younger kids would, too.
How can Claude 3.5 and GPT-4o solve PhD-level problems but fail this puzzle?