The GPT-4 Generation: Why Are the Best AI Models Equally Intelligent?
A qualitative exploration of GPT-4, Gemini Ultra, and Claude 3 Opus
This is Anthropic’s week. They released the long-awaited Claude 3 (Opus, Sonnet, and Haiku). I won’t go over the technical details, as others have already covered them better than I could. Instead, I’ll focus on Opus specifically and say, as a starting point, that it is indeed better than GPT-4 across benchmarks. Yet, just as people said about Gemini Ultra, not by much.
This article is my attempt at answering the Big Question in AI right now: Why do the best AI models all cluster around the same level of intelligence (using performance as a quantitative proxy for intelligence), and why is that level GPT-4’s?
Besides GPT-4, Google’s Gemini Ultra (from here, Gemini) and Anthropic’s Claude 3 Opus (from here, Claude 3) belong to that class of super AI models, but I could add, extending the cluster in space and time, Mistral Large and Meta’s upcoming Llama 3. (Let’s keep GPT-5 out of this conversation for reasons we’ll understand soon.)
This isn’t to say the models are exactly the same. They’re not. They have a weird, almost alien kind of idiosyncrasy best described by Ethan Mollick in a couple of recent posts on Gemini vs GPT-4 and the prompting receptiveness of AI models. Here’s what he wrote about Gemini and GPT-4’s personalities:
Despite these personality differences, it is remarkable how compatible these two very different models are. Complex prompts that work in GPT-4 work in Gemini, and vice-versa… with some interesting exceptions that line up with the personality.
Some prompts work on both; others don’t. The same model tested with similar prompts—at least to the human eye—might give you contrasting results. Their behavior is, at times, humanly unpredictable. Here’s another excerpt by Mollick:
Prompts can make huge differences in outcomes, even if we don’t always know in advance about which prompt will work best. It is not uncommon to see good prompts make a task that was impossible for the LLM into one that is easy for it.
So it’s no surprise that equally intelligent models don’t always show the same strengths and weaknesses, or even the same behavioral traits, just as equally intelligent humans don’t. Why is that?
How can similar intelligence yield different behavior?
The behavioral differences that GPT-4, Gemini, and Claude 3 display can be attributed to many causes. But before we get into that, let me clarify that some tests are purported to reveal differences but don’t—especially results presumed to be novel, gathered from anecdotal evidence that was, perhaps unknown to the claimer, also gathered, in a similarly non-rigorous way, for other models. “Claude 3 is AGI” is the most common. There’s also the “Claude 3 is self-aware” one, which is also old (and I mean old). I’m not saying they’re false (although I could argue for that), just that they aren’t differences.
Models’ receptiveness to prompting techniques, subtle variations in prompts, and even their personalities, which Mollick explored, are perfectly valid reasons that users can discern from their own tests. Others are found in deeper and deeper layers of the models.
The most superficial layer besides user-level interaction is the system prompt—an instruction set that AI companies add to their models to steer behavior and set some safety boundaries. Here’s Claude 3’s. Just compare that to ChatGPT’s system prompt. Do you think they’ll yield the same responses to a given prompt? Most style and tone differences might be due to this. (I guess waiting a year to watch all the mistakes OpenAI would make paid off for Anthropic.)
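To make that concrete, here’s a minimal sketch of how a system prompt shapes responses, using Anthropic’s Python SDK. The system strings and model name below are illustrative placeholders I chose, not the real Claude 3 or ChatGPT system prompts—the point is only that the same user prompt, filtered through different hidden instructions, comes out with a different style.

```python
# Minimal sketch: one user prompt, two made-up system prompts.
# Requires the anthropic package and an ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

user_prompt = "Explain why the sky is blue."

for system_prompt in [
    "You are a terse assistant. Answer in one sentence.",
    "You are a friendly teacher. Use analogies and end with a follow-up question.",
]:
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=300,
        system=system_prompt,  # the hidden instructions sit here, above the user's turn
        messages=[{"role": "user", "content": user_prompt}],
    )
    print(f"--- {system_prompt}\n{response.content[0].text}\n")
```

Swap those placeholder strings for the published system prompts and you get a feel for how much of each model’s “personality” lives in this layer alone.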
If we dig further—which we can’t, actually, because all three models are extremely closed—or at least make directionally correct speculations, we find differences at the level of training and possibly architecture and data. It’s at those lower strata that the differences become more striking.