How Good Is Google Gemini Advanced?
Some people say it's GPT-4 level. Others are deeply disappointed. Who's right?
Google has released Gemini Ultra and rebranded Bard as Gemini. What was Gemini Ultra is now Ultra 1.0 (the language model), and what was Bard Advanced is now Gemini Advanced (the chatbot).
Here’s a quick review of the official info in case you’ve missed the launch.
Gemini Advanced costs $19.99/month (like ChatGPT with GPT-4) and it’s free for the first two months, which is more than enough time to test it thoroughly and decide if you like it or not (try it here). You can also stick to Gemini Pro, which is free.
Gemini Advanced is available through Google One’s AI Premium plan. The subscription includes other benefits, like extra storage and integration (coming soon) with Google’s services, like Gmail, Docs, Sheets, etc. (formerly known as Duet AI). Google’s goal is to integrate its best AIs with its widely used services to outcompete OpenAI.
Gemini Advanced is also available as an app for Android (if you opt in through Google Assistant, you can access the chatbot that way) and iOS. It’s been launched in 150 countries in English.
Google’s blog post says Gemini Advanced is the preferred option “in blind evaluations with our third-party raters,” which sounds similar to the LMSys arena but without the important part: transparency. Like GPT-4, Gemini Advanced is multimodal, has data analysis capabilities, and offers improved reasoning compared to Gemini Pro.
The strange contrast among Gemini users
That was the objective part, taken directly from Google’s announcement. For a subjective overview of Gemini’s behavior, you can take a look at Ethan Mollick’s notes.
Mollick was given early access to Gemini Advanced and tested it over six weeks, comparing it with GPT-4 across prompt settings and task categories (so far he’s written about Gemini’s linguistic and reasoning skills, not multimodality or code).
His first conclusion should give us an idea of what to expect from both his notes and from Gemini itself: “Gemini Advanced is clearly a GPT-4 class model,” qualified by a subtitle: “Gemini Advanced does not obviously blow away GPT-4 in the benchmarks.”
Mollick shares many examples that back up this level-headed statement. In some cases, Gemini is better. In others, GPT-4 is. He argues that comparing them can yield insight into what GPT-4-class models are generally capable of, while each remains distinctive, with strengths that apply differently across tasks:
GPT-4 is much more sophisticated about using code and accomplishes a number of hard verbal tasks better—it writes a better sestina and passes the Apple Test. Gemini is better at explanations and does a great job integrating images and search.
He emphasizes there’s room for improvement. Both systems fail more than we’d like and still hallucinate. Interestingly, he says they have distinct personalities yet remain compatible at the prompt level. In a way, he’s making an analogy: GPT-4 and Gemini Advanced are similar yet distinct, just like two similarly clever people with different personalities.
His conclusion is an open ending:
Gemini shows that Google is in the AI race for real, and that other companies besides OpenAI can build GPT-4 class models. And we now know something about AI that we didn’t before. Advanced LLMs may show some basic similarities in prompts and responses that make it easy for people to switch to the most advanced AI from an older model at any time. Plus, GPT-4’s “spark” is not unique to OpenAI, but is something that might often happen with scale.
I trust Mollick’s review. He’s more rigorous than most people and he’s had six weeks to craft these reflections. While he’s not saying Gemini Advanced is clearly better than GPT-4 (as Google has claimed a few times), he’s stating they’re in the same ballpark.
However, now that Gemini Advanced is open to everyone, a strangely contrasting picture emerges. Users who’ve finally gotten their hands (and prompts) on Google’s most powerful chatbot don’t seem to reach the same conclusion.
I’ve scrolled through a few social media platforms looking for anecdotal evidence of the general perception of Gemini Advanced, and my conclusion is as straightforward as Mollick’s but of the opposite sentiment.
People are extremely disappointed.
Here’s an r/Singularity user: “I have been playing around with it and comparing it to GPT4, and all across the board, GPT4 is much more accurate, seems to have a much greater knowledge base, and does not hallucinate as much.” Here’s Carlos Santana, from dotCSV, showing how Gemini Advanced fails the feather-lead weight test that ChatGPT gets right (here’s a different version involving quantities). Here’s another Redditor showing how Gemini fails the apple test while ChatGPT nails it. GPT-3.5 correctly answers a reasoning test about mirrored letters, but Gemini doesn’t. Perhaps you want to search for something? Maybe play rock-paper-scissors?
I could go on and on. Reddit and Twitter are filled with these, and it’s been just a few hours. This can be interpreted in two ways: “It’s been only a few hours, give it more time!” or “It’s been only a few hours and it’s already clearly so much worse?”
Mollick isn’t alone in his moderate praise of Gemini Advanced, though. François Chollet (admittedly biased given that he’s a Googler) says this: “I’ve been using Gemini Advanced for coding help for a while, and it’s really good.” It’s reasonable to discount his bias: saying this in public wouldn’t make sense if it weren’t genuine, since people can now try the tool themselves.
So, what’s going on? Why is there such a notable discrepancy between Mollick, Chollet, or the story Google tells us and casual users?
A few hypotheses that might solve the riddle
Evaluating language models and chatbots is hard. Traditional benchmarking is not the same as a blind leaderboard arena, which is not the same as informal testing for six weeks, which is not the same as a few hours of intentionally tricky prompting.
Benchmark-wise, the Ultra version of Gemini supposedly beat GPT-4 on 30 out of 32 tasks, a number that Sissie Hsiao, Google’s VP and general manager of Gemini experiences, repeated today in an interview for LinkedIn News Tech Stack. Mollick says GPT-4 and Gemini Advanced are similar performance-wise but different personality-wise. Most users who’ve publicly shared their impressions are very disappointed by the low quality of Gemini’s responses. (There’s no Elo score yet for Gemini in the LMSys arena, which will be a critical data point for reaching conclusions.)
There is no possible conclusion from that pile of mixed evaluations!
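A quick aside, since that arena Elo score keeps coming up: here’s a minimal sketch of how arena-style ratings emerge from blind pairwise battles. The starting ratings, the K-factor, and the matchup below are illustrative assumptions of mine, and LMSys’s actual computation is more involved (they’ve described Bradley-Terry-style estimation), so read this as the intuition, not their exact method.

```python
# Illustrative Elo update for pairwise chatbot battles (not LMSys's exact pipeline).

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Hypothetical example: the lower-rated model wins one blind battle.
gemini, gpt4 = 1200.0, 1250.0                    # made-up starting ratings
gemini, gpt4 = elo_update(gemini, gpt4, score_a=1.0)
print(round(gemini), round(gpt4))                # ~1218 vs ~1232: the winner gains ~18 points
```

The point is that thousands of these blind, crowd-sourced votes aggregate into a single ranking, which is exactly the data point still missing for Gemini Advanced.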
Here are a few hypotheses that could help explain this conflict. I won’t elaborate on them too much; once I have more evidence, I’ll get back to this. Treat them as what they are: hypotheses that fit the evidence so far but are nowhere near conclusive.
GPT-4 is better prepared to handle tricky tests. Most users I’ve read about today claim to have a “go-to” question to compare models’ abilities. GPT-4 is 1.5 years old, so it wouldn’t be surprising if OpenAI simply took care of those go-to problematic questions over months of constant fine-tuning. It’s well known that they’ve patched specific problems before when people complained on social media. This isn’t a judgment on OpenAI’s ways. On the one hand, they’re self-aware. Good. On the other, it can be misleading because they’re solving a particular instance but not the deeper cause. Perhaps Google hasn’t done the same, and this is reflected in Gemini’s poor performance.
Gemini is worse at reasoning and that’s what people are seeing. People tend to evaluate chatbots on reasoning tasks first because that’s what humans consider harder. Gemini, as Mollick said, is worse than GPT-4 in that particular area but not in others that are typically explored later. This could be part of that jagged frontier that Mollick himself conceptualized a few months ago. Its jaggedness might emerge not only between humans and AIs, as he meant it, but also among AIs of the same category, like GPT-4 and Gemini.
People are publishing only the worst results out of anti-Google bias. There’s a weird general distrust of Google. It’s not weird because it’s unwarranted (Google did try to sell us a heavily edited Gemini demo). It’s weird because OpenAI isn’t necessarily better: Sam Altman’s messaging is intentionally confusing, and users constantly complain that GPT-4 is getting lazier over time. I believe the difference in people’s sentiment toward Google vs. OpenAI is simply that OpenAI is much more responsive to user feedback. Google feels more opaque and impermeable. The result is a covert resentment that resurfaces every time Gemini gets something wrong.
People who think Gemini works fine don’t go online to rant. This is perhaps the simplest explanation of all. Social media isn’t a reflection of the real world. The picture we get from checking sentiment online can be drastically different from the reality offline. Perhaps Gemini works fine for most users but they don’t go to X to post about it—it just doesn’t get them dunk points. What’s left is a very biased picture that reflects only Gemini’s unsatisfactory behavior.
That’s all for today.
I focused this write-up on Gemini Advanced because it was released today, but the point applies to any LLM or chatbot: it’s just hard to evaluate them correctly. Benchmarking is deeply unreliable for reasons I’ve described before. And anecdotal evidence is just that, anecdotal.
Thorough, rigorous testing is your best friend, as is the LMSys leaderboard arena. Once Gemini Advanced is up in the arena and gets an Elo score, we’ll decide if it deserves to be vindicated.
Lately, people treat new AI chatbots the way kids torture their new Christmas toys. They poke and stretch them to see if, and when, they’ll break. We endlessly pick at language models to uncover flaws. Come on. Do we gain anything meaningful from this constant prodding? Or does it only briefly satisfy our petty urges?
We’ve become spoiled children surrounded by AI gifts. Obsessing over their imperfections serves little purpose, at least in the short term. We would benefit more from marveling at their superpowers before investigating their failures.
Let's wait for rigorous benchmarks to decide if AI Santa truly delivered.