All You Need to Know About Google Gemini 1.5 (Hint: It's More Important Than Sora)
A week after releasing Gemini Ultra 1.0, Google is back with exciting news
Let me say that again: Gemini 1.5 is more important than Sora.
Last week, on Feb. 15, in what I thought was an aggressively successful marketing move, Sam Altman and the Sora team upstaged Google’s announcement of Gemini 1.5 by flooding X with visually appealing AI-generated videos.
They won (again) the public fight against Google, but clever people like you and me should look past marketing tactics. So here’s my take: Gemini 1.5’s breakthrough is more profound and will have far-reaching implications.
For months, Google and DeepMind have been hibernating while OpenAI bewitched its users with new features and hints at a bright future—AI agents, web search, gargantuan fundraising—but it wasn’t for nothing. They’ve awakened with determination (even if OpenAI remains even more determined to steal the spotlight).
After releasing Gemini Advanced (with Gemini 1.0 Ultra) two weeks ago, they announced Gemini 1.5 last week (technical report). Gemini 1.5 “shows dramatic improvements across a number of dimensions and [the 1.5 Pro version] achieves comparable quality to 1.0 Ultra, while using less compute.”
Three key takeaways:
1. The Gemini 1.5 version being released is Pro, not Ultra (which will likely come out sometime in 2024). Although it’s not better than Gemini Advanced, it’s worth checking out anyway if only for points two and three below.
2. Gemini 1.5, in contrast to Gemini 1.0, is a multimodal sparse Mixture of Experts (MoE), like GPT-4 (OpenAI never confirmed this info, but it’s consensus now). This means it can be very performant while keeping a low latency.
3. Gemini 1.5 works with a 1-million-token context window (up to 10M in research). This version of Gemini 1.5 is being rolled out to devs and enterprise customers first (the rest of us get a 128K-token window). The previous longest-window model was Anthropic’s Claude 2.1 at 200K. That’s a 5-50x increase.
If you don’t care about technical details, that’s sufficient. Know that Gemini 1.5 Ultra (once it comes out) will likely be better than anything else by far. The question is whether Google will announce it before or after OpenAI announces GPT-5.
If you care about what a 1-10M-token context window implies and why I consider it a more important breakthrough than OpenAI Sora, let’s get into a bit more detail.
Gemini 1.5 Pro: A multimodal Mixture of Experts
First, an interesting question: Why did Google withhold Gemini 1.5 Ultra?
My original guess was that Google wanted to release Pro first to get a better assessment of the 1M context window in the wild without the additional capabilities that go with a larger model. After Sora came out, I updated my guess to a 4D-chess move by Google: bait OpenAI into releasing something big after Gemini 1.5 Pro, then top it with 1.5 Ultra. My current guess is Ockham’s razor at play: elections are coming, so better be careful.
If I’m right, both Gemini 1.5 Ultra and GPT-5 will come out at the very end of 2024.
Now, before I explain the 1-10M context window, which is the true breakthrough, let me review the performance comparisons and the architecture.
One surprising result is that Gemini 1.5 Pro roughly matches Gemini 1.0 Ultra across benchmarks (see table below). That suggests the sparse model, which activates only a subset of its experts for each input, is as good as a larger dense model like Ultra 1.0 while remaining faster and less compute-intensive (remember that the experts in an MoE are activated independently, depending on the result of a routing function that decides which experts process which input).
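To make the routing idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is not Google's implementation (the technical report doesn't spell out the routing details), and every name in it (`route_top_k`, `moe_layer`, the toy scaling experts, `k=2`) is my own assumption for illustration. The point is only that the router scores all experts cheaply, but just `k` of them actually run.

```python
import math
import random


def softmax(scores):
    """Turn raw router scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts.

    Returns (expert_index, weight) pairs, with the weights
    renormalized so they sum to 1 over the chosen experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, experts, router_weights, k=2):
    """Hypothetical sparse MoE layer for a single input vector.

    router_weights holds one row per expert; the dot product of a row
    with the input is that expert's routing score. Only the k selected
    experts are evaluated; the rest cost nothing.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = route_top_k(logits, k)
    out = [0.0] * len(token)
    for idx, weight in chosen:
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in chosen]


def make_expert(scale):
    """Toy expert (an assumption): just scales its input."""
    return lambda vec: [scale * v for v in vec]


if __name__ == "__main__":
    experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
    rng = random.Random(0)
    router_weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in experts]
    output, used = moe_layer([0.5, -1.0, 2.0], experts, router_weights, k=2)
    print(f"experts used: {used} out of {len(experts)}")
```

With four experts and `k=2`, half of the expert parameters sit idle for any given input. That is the basic trade a sparse MoE makes: more total capacity, without paying for all of it on every forward pass.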
MoE architectures seem to be the path forward. OpenAI first, then Mistral, and now Google have shown that sparse MoE models far outmatch the corresponding dense LLMs. It’s important to keep in mind that dense and sparse models aren’t an apples-to-apples comparison, but Google has tested the performance of the 1.0 and 1.5 versions in long-context multimodal tasks anyway, so here’s an overview:
Despite the context window being much larger for Gemini 1.5 than for any other model (either by Google or its competitors), performance remains largely intact. From the technical report, “This leap in long-context performance does not come at the expense of the core multi-modal capabilities of the model.”
A hint of what’s to come.
The breakthrough: 1-10M-token context window
If we set aside the context window breakthrough, Gemini 1.5 is not so different from GPT-4: a high-quality multimodal MoE. Perhaps inference latency and the fact that Google uses custom hardware are notable differences, but they’re not worth an entire post or much attention from the community.
The 1-10 million-token context window is, however, the most important technical leap so far in 2024—more than OpenAI Sora, although it is harder to see why. Here’s my attempt at explaining it.