As shown on the ScreenSpot-Pro benchmark, Gemini seems to be better than the other LLMs at image understanding. I think this is why it’s good at ARC. Even though you feed the LLMs ARC-1 in text, the visualisation helps: when I see the puzzles myself they become a lot easier, so maybe giving Gemini a better ‘visual brain’ helps it do better too.
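For context on what "feeding ARC in text" means: an ARC task is a small coloured grid, and when it goes to an LLM it is typically flattened into rows of digits. A toy sketch (an illustrative serialization, not any harness's actual prompt format):

```python
# A toy ARC-style grid (digits stand for colours) and the flat text form
# an LLM typically receives. Illustrative only -- not the actual format
# used by the ARC-AGI harness or by Google.
grid = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]

# Flattened into a single string: the spatial structure now has to be
# re-inferred from newline positions.
as_text = "\n".join(" ".join(str(c) for c in row) for row in grid)
print(as_text)
# 0 0 1
# 0 1 0
# 1 0 0
```

In the text form the anti-diagonal is easy to miss; rendered as an image it is immediate, which is the ‘visual brain’ advantage being suggested here.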
As far as I know nano-banana is the best AI image generator/editor. AND Google is doing interesting video -> playable-world stuff. My theory is that they have figured out a way to get better knowledge transfer between these models, rather than just bolting image models onto LLMs.
One other question that stands out: how is Anthropic still so good at coding? Even with all these advances, Anthropic stays ahead on coding benchmarks and often seems to be the preferred choice for developers.
That's something I suspect as well. That Gemini is solving ARC-AGI with the help of its multimodal skills. About Anthropic, I really don't know. They focused exclusively on coding and had a bit of an advantage over the others.
I recall you speculated this would happen quite some time ago in your posts
Indeed Michel, Google has been the obvious bet to me for a long while!
Too much tech jargon...
So I asked G3:
Here are the three massive shifts happening right now:
1. Coding: From "Ingredients" to "The Meal" 👨‍💻
The Old Way (Gemini 1.5): You ask for a Pomodoro timer. The AI spits out 50 lines of Python code in a grey box. You have to copy it, paste it into an editor, run it, and debug the errors yourself.
The Gemini 3 Way: You ask for the timer. The AI builds and runs the app instantly right in the chat window. You click the buttons, use it immediately, and can say "make it blue" to update it in real-time. No copy-pasting required.
2. Planning: From "Walls of Text" to "Visual Dashboards" 🗺️
The Old Way: You ask to plan a 3-day trip to Rome. The AI writes a long, dry bulleted list of text. You have to read through it and manually Google the locations to see if they look good.
The Gemini 3 Way: The AI treats your screen like a canvas. It generates a visual magazine layout with an interactive map, photos of hotels, and a clickable schedule. You can tap a hotel to swap it out without ever leaving the chat.
3. Logic: From "Fast Answers" to "Deep Reasoning" 🧠
The Old Way: You ask a tricky riddle or a complex data question. The AI rushes to answer in 2 seconds, prioritizing speed over accuracy. It sounds confident, but it often misses the nuance or hallucinates.
The Gemini 3 Way: The AI pauses. It actually "thinks" first (simulating a chain of thought). It checks its own work—"Wait, that calculation looks wrong, let me convert the currency first"—and then delivers the solution.
The Takeaway: The friction of "translating" AI answers into real work is disappearing. We aren't just searching for information anymore; we are generating working solutions.
It can do much better than that
Gemini still hallucinates a lot
https://x.com/artificialanlys/status/1990926803087892506?s=46&t=dJjY0Wbsd_ZAxyIR8vVDtw
Appreciated the deep dive. Your breakdown of Gemini 3’s ARC-AGI 2 leap really clarifies why this isn’t just another model update. The scale of improvement is remarkable and worth noting for anyone tracking frontier AI progress.
Impressive performance indeed. But I’m not as impressed with the gain on ARC-AGI, at least not to the point of being convinced that this is a major breakthrough. It’s a good benchmark, sure (at least, as far as we can tell), but couldn’t Gemini 3 have been “trained to the test”? We know the general patterns used in ARC-AGI, so if they created some samples of their own and post-trained with reinforcement learning, I would expect that to provide a big jump in performance - a jump similar to what we see here.
Not trying to take anything away from Google’s accomplishments with Gemini, but as you say we shouldn’t put too much credence on any one benchmark.
If it were this easy, don't you think everyone would have aced it by now? No: ARC-AGI is specifically designed so that you can barely get any gains even by targeting the train set. That's what Chollet means when he says it requires fluid rather than crystallized intelligence.
Good counterpoint. RL isn’t a magic bullet, so I suspect it wouldn’t lead to sudden excellent performance, but it could provide more than incremental gains. Or I’m wrong, they’re not doing that, and it’s primarily down to better visual processing.
I dislike how they changed the ability to set the thinking budget in AI Studio. Instead of being able to ramp it up all the way to 32K for Gemini 3, it’s now just “high” or “low”, and I have a sneaking suspicion that “high” is 8K.
Of course the workaround is to get fine-grained control via the API, but doesn’t that just defeat the purpose of AI Studio?