OpenAI o3 and o4-mini Are More Impressive Than I Expected
But not for the reasons you may think
Last week, I wrote that “Google Is Winning on Every AI Front.” A lot of people—both in and out of the industry—agreed. I hinted that if Google kept pushing forward at this pace, OpenAI might not be able to catch up. But that was more provocation than prophecy. We knew o3 and o4-mini were on the way.
What I didn’t know was how much OpenAI had leveled up o3 since its December 2024 announcement. Or how impressive—and surprisingly cost-effective—o4-mini would be. Suddenly, it might no longer be Google in a league of its own. We’re back to a proper rivalry. Can o3 and o4-mini meaningfully challenge Google’s dominance?
On the one hand, Google, with its massive scale, deep infrastructure, limitless funding, and Demis Hassabis—a scientist’s scientist. On the other hand, Sam Altman, a marketing force with a higher tolerance for risk and a team that still owns the user experience.
Let’s see how o3 and o4-mini might shift this delicate balance. At the very least, they deserve a fair match-up against Gemini 2.5—the reigning world champion. Then we’ll see whether a victor rises above the noise.
The key novelty is not about numbers
We all love benchmarks, metrics, and watching one company outpace another with each new release. Those towering bars in performance graphs feel like proof—visual and intuitive—of who’s winning. They're easy to share and easy to remember. In that sense, OpenAI’s launch of o3 and o4-mini delivers. But maybe for the first time, that’s not the most impressive part of an AI release.
If you watched the livestream—or follow OpenAI staff on Twitter—it was pretty clear what they actually wanted to highlight. But in case you missed it (or avoid Twitter for very good reasons), here’s the point: o3 and o4-mini have been trained to combine perception, action, and reasoning into a single, integrated system.
I’ll explain in a moment why that matters—and why I think Google won’t be far behind—but first, let’s look at what OpenAI shared in their blog post (along with this deeper research post on “thinking with images”):
For the first time, our reasoning models can agentically use and combine every tool within ChatGPT—this includes searching the web, analyzing uploaded files and other data with Python, reasoning deeply about visual inputs, and even generating images.
Up to now, reasoners like o1 could think through complex problems, but they couldn’t do much. Meanwhile, agents like Deep Research or Operator could use a tool, but only that one tool, in isolation. What OpenAI has done with o3 and o4-mini is fuse these abilities. By integrating deep reasoning with multimodal perception and dynamic tool use, they’ve created something more powerful than the sum of its parts.
Critically, these models are trained to reason about when and how to use tools to produce detailed and thoughtful answers in the right output formats, typically in under a minute, to solve more complex problems. This allows them to tackle multi-faceted questions more effectively, a step toward a more agentic ChatGPT that can independently execute tasks on your behalf.
Researchers used reinforcement learning during post-training to help the models learn when and how to choose from their available tools—web browsing, Python, image generation, file analysis—based on the outcome they’re trying to achieve.
Their ability to deploy tools based on desired outcomes makes them more capable in open-ended situations—particularly those involving visual reasoning and multi-step workflows. This improvement is reflected both in academic benchmarks and real-world tasks, as reported by early testers.
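To make the shape of that loop concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not OpenAI’s actual API, tool set, or training setup: the “model” is a scripted stub, and the two tools (a fake web search and a toy Python sandbox) are invented for illustration. The point is the structure: at each step the model decides whether to call a tool or answer, and every tool result is fed back into its context before it keeps reasoning.

```python
# Illustrative only: a toy interleaved reason-act loop, not OpenAI's API or training code.
# The tool names, the scripted "stub_model," and the example task are assumptions for this sketch.

def web_search(query: str) -> str:
    """Stand-in for a real web-search tool."""
    return f"(pretend search results for: {query}) Iceland: ~383,000 people, ~103,000 km^2."

def run_python(code: str) -> str:
    """Stand-in for a sandboxed Python tool."""
    try:
        return str(eval(code, {"__builtins__": {}}))  # toy sandbox; not safe for real use
    except Exception as exc:
        return f"error: {exc}"

TOOLS = {"web_search": web_search, "run_python": run_python}

def stub_model(context: list) -> dict:
    """A real reasoning model chooses these steps itself; here they are scripted."""
    tool_results = [m for m in context if m["role"] == "tool"]
    if len(tool_results) == 0:
        return {"type": "tool_call", "tool": "web_search",
                "args": {"query": "Iceland population and land area"}}
    if len(tool_results) == 1:
        return {"type": "tool_call", "tool": "run_python",
                "args": {"code": "383_000 / 103_000"}}
    return {"type": "answer", "text": "Roughly 3.7 people per square kilometer."}

def agent_loop(question: str, max_steps: int = 8) -> str:
    context = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = stub_model(context)            # "think": decide whether to act or to answer
        if step["type"] == "answer":
            return step["text"]               # model decides it has enough to respond
        result = TOOLS[step["tool"]](**step["args"])          # "act": call the chosen tool
        context.append({"role": "tool", "content": result})   # "observe": feed the result back
    return "step budget exhausted"

print(agent_loop("What's Iceland's population density?"))
```

The decisions the stub hard-codes here (which tool to call, in what order, and when to stop) are exactly what the reinforcement learning described above trains the real models to make on their own.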
A few tweets from OpenAI staff underline what the company is emphasizing with this release: less about leaderboard scores, more about the trinity of perception, action, and thinking. It’s about how it feels to interact with something that can see, reason, and act in seamless coordination.
This is why I say this launch isn’t really about “o3 is better than Gemini 2.5 Pro” (even if it is, as we’ll see in the next section, though some early testers still place them in the same ballpark). The real shift is that it opens a new dimension in how we think about AI.
(Still, it wouldn’t be wise to take the word of the people profiting from o3 about how good o3 is. Use the model yourself; that’s the only metric that matters.)
Mandatory caveats: none of these models—o3, o4-mini, or Gemini 2.5 Pro—are fully reliable. And intelligence remains jagged: they can master tasks we call “hard” while still hallucinating reasoning steps we’d never invent, or fumbling surprisingly simple things. The problems they create are also changing; new capabilities bring new blind spots. So this isn’t about infallibility. It’s about scope.
o3 and o4-mini are the first AI systems to approach full interactivity across three layers: modalities (perception), tools (action), and disciplines (cognition). Senses, limbs, cortex.
In a way, this release marks the end of “AI model” as a useful category. We kept calling them models out of habit. But these should have been called systems all along. When multimodal reasoning and extended tool use become part of the core reasoning loop, the word “model” starts to feel too small. A human isn’t a model. A human is a system.
o3 and o4-mini resemble, more than anything before them, how humans operate: sensing, thinking, remembering, and acting in a continuous feedback loop (this kind of “online learning” is still in the works, though). We don’t toggle between seeing and reasoning, or between using tools and understanding outcomes. We do all of it, all the time, interleaved. AI now does too.
How far this resemblance goes, though, remains to be seen. (Tyler Cowen calls o3 “AGI” but I disagree: no AGI is this dumb at times.) We’ll need far more time—and far more rigorous testing—to understand the real scope of what we’ve just been handed. That’s the next chapter.