Congratulations on the Studio Ghibli Collection, OpenAI, but Not Much Else
We are all blithely losing the plot
ChatGPT's new image-generation capabilities are outstanding. Just log into X and search for “ChatGPT” or “Ghibli”—people are understandably freaking out and loving it. If I were an influencer, I’d call this a “ChatGPT moment,” if only for the attention capture. But while we’re in awe, it feels only fair to mention Google Gemini 2.5 Pro, a new state-of-the-art AI model that deserves at least as much attention, although it will get none. Since my job is to obey the crowd, I say we, too, forget about it.
(I genuinely like Google DeepMind’s work—one of my controversial predictions is that they will be on top by EOY—but they need to step up their marketing game. Whereas Sam sounds like a Zoomer, Demis and Sundar sound like Boomers… trying to be Zoomers. Some unsolicited advice from a late millennial: you need at least one team that doesn’t take itself so seriously, that doesn’t do things the old-school way, through committees, but through vibes. My two cents.)
Anyway, ChatGPT is now the best image-generation model (slightly better than Google's, in my experience). Character consistency, precise editing, text rendering, and some other weird experiments are superb. It's also the best at following instructions from context and inferring intent from implicit cues (multi-turn generation). That's the benefit of having it fully integrated into the language model. There's also softened censorship (Sam Altman hinted at this during an AMA a while ago). You can ask it to draw a hot woman or whatever tickles your fancy; the important thing is that it feels increasingly non-patronizing. That's good.
And then there are all the Studio Ghibli-like pictures flooding my feeds (this guy is the culprit). Miyazaki must be livid—or, if not, he’s surely lost whatever little faith in humanity he had left. As much as I'm enjoying the Ghiblification—it was cute for the first two million images I saw—I feel sad for Miyazaki. (Isn’t the whole point of the Ghibli style’s appeal that it’s carefully and painstakingly created by human artists?) I’m also surprised by the virality. I haven't seen my socials captured by a single theme since the ChatGPT screenshot frenzy of late 2022. We can conclude that, as Mr. Apples would say, “It's a good model, sir.” (Or so we will say until we start realizing this is but another incarnation of The Great Slopification.)
So, how did OpenAI do it? How did they solve image generation once and for all? The blog post, which as usual doesn't give away much detail, says this:
We trained our models on the joint distribution of online images and text, learning not just how images relate to language, but how they relate to each other. Combined with aggressive post-training, the resulting model has surprising visual fluency, capable of generating images that are useful, consistent, and context-aware.
I’d guess that the key factors here are the text-image joint distribution and what OpenAI vaguely calls “aggressive post-training.” Native multimodality already existed, but not at this level of quality. Researchers at other labs are starting to put forward interesting hypotheses as they try to figure out exactly what’s going on.
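OpenAI doesn't reveal the architecture, but the usual recipe behind “native multimodality” is a single autoregressive transformer trained over interleaved sequences of text tokens and discrete image tokens (e.g., codes from a VQ-style image tokenizer). Here's a minimal, purely illustrative PyTorch sketch of that idea; every name, size, and design choice below is my assumption, not a description of OpenAI's model:

```python
import torch
import torch.nn as nn

# Toy vocabulary layout: text tokens and discrete image tokens share one
# space, so a single autoregressive transformer can model their joint
# distribution. All sizes are made up for illustration.
TEXT_VOCAB = 1000          # hypothetical text vocabulary size
IMAGE_VOCAB = 512          # hypothetical VQ codebook size (image patch codes)
VOCAB = TEXT_VOCAB + IMAGE_VOCAB
D_MODEL, N_HEAD, N_LAYER, MAX_LEN = 128, 4, 2, 256

class TinyMultimodalLM(nn.Module):
    """Minimal decoder-only transformer over interleaved text+image tokens."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, 4 * D_MODEL,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYER)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, ids):
        B, T = ids.shape
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        # Causal mask: each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(ids.device)
        return self.head(self.blocks(x, mask=mask))

# One training step of plain next-token prediction. Because a caption and its
# image codes sit in the same sequence, the single loss teaches text->image,
# image->text, and image->image dependencies at once.
model = TinyMultimodalLM()
caption = torch.randint(0, TEXT_VOCAB, (1, 16))        # fake caption tokens
image = torch.randint(TEXT_VOCAB, VOCAB, (1, 64))      # fake VQ image codes
seq = torch.cat([caption, image], dim=1)
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                   seq[:, 1:].reshape(-1))
loss.backward()
print(f"toy joint loss: {loss.item():.3f}")
```

The point of the toy is only to show what “the joint distribution of online images and text” could mean mechanically: one sequence, one loss. Whatever “aggressive post-training” involves would come on top of a base model like this.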
Whatever the case, the consensus is clear: the world just witnessed a paradigm shift in image generation (one that will take time to show up in economic growth and productivity metrics, but still).
However.
There’s a line in OpenAI’s post that you probably skimmed past. They mention it, almost in passing. But I think it’s the most important part—and I’m going to explain why it changes how we should be thinking about this release.