OpenAI Sora: One Step Away From The Matrix
The best text-to-video AI model is also a... world simulator?
Yesterday, OpenAI announced the most important AI model yet in 2024: Sora, a state-of-the-art (SOTA) text-to-video model that can generate high-quality, high-fidelity 1-minute videos with different aspect ratios and resolutions. Calling it SOTA is an understatement; Sora is miles ahead of anything else in the space. It’s general, scalable, and it’s also… a world simulator?
Quick digression: Sorry, Google, Gemini 1.5 was the most important release yesterday—and perhaps of 2024—but OpenAI didn’t want to give you a single ounce of protagonism (if Jimmy Apples is to be believed, OpenAI had Sora ready since March—what?—which would explain why they manage to be so timely in disrupting competitors’ PR moves). I’ll do a write-up about Gemini 1.5 anyway because although it went under the radar, we shouldn’t ignore a 10M-token context window breakthrough.
Back to Sora. This two-part article is intended for those of you who know nothing about this AI model. It’s also for those of you who watched the cascade of generated videos that flooded the X timeline but didn’t bother to read the post or the report.
In the first part (this one), I review the model and the “technical” report (it deserves to be in quotes) at a high level (I’ll avoid jargon for the most part) and interleave through the text the best examples I’ve compiled, along with some insightful comments and hypotheses I’ve read about how Sora was trained and what we can expect from future releases.
Before you ask, OpenAI isn’t releasing Sora at this time (not even as a low-key research preview). The model is going through red-teaming and safety checks. OpenAI wants to gather feedback from “policymakers, educators and artists around the world.” They’re also working on a detection classifier to recognize Sora-made videos and on ways to prevent misinformation.
In the second part (hopefully soon), I’ll share reflections about where I think we’re going both technologically and culturally (there’s optimism but also pessimism). I hope you enjoy this first part because the second one, well, is not for amusement—which is appropriate given that soon everything will be.
Sora is a text-to-video model
Sora is a high-quality text-to-video model (compared to the competition), which is impressive in itself.
Here are my top three favorite examples (all videos are from either the blog post or the technical report unless specified otherwise). I like the colors of the first, the second is plain incredible—hard to believe it’s not real—and the third one has too much swag:
But Sora is also more than that. It can animate images into videos beyond zoom-out extrapolation and other simple techniques, guided by text prompts:
It can create new videos from other videos by adding scenes, creating loops, extending duration, and even interpolation, like this drone-butterfly scene (other examples):
And despite being a video model, it can create high-quality images from text (like DALL-E and Midjourney, arguably better than both). Adherence to the prompt is very high thanks to an internal recaptioning process (already present in DALL-E 3 but extended to videos):
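To make the recaptioning idea concrete, here’s a minimal sketch of how such a pipeline could look (my own illustration, not OpenAI’s code): a captioner model rewrites the short captions attached to training videos into long, descriptive ones, and the generative model is then trained on those richer text-video pairs. The `detailed_caption` helper is a hypothetical stand-in for whatever captioner OpenAI actually uses.

```python
# Hedged sketch of recaptioning: swap terse captions for detailed synthetic ones
# before training. `detailed_caption` is a hypothetical placeholder, not a real API.
from dataclasses import dataclass

@dataclass
class TrainingExample:
    video_path: str
    caption: str

def detailed_caption(video_path: str) -> str:
    """Stand-in for a highly descriptive captioner model (hypothetical)."""
    return f"A long, detailed description of everything happening in {video_path}."

def recaption(dataset: list) -> list:
    # Replace each original caption with a rich synthetic one.
    return [TrainingExample(ex.video_path, detailed_caption(ex.video_path))
            for ex in dataset]

raw = [TrainingExample("surfer.mp4", "surfing"),
       TrainingExample("tokyo_walk.mp4", "city at night")]
print(recaption(raw)[0].caption)
```

The point, as with DALL-E 3, is that a model trained on detailed captions follows detailed prompts more faithfully.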
Sora does all this—especially the video-related generations—much better than any competitor (just look at Google Lumiere). Here’s an example: the happy-cat video from Sora (below), compared with Pika AI, Runway, Leonardo, and FinalFrame.
Sora is a diffusion transformer
Sora combines a diffusion model (like DALL-E 3) with a transformer architecture (like ChatGPT). The mix allows the model to process videos (which are temporal sequences of image frames) much the way ChatGPT processes text.
In particular, OpenAI has taken inspiration from DeepMind’s work on vision transformers to “represent videos and images as collections of smaller units of data called [spacetime] patches, each of which is akin to a token in GPT.” Here’s a high-level visualization from the report:
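To give a rough sense of what a spacetime patch is, here’s a minimal sketch (my own, not OpenAI’s code): a clip is cut into small 3D blocks spanning a few frames and a small spatial window, and each block is flattened into one token. In the report this happens on a compressed latent representation of the video rather than on raw pixels; the raw-pixel version below just shows the bookkeeping, and the patch sizes are arbitrary.

```python
# Minimal sketch (not OpenAI's code) of turning a video into spacetime-patch tokens.
import numpy as np

def to_spacetime_patches(video: np.ndarray, pt: int = 4, ph: int = 16, pw: int = 16) -> np.ndarray:
    """video: (T, H, W, C) array of frames -> (num_patches, patch_dim) token matrix."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dims must divide patch sizes"
    # Group pixels into non-overlapping (pt, ph, pw) blocks, then flatten each block.
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)      # (nT, nH, nW, pt, ph, pw, C)
    return v.reshape(-1, pt * ph * pw * C)    # one row = one spacetime patch

# e.g. a 16-frame 128x128 RGB clip becomes a sequence of 4*8*8 = 256 tokens
clip = np.random.rand(16, 128, 128, 3).astype(np.float32)
tokens = to_spacetime_patches(clip)
print(tokens.shape)   # (256, 3072)
```

The payoff of this representation is that clips of any duration, resolution, and aspect ratio all become variable-length token sequences, which is exactly what transformers are built to consume.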
As I said above, the technical report deserves to be put in quotes because it’s very scarce on details for either replicating the work or understanding it deeply. We know very little about the exact architecture except that it’s a diffusion transformer, and little about the training data except that it’s captioned videos.
One hypothesis I’ve seen people support is that at least part of the training data comes from Unreal Engine 5 (MetaHumans, the Matrix demo) or other 3D engines (as the particular artifacts suggest). NeRF data is another hypothesis. There’s probably a mix of sources we will never know about.
Sora is a generalist, scalable model of visual data
Not only can Sora make images and videos from text and transform images and videos into other videos, it can do so in a generalized, scalable way, unlike its competitors.
For instance, Sora “can create multiple shots within a single generated video that accurately persist characters and visual style.” It can make videos up to 1 minute in duration, but you can also make them as short as you like. You can make vertical, square, and horizontal videos at different resolutions. From the report: “Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything inbetween.” Here’s an example.
Besides versatility, Sora appears to follow scaling laws that mirror those of language models. Quality improves substantially just by adding compute thanks to the characteristics of the transformer architecture. Here’s an example.
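As a purely illustrative sketch (the report publishes no numbers, only samples at 1x, 4x, and 32x base compute), this is the kind of power-law relationship scaling laws describe; the constants are made up just to show the shape of the curve.

```python
# Illustrative only: scaling laws typically take the form loss(C) ~ a * C**(-b).
# The constants below are invented; OpenAI shows 1x, 4x, and 32x compute samples
# for Sora but reports no quantitative curve.
a, b = 1.0, 0.3                     # hypothetical constants
for c in [1, 4, 32]:                # the compute budgets shown in the report
    print(f"{c:>2}x compute -> illustrative loss {a * c ** (-b):.3f}")
```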
This generalized, scalable nature is what prompts people to predict that AI will disrupt Hollywood and movie-making in general. Given the pace of progress, it’s not crazy to imagine that, within a few months, an AI model will be able to create complex multi-scene, multi-character videos 5 or 10 minutes long.
Do you remember Will Smith eating spaghetti? That was one year ago.
Sora is a (primitive) world simulator
This is the news that has excited (worried?) me the most.
First, here’s a recap. Sora is a text-to-video model. Fine, it’s better than the rest, but this technology already existed. Sora is a diffusion transformer. Likewise, OpenAI didn’t invent the combination, although they added interesting custom ingredients. Sora is a general and scalable visual model. Things begin to get interesting here. Possibilities open up for future research and surprise is warranted.
But, above all else, Sora is an AI model that can create physically sound scenes with believable real-world interactions. Sora is a world simulator. A primitive one, for sure (it fails, sometimes so badly that it’s better to call it “dream physics”), but the first of its kind.
OpenAI says Sora not only understands the style, scenery, characters, objects, and concepts present in the prompt, but also “how those things exist in the physical world.” I want to qualify this claim by saying that Sora’s eerie failures reveal that, although it might have learned an implicit set of physical rules that inform the video generation process, this isn’t a robust ability (OpenAI admits as much). But it’s surely a first step in that direction.
More from OpenAI on Sora as a world simulator (edited for clarity):
[Sora can] simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.—they are purely phenomena of scale.
Simulation capabilities:
3D consistency
Long-range coherence and object permanence (e.g. our model can persist people, animals and objects even when they are occluded or leave the frame)
Interacting with the world (e.g. a painter can leave new strokes along a canvas that persist over time)
Simulating digital worlds (e.g. Minecraft)
I like Jim Fan’s take on this (and his breakdown of the pirate ship fight video):
Sora is an end-to-end, diffusion transformer model. It inputs text/image and outputs video pixels directly. Sora learns a physics engine implicitly in the neural parameters by gradient descent through massive amounts of videos. Sora is a learnable simulator, or “world model”.
Of course it does not call UE5 [Unreal Engine 5] explicitly in the loop, but it's possible that UE5-generated (text, video) pairs are added as synthetic data to the training set.
OpenAI concluded the blog post with this sentence:
Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.
So I will conclude this first part with two questions for you:
How far are we from The Matrix?
Do we really want to go there?