OpenAI Sora: You Can't Believe Your Eyes Anymore
From chaos we came and to chaos we shall return
I promised my analysis of Sora would be divided into two parts: one, which I published last week, about Sora the AI model—what it is and what it can do—and a second about the broader second-order cultural and social implications of the technology.
As I dug into this second part, I realized it should be further divided into two. Both deal with how video-generation AI (not just Sora but in general) affects our culture, but I approach them from opposite perspectives: the part you’re reading now is concerned with what we’ll lose as AI-generated videos become indistinguishable from human-made ones. The next, which I’ve yet to write, is about what we’ll gain.
Although this distinction may look like a negative-positive dichotomy, that’s not my intention. We can gain something we don’t want or need, and that doesn’t make the gain better than the loss of something we held dear—or at least something we took for granted, whose absence we’ll have to grudgingly accept.
Alternatively, out of fairness to a position I can’t support at this time but that I acknowledge some people hold, I concede that adaptation to seemingly unneeded changes is at least as powerful as resistance to unwanted ones. Worlds are built out of this paradox, and most people don’t complain that much after the fact (or do you know anyone who, as Socrates did, condemns writing as a forgetfulness-enabling tool?). I’ll try my best to reflect this truth in my articles as well.
The decency of letting us know we’re going blind
OpenAI Sora is okay but not great.
If it were bad, this article would not matter. If it were sublime, it’d be too late.
We’re at the critical time window when, like a newborn kitten, we must open our eyes or go blind forever. It’s the brief period between the moment we see the storm appearing on the horizon and the moment it hits us with all its power and wrath. That’s us, right in the middle.
OpenAI announced Sora but, unusually, didn’t release it. They wanted us to see the clouds so we could prepare for the rain (at least better than we did for ChatGPT).
As others have said, you can think of Sora as a GPT-3 for text-to-video—not there yet, but a hint of what’s to come. After GPT-3 (2020) we got GPT-3.5, then ChatGPT, then GPT-4, all within three years. Sora 2 (OpenAI isn’t very good at names) is coming much sooner than we’re prepared for (perhaps even sooner than that, if we believe rumors that OpenAI has had Sora ready since March 2023, which would be as impressive as it is terrifying).
The blow will hit us, true, but thanks to OpenAI’s heads-up it won’t catch us off guard this time. We have a precedent. ChatGPT (or GPT-4) was the natural continuation of GPT-3. The scaling laws that predicted GPT-4 from GPT-3 seem to apply to text-to-video as they do to language models: more compute, better data, and more parameters will eventually lead to performance breakthroughs.
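For the technically inclined, these scaling laws take a simple power-law form. The sketch below is the version reported for language models by Kaplan et al. (2020); whether text-to-video models follow the same curve, let alone the same exponent, is an assumption of the argument, not an established result.

```latex
% A minimal sketch of the empirical compute scaling law from
% Kaplan et al. (2020), "Scaling Laws for Neural Language Models".
% L is test loss, C is training compute; C_c and \alpha_C are fitted
% constants (\alpha_C \approx 0.05 for the language models they studied).
% Assumption: extending this form to text-to-video is an extrapolation.
L(C) \approx \left( \frac{C_c}{C} \right)^{\alpha_C}, \qquad \alpha_C \approx 0.05
```

In plain terms: loss falls smoothly and predictably as compute grows, which is why a much better Sora 2 looks less like a gamble and more like a schedule.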
We can expect the Sora → Sora 2 leap to be comparable to GPT-3 → ChatGPT. So let’s make the most of what we know to prepare for what’s coming.
Sora is not just your everyday deepfake maker
Sora is a tragedy in the making (unlike ChatGPT or DALL-E) because video has always been a powerful fact-checking information medium.
Text was never trustworthy, and images, although mostly believable for a while, can be trivially altered with software that long predates AI (e.g., Photoshop). ChatGPT and DALL-E add a scale factor to that untrustworthiness, but we have long been wary of the media ecosystem they belong to (which they allegedly threaten). Videos, however, are (were?) a source of truth.
However, Sora isn’t the first AI tool that can alter, modify, or create fake videos. Deepfakes have been improving since they first appeared in 2017, conquering audio, images, and video. The recent Taylor Swift fake porn incident, and the $25 million missing from the bank account of a Hong Kong multinational, suggest Sora will be, at most, a new way of doing a similar kind of harm.
But, as OpenAI generously advises us, we should look ahead, to the incoming storm. Sora is okay, but Sora 2 will be much more powerful: a kind of improvement we wouldn’t get from linearly improving existing tech. Sora—the idea, the breakthrough—is not the same as common deepfake tech. It’s not the same kind of threat to video as a source of truth.
What makes Sora special, then, is its scope.
What appears to be a “small,” quantitative technical step (e.g., editing real videos to make a deepfake → generating deepfake videos from scratch with Sora) can entail a drastic qualitative jump once its effects are translated to the socio-cultural landscape.
I know the consequences will be drastic because it’s not the first time it’s happened. As I said, pictures were mostly believable for a while—once a trustable source of information comparable to video today—but we lost that.
Photography wasn’t reliable at first. At the time it was invented, we could only rely on reason and our senses to access the truths of the world. Cameras were originally thought of as an artist’s tool, not a device for reality capture. 19th-century photographers didn’t hesitate to change a detail or remove an object here and there, as long as it favored their artistic desires (or other, more obscure motives) or helped them escape the “tyranny of the lens,” as Henry Peach Robinson called it.
Only a century later did photography become a trustable medium. But even then, in the naive pre-Photoshop era, most people conceded pictures a high, albeit at times undeserved, epistemic value. Photoshop and other editing software weren’t the first inventions to challenge the camera’s role as a source of truth, but the scale of the threat they posed was remarkable. Suddenly, anyone could twist reality at will, eroding a means of truth-grounding we had taken for granted.
Sora does, for the first time in history, the same for video. A continuous change technologically speaking—one we could, if knowledgeable enough, see coming—but a sharp socio-cultural leap.
What will happen when we start to use text-to-video AI to create educational videos that have subtle but critical mistakes? What will happen when the deepfakes malicious actors create aren’t constrained to existing videos but unbounded in style, setting, and character—capable of generating seemingly real “counterfeit people”?
In an illuminating New Yorker essay, Daniel Immerwahr warns us not to take comfort in having successfully neutralized earlier assaults on the information ecosystem:
It’s possible to take comfort from the long history of photographic manipulation, in an “It was ever thus” way. Today’s alarm pullers, however, insist that things are about to get worse. With A.I., a twenty-first-century Hoxha would not stop at awkwardly scrubbing individuals from the records; he could order up a documented reality à la carte.
Sora is not your everyday deepfake maker (hardly accessible to any but the most tech-savvy, and constrained in scope), but a boundless reality-twisting tool soon to be in anyone’s hands.
The trust flip we didn’t expect—or want
Immerwahr goes on to say that perhaps we’re underestimating humans’ ability to not be deceived. I agree. Deepfakes, even high-quality ones or from-scratch fakes like Sora’s, are not reality-bending in the way most people believe. Perhaps the “epistemic apocalypse” we’re terrified of is no such thing.
And why would it be? We evolved without a “trust first, check later” kind of natural mechanism. For most of our history—and I mean hundreds of thousands of years—knowing with such certainty what we can or can’t believe just wasn’t a thing. We had to “check first and trust later,” as Kevin Kelly writes in his recent essay “The Trust Flip.” Photography and video cameras were a fleeting—albeit welcome—detour from an eternal state of epistemic uncertainty.
As Kelly says, generative AI is merely returning trust to its rightful place, conditioned on our individual ability to ascribe it correctly:
The arrival of generative AI has flipped the polarity of truthfulness back to what it was in old times. Now when we see a photograph we assume it is fake, unless proven otherwise. When we see video, we assume it has been altered, generated, special effected, unless claimed otherwise. The new default for all images, including photographic ones, is that they are fiction – unless they expressly claim to be real.
This all sounds like a reasonable counter-argument to those who are afraid of a future riddled with uncertainty. Why be afraid when that was all we knew up until the 19th century?
To this, I say that fear might not be warranted, but something else is: asking why. Why would I want to go back to epistemically brittle pre-photography times just because someone found it worthwhile to make a fake-video-generating tool? Something should be offered in exchange to merit such an unnecessary concession.
Kelly doesn’t pass judgment, but that’s the part that matters, right? It’s deeply undesirable to sacrifice the common good of having a shared ground truth for… the promise that the future will be better. Because that requires trust. And in an ironic twist of fate, my long-evolved mechanism for detecting who deserves trust tells me not to trust OpenAI.
This trust flip we didn’t expect or want is a degradation of the quality of the information ecosystem. I’m not sure anything on the gain side of the balance could compensate for this.
But I said I’d be fair, so here’s the natural response to my argument—not really evidence-based, but a history-doesn’t-repeat-but-it-rhymes kind of prediction: We adapt. We always do. And once we do, we realize the world is better off.
I won’t challenge this notion because I agree with it. Technology happens and, a few generations later, it’s not only taken for granted but often seen as an irreplaceable reality. How many times have you thought, “I couldn’t live without [insert literally anything]”? Well, that thing, whatever it is, is the product of technological progress.
This is true, and so is the other side of the coin: technology is always a trade-off with the customs of the times it disrupts. While it’s happening, it’s hard to see the benefits of technology. After it’s happened, what becomes hard is, instead, remembering the part of life that was better before.
In the post-Sora world, here’s what we will have lost and will promptly forget: the calmness of being able to outsource our trust.
As Kelly says, we came from chaos—we evolved to thrive in a “check first and trust later” kind of world. We’ll do fine there this time, too. But losing that calmness and adapting to chaos will be a slow, painful process. The trust flip was never without a cost.
Having gotten out of that wild, misty, uncertain, and unforgiving world for a while, only to be forced back into it once again, is just not something I had on my “This is the future I want” bingo card.
"The calmness of being able to outsource our trust." - Should have been the title. I am going to have to re-read this when I have more time. So many gems.
I’ve heard more people call Sora the GPT3 moment for text-to-video. “We can expect the Sora → Sora 2 leap to be comparable to GPT-3 → ChatGPT”
What makes you assume that?