You Must See How Far AI Video Has Come

The implications go beyond disinformation, AI, and even technology

Dec 18, 2024

A cat roars while looking at its reflection in the mirror but instead sees itself as a lion roaring

The Gist

I. Google DeepMind's Veo 2 is leagues ahead: Google’s Veo 2 obliterates the competition in AI video generation, delivering unmatched quality, physical consistency, and prompt adherence—it’s not a slot machine; it’s an ATM. Compared to Veo 2’s seamless outputs, OpenAI’s Sora Turbo feels rushed.
II. Win something, lose something: Deepfakes blur reality, not through deception but as viral expressions, like the Pope in a Balenciaga coat. “We won the ability to make fantasy into reality. We lost the ability to tell them apart,” trading trust in visuals for boundless creativity.
III. We must remember what was to judge what is: As progress accelerates, we risk losing more than we gain. “Traditions, customs, practices… are slipping away too fast,” leaving older generations disconnected from the young. Thoughtful progress is vital to avoid drowning in speed.

I. Google DeepMind’s Veo 2 is leagues ahead

Last week I wrote an overview of Sora Turbo, OpenAI’s new AI video model. Suffice it to say that I find the metaphor of a “very expensive slot machine” humiliatingly fitting.

OpenAI has mastered the marketing game and enjoys a polished development-deployment pipeline, but they rushed Sora. It’s terrible. Open-minded artists call it unusable. Comparing it to a slot machine is doing slot machines a disservice: sometimes they give you what you want. Sora's lack of physical consistency and prompt adherence makes it miss every time.

Just look at this, a crude but faithful representation of its ability:

I restrained myself, however, from extrapolating OpenAI’s failure to the entire AI video generation space. Are other companies struggling this much? Is the technology just too hard? Meta Movie Gen seems ok but they’re not giving people access and we know the one skill all AI companies really dominate is cherry-picking demos.

Kling 1.5 and Tencent’s Hunyuan are decent. MiniMax’s Hailuo as well. These are also better than Sora. And they’re readily available, which suggests OpenAI might have lost its secret sauce. But let’s be real: These barely known AI models won’t face OpenAI’s great challenge: How to provide 300 million users access while keeping cost and latency low enough to prevent the model from being a cash drain or annoyingly slow.

OpenAI is often the first and the best at what it does, but not always. Sometimes they have to choose between first and best. With Sora they chose neither. AI video is the branch of generative AI where they’re furthest from the top. But even the others I mentioned (Meta, Tencent, Kling), which have yet to face distribution problems, aren’t much better quality-wise.

None of them has solved the complexity of the physical world to the point where the finely attuned human eye isn’t unsettled by something that feels off. And, when it comes to video generation, that’s the only benchmark that matters: Can it fool me?

Google DeepMind Veo 2 does. And so reliably that the metaphor needs an upgrade. Veo 2 is not a slot machine. It’s an ATM.

Just look at it (sources: 1, 2, 3, 4).

Prompt: “A cat roars while looking at its reflection in the mirror but instead sees itself as a lion roaring.”

Prompt: Not available.

Prompt: “Bear writing the solution to 2x-1=0. But only the solution!”

Prompt: Not available.

The quality, physical consistency, and prompt adherence of these videos reveal two takeaways. First, Google destroys the competition in AI video generation by a substantial margin. Here are Veo 2’s human preference benchmarks, compared to the second-best models:

See how they compare for yourself (sources: 1a, 1b, 2a, 2b, 3, 4).

Prompt: “A bartender making an old-fashioned cocktail.”

Veo 2:

Sora Turbo:

Prompt: “A pair of hands skillfully slicing a ripe tomato on a wooden cutting board.”

Veo 2:

Sora Turbo:

Here’s Veo 2 and Sora side by side:

Here are all the main AI video models (the video goes over them one by one):

Google is so far ahead of the rest that it hurts to see them try.

The second takeaway—and the focus for the rest of this article—should be clear to you, dear reader: keeping up with society-wide technological paradigm shifts is impossible for those who remain unaware.


II. Win something, lose something

Deepfakes—audiovisual information that’s not real but appears to be, created with modern AI techniques—have conquered every medium. I was prepared for AI to strip away our fragile grasp on reality, but I wasn’t ready to face it so soon. Leading labs have speedrun the text → image → audio → video pipeline in barely four years.

The hidden implications go beyond misinformation and propaganda. They go beyond AI and technology. The pillars that support our civilization—truth and trust—are being shaken down to every corner. When perfect counterfeit avatars of Donald Trump, Vladimir Putin, and Xi Jinping go on TV to declare World War III, you won’t know if you should run for your life or curse the day the AI industry gave this power to any stupid kid with a smartphone.

Despite the extreme scenarios that we may come up with to justify our fear of deepfakes, I hold the unpopular opinion that their real power is not deception but that they’re easily weaponized as memetically viral vectors of expression.

Pope Francis in a white puffer jacket in an image generated by artificial intelligence.

No one cared if the Pope actually dressed in a white Balenciaga coat. The motivation behind the image was orthogonal to the underlying truth. It wasn’t intended to be deceitful but funny and surprising, the perfect mix to convey a message: The Pope can be cool, too, even if you still know he is not dressing like that. The seed of this idea (popes can be cool) is living rent-free in your head, even if you reject its veracity.

The effects of this shouldn’t be overlooked. AI video generators can bring our wildest ideas to life through the most information-packed and attention-grabbing medium we have: video. Beyond being believable, video is inherently captivating. This isn’t so much a tool for liars as it is for charismatic leaders. It seldom changes minds—it ignites your soul and awakens ideas that lie dormant within.

So where do Google’s AI video capabilities leave us? What do we do now?

I’m not worried about misinformation, whose effectiveness bottleneck lies in the channels of distribution (social media) rather than the engines of creation (AI). I’m more worried, or perhaps just sad, about having lost something precious, the privilege of believing what we see, replaced by the power to render any fantasy as real. I hinted at this in “You can’t believe your eyes anymore.”

Will we adapt to this? Yes, we always do. Like we did with cameras and Photoshop. This doesn’t mean we’ll figure out how to tell AI videos apart from those that are human-made (even if we read the definitive handbook). It means that we’ll accept that we’re losing the intrinsic reliability of our visual system as a means to reflect the world out there. Here’s something I wrote in February:

In the post-[Veo 2] world, here’s what we will have lost that we will promptly forget: The calmness of being able to outsource our trust.
As [Kevin] Kelly says, we came from chaos—we evolved to thrive in a “check first and trust later” kind of world. We’ll do fine this time there, too. But losing that calmness and adapting to chaos will be a slow, painful process. The flip trust was never without a cost.

We won the ability to make fantasy into reality. We lost the ability to tell them apart.

III. We must remember what was to judge what is

Let me go up a level of philosophical abstraction here for the curious minds who want to go further into the implications of these advances.

The history of technology teaches us two key lessons: first, we always outgrow our anxieties about novelty; second, we inevitably forget those anxieties ever existed. Having accepted these quirks of our otherwise reason-plagued species, I now find myself pondering a different question each time an innovation disrupts the status quo: Will what we gain ultimately outweigh what we lose?

I can’t, of course, possibly respond to this question. But I don’t care about the answer. The important thing is that I’m asking the question. Because we rarely do. Instead, we either slip into the new reality unwittingly, moving from anger to acceptance as we outgrow our anxieties, or we embrace shortsighted optimism: “Whatever comes next must be better than what we had. After all, when I look back, I don’t remember mourning the things we lost.”

The history of technology teaches us two key lessons but neglects to warn us about the one blind spot they create. Asking that question allows me to fix it: let’s not erase our memories of what was so that we can faithfully judge what is.

If I were to answer that unanswerable question, I’d say that we’ve taken enough wrong turns in recent decades (probably more than a few, but, you know, my memory isn’t that good) to leave me doubtful about answering positively. Perhaps we’re poised to reach some tipping point after which we will start to lose more than we gain.

I won’t name names; you know what stuff I’m talking about. Your bag of “things that went wrong” may not perfectly overlap with mine. It doesn’t matter because it’s not the bag’s size that’s making me doubtful, but the speed at which it grows. We’re putting items in way too fast. Traditions, customs, practices, and shared commons are slipping away too fast. This is but a third history lesson waiting to be taught in the schools of the future: there’s no anxiety like going at the speed of light. Marshall McLuhan warned about it 60 years ago. We didn’t listen.

For the first time, old people barely relate to young people. Adults no longer say: “Ah, they’re just kids, we did the same at their age.” Because we didn’t. Whatever it is that kids are doing these days (TikTok? Roblox? I don’t know), it didn’t exist 20 or 30 years ago. Teachers no longer say: “It’s normal if some students struggle a little.” Because almost every student is struggling. We no longer see ourselves when we look at our children. I don’t think that’s something we should be proud of having gained.

Progress is a great thing, but humans haven’t evolved to withstand this kind of fast-paced transformation. Those who say “all progress is good and we should be doing it faster” are coping with their inability to be happy today. So they’re forced to escape forward, the sooner the better, to either the singularity or ad nihilo.

I’m against degrowth as a means to fix this. I also don’t think a sweeping slowdown is the solution. But we should be conscientious about what we have, where we’re headed, and how fast we’re going. And ask ourselves, honestly, if we could do better by choosing our battles and leaving alone those in a state of eternal zugzwang.

As I judge the present moment, and weigh what we are to gain against what we are to lose, I sense that we are approaching the tipping point.

James

Dec 18

i suspect that the vast difference between sora and veo2 is from the video library used for pretraining. google's unfettered access to youtube probably has much to do with it. also, given that openai should now have access to iphone cameras, their library should grow by leaps and bounds and significantly improve sora. this means that physics is wrapped up in pre- and post-training on video content. on a different note, now that people are desensitized to ai videos on social media, they will become more skeptical and enthralled with what they see and hopefully disengage. this should be very damaging to tiktok and instagram.


1 reply by Alberto Romero

Jason Baldridge

Dec 20

Thanks for the thoughtful write up! I’m on the team that built Imagen 3 and Veo, and this new Veo 2 model is the most exciting new model I’ve had the opportunity to explore and evaluate since we built the Parti image generation model a couple years back.

An important component of our release of these models is that every image and video is tagged with SynthID so that they can be verified as AI generated.

https://deepmind.google/technologies/synthid/

We are also part of C2PA, a consortium that adds metadata to generated content.

https://c2pa.org/

These are part of a broader approach to how the benefits of these technologies can be brought to the world while mitigating some of the risks you mention.


The Algorithmic Bridge