Stable Diffusion 2 Is Not What Users Expected—Or Wanted
The AI community doesn't like it at all, but are they right?
I don't usually cover model announcements in TAB’s main articles (I do so in my Sunday column, “What You May Have Missed”). I make exceptions when a release has far-reaching downstream implications (as with Galactica and BLOOM) or is especially interesting or useful.
Today’s topic probably checks both boxes: Stability.ai, the king of open-source generative AI, has announced Stable Diffusion 2.
The new version of Stable Diffusion brings key improvements and updates. In a different world, every app, feature, and program that uses Stable Diffusion would very likely adopt the new version right away.
However, that’s not going to happen. Stable Diffusion 2, despite its superior technical quality, is considered by many (if not all) users to be a step back.
In this article, I'm going to describe—as simply as possible—the main features of Stable Diffusion 2, how it compares to 1.x versions, why people think it's a regression, and my take on all this.
To be clear, this isn't just about Stable Diffusion 2. What's happening goes beyond Stability.ai—it’s a sign of what’s coming and how generative AI is about to clash with the real world.
Stable Diffusion 2: models and features
Let’s begin with the objective part of the story.
This section is slightly technical (although not difficult), so feel free to skim through it (still worth reading if you plan to use the model).
Stable Diffusion 2 is the generic name of an entire family of models that stem from a common baseline: Stable Diffusion 2.0-base (SD 2.0-base), a raw text-to-image model.
The baseline model is trained on an aesthetic subset of the open dataset LAION-5B (keep this in mind; it will be important later) and generates 512x512 images.
On top of SD 2.0-base, Stability.ai trained a few more models with specific features (examples below).
SD 2.0-v is also a text-to-image model but defaults to a higher resolution (768x768).
Depth2img is a depth-to-image model that builds on the classic img2img version to improve the model’s ability to preserve structure and coherence.
The upscaler model takes the outputs from the others and enhances the resolution 4x (e.g. from 512x512 to 2048x2048).
Finally, a text-guided inpainting model provides the tools to semantically replace parts of the original image (as you can do with DALL·E).
To keep the models broadly accessible, Stability.ai optimized them to run on a single GPU. As they explain in the blog post: “we wanted to make it accessible to as many people as possible from the very start.”
Like Stable Diffusion 1.x, the new version falls under permissive licenses. The code is MIT licensed (on GitHub) and the weights (on Hugging Face) follow the CreativeML Open RAIL++-M License.
Stability.ai is also releasing the models on the API platform (for developers) and DreamStudio (for users).
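To make the single-GPU point above concrete, here’s a minimal sketch of how one might load and run the new family with Hugging Face’s diffusers library. The model IDs, the half-precision setting, and the attention-slicing call are my assumptions about a sensible setup at release time, not official instructions from Stability.ai.

```python
# A minimal sketch (not official Stability.ai code): running Stable Diffusion 2
# on a single GPU with Hugging Face's diffusers library. Model IDs are assumed
# to match the checkpoints published on the Hugging Face Hub.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionUpscalePipeline

# Text-to-image with the 512x512 base model; fp16 plus attention slicing
# keep memory use within a single consumer GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to("cuda")
pipe.enable_attention_slicing()

image = pipe("a castle on a cliff at sunset, detailed digital painting").images[0]
image.save("castle_512.png")

# The 4x upscaler takes that output and enhances it (512x512 -> 2048x2048).
# (Memory-hungry at this size; the upscaler is happiest with smaller inputs
# or a larger GPU.)
upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")
upscaled = upscaler(prompt="a castle on a cliff at sunset", image=image).images[0]
upscaled.save("castle_2048.png")
```

The depth2img and inpainting variants follow the same pattern through their own pipelines in the same library.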
The single most relevant change to SD 2: The OpenCLIP encoder
Now to the more consequential news.
Stable Diffusion 2 is architecturally similar to its predecessor, only better. No surprises there. However, Stability.ai has drastically changed the nature of one particular component: the text/image encoder (the inner model that transforms text-image pairs into vectors).
All publicly available text-to-image models—including DALL·E and Midjourney—use OpenAI’s CLIP as encoder.
It’s not an exaggeration to say that CLIP is the most influential model in the 2022 wave of generative AI. Without OpenAI or CLIP, it wouldn’t have taken place at all.
That puts into perspective Stability.ai’s decision, breaking a two-year standard practice, to replace OpenAI’s CLIP in Stable Diffusion 2 with a new encoder.
LAION, with support from Stability.ai, has trained OpenCLIP-ViT/H (OpenCLIP), which reportedly sets a new state-of-the-art performance: “[it] greatly improves the quality of the generated images compared to earlier V1 releases.”
Stable Diffusion 2 is the first—and only—model to integrate OpenCLIP instead of CLIP.
Why is this noteworthy? Because OpenCLIP isn’t just open-source, like the original CLIP—it was trained on a public dataset (LAION-5B).
As Emad Mostaque (Stability.ai CEO) explains, “[CLIP] was great, but nobody knew what was in it.”
The fact that OpenCLIP is trained on a publicly available dataset is significant (although not necessarily good) because now devs and users can know what it encodes (i.e. what it has learned and how).
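To give a sense of what the encoder actually does (mapping images and captions into a shared vector space, where “knowing” something means associating a name with the right images), here’s a rough sketch that queries OpenCLIP directly through LAION’s open_clip package. The model name and pretrained tag are my assumptions about the released ViT-H checkpoint.

```python
# A rough sketch: querying the OpenCLIP ViT-H encoder with LAION's open_clip
# package (checkpoint name assumed, not taken from the SD 2 release notes).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")

image = preprocess(Image.open("castle_512.png")).unsqueeze(0)
texts = tokenizer(["a castle on a cliff at sunset", "a portrait of a cat"])

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(texts)
    # Normalize and compare: the caption with the highest cosine similarity
    # is the one the encoder "associates" with the image.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    similarity = (img_emb @ txt_emb.T).squeeze(0)

print(similarity)  # two scores, one per caption
```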
This has two immediate implications.
First, because OpenCLIP's and CLIP's training data are different, the things Stable Diffusion 2 “knows” are not the same as what Stable Diffusion 1, DALL·E, and Midjourney “know”.
Mostaque explains that the prompt techniques and heuristics that worked for earlier versions of Stable Diffusion may not work equally well for the new models: “[Stable Diffusion] V2 prompts different and will take a while for folk to get used to.”
However, even if Stable Diffusion 2 has learned things differently—and it’ll force users to rethink their prompt skills—it has learned them better, he explains (I'd say users have the final word here).
Second, because now we can find out exactly whose work is present in the dataset, Stability.ai could implement opt-in/opt-out features for artists in future versions (I don’t know if the company will do this, but Mostaque himself acknowledged this as an issue).
This means Stable Diffusion 2 is more respectful of the artists' work present in the training data. A notable improvement over Midjourney and DALL·E.
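As a concrete illustration of what “knowing whose work is in the dataset” looks like in practice, here’s a hedged sketch of how one might scan a shard of the public LAION metadata for captions mentioning a given artist. The filename and the URL/TEXT column names are assumptions about how the metadata parquet files are laid out, not a verified recipe.

```python
# A hedged sketch: counting how often an artist's name appears in the captions
# of one LAION metadata shard (filename and column names are assumptions).
import pandas as pd

ARTIST = "greg rutkowski"  # hypothetical query

shard = pd.read_parquet("laion2B-en-part-00000.parquet")  # assumed shard filename
captions = shard["TEXT"].fillna("").str.lower()
hits = shard[captions.str.contains(ARTIST, regex=False)]

print(f"{len(hits)} captions in this shard mention '{ARTIST}'")
print(hits[["URL", "TEXT"]].head())
```

This kind of lookup is what would make per-artist opt-in/opt-out lists technically feasible in the first place.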
Why the AI community is angry
But, if we dig deeper, we find a very different view of this.
As it turns out, Stability.ai trained OpenCLIP (and the models) on a different subset of LAION than users would’ve wanted.
They removed most NSFW content and celebrity images, and, what angered people the most, they completely eliminated famous (modern) artists’ names from the labels (not their work, though).
This has serious (although not necessarily bad) second-order implications for Stable Diffusion 2 and the field of generative AI at large.
On the one hand, Stability.ai is clearly trying to comply with copyright laws by reducing its legally dubious practices, i.e. scraping the work of living artists from the internet to train their models without attribution, consent, or compensation.
On the other hand, Stable Diffusion users are reasonably pissed off because much of what they could generate before with the only high-quality open-source model that exists (Stable Diffusion) is impossible now.
Mostaque said prompts work differently, but the new implicit restrictions won't be solved with better prompt engineering.
For instance, you can no longer prompt “in the style of Greg Rutkowski” and get an epic medieval scene with magic and dragons, because Stable Diffusion 2 no longer recognizes “Greg Rutkowski” to be anything in particular.
That’s gone. And with him, every other living (or late) artist you were using. Their artworks are still present in the data, but the encoder can no longer associate the images with the names.
I acknowledge Stable Diffusion 2 is objectively much more limited than its predecessor in its ability to make art (Midjourney v4, for instance, is much better quality-wise).
Can the AI community bypass these limitations by tweaking OpenCLIP? Although Mostaque suggested this possibility on the Discord server, it’s not clear how they could do that (in the end, it’s Stability.ai that has 5408 A100s), and fine-tuning the encoder is costly.
A regression for generative AI?
However, despite the ubiquitous disappointment among users, Stability.ai had a good reason to do this—if you live in society, you have to adapt to the boundaries society sets.
You shouldn't simply stomp on others (artists whose work is in the data feel that way) just because technology allows you to. And if you say that's what freedom means, let me tell you that, from that perspective, today's “freedom” is tomorrow's peril.
Regulation evolves slower than technology, true, but it eventually catches up. Arguing that “the genie is out of the bottle” or “progress is unstoppable” isn’t going to suffice when the law is set.
Right now, there’s a lawsuit ongoing against Microsoft, GitHub, and OpenAI for scraping the web to train Copilot (Codex). If it ends up favoring open-source devs, it could radically redefine the generative AI landscape.
What Stability.ai did to artists is no different than what those companies did to coders. They took, without permission, the work of thousands of people to create AI technology that anyone can now use to generate copycats of the artists’ creations.
That's most likely why the company has done this. They’re taking measures to avoid potential lawsuits (it's hard to argue they're protecting artists because, if that were the case, they'd have done this from the beginning).
But, regardless of their motives, the end result is what matters: AI people have their tech, and artists are more protected.
If the AI community now claims that Stable Diffusion is worthless because “in the style of…” prompts don’t work (even though the artists’ work is still present in the data), maybe the only reasonable conclusion is that artists were right all along: their explicit presence in the data was bearing most of the weight of creating great AI art.
Final thoughts
As I argued a few months ago, we should have open-minded and respectful conversations about this.
Sadly—and expectedly—it hasn’t happened. AI people have largely dismissed artists’ complaints and petitions. And artists, in most cases, weren’t open to adapting to new developments and sometimes even turned aggressive toward the AI community.
None of that is helpful.
I went into the r/StableDiffusion subreddit to get a sense of the general sentiment and it matches what I’m telling you here. The AI community is seriously at odds with Stability.ai’s decisions.
Calling Stable Diffusion 2 “a step back” or “a regression” is among the softest comments.
Only one comment captured what I thought while reading all that anger and frustration:
“Clearly no one here thinks that copying an artist work without permission is wrong. i find all messages to suggest that copying the style of people is somehow a step back. I am no artist, but just imagine that someone copies your work, using a tool developed by someone, and leaves you unemployed, your work being undoubtedly unique. Would this be something anyone considers fairly?”
I think it’s paramount to consider “the other side” (whether you're an artist, an AI user or both) when thinking about Stable Diffusion 2 in particular and generative AI in general.
Users are mad at Stability.ai now (reasonably in some respects, unreasonably in others), but they shouldn’t forget that when regulation arrives, and it will, Midjourney and OpenAI (and Microsoft and Google) will also have to adapt and comply.
This goes way beyond any particular company. It's a matter of the world readapting to new technologies without losing sight of the rights of people (as a side note, I may not agree with the specifics of AI regulation, but I strongly believe regulation shouldn’t be nonexistent).
This accountability gap that generative AI companies and users have been enjoying (some may call it freedom) is coming to an end.
And, in my opinion, it's better this way.
Regarding regulation, let's be clear that regulation only applies to those willing to follow the law. We should never assume that regulation can make AI safe. Laws in general are kind of like the lock on your front door. The lock keeps your nosy neighbors out of your house, but it's worthless against anyone willing to break a window.
Great writeup, but I should note that Midjourney is 10000% training on top-tier artists, and with stunning results.
If "Respecting artists" is Stability's main motive, then we must ask: Why isn't Midjourney pressured to do the same?
I see several reasons.
1. Midjourney is completely dependent on subscriber income, so it has to please its paying retail consumers. From my observations, it does not take outside investment. Eliminating the ability to do high-quality art would be a death blow for the business, so they'd rather take the legal risk.
Stability, on the other hand, is currently running on VC money and, in the future, on offering model-customization services to other companies. Therefore it has to please investors and other reputation-conscious businesses, which makes it much more averse to legal and reputational risks.
2. Midjourney is closed source AND heavily community moderated, making it much less of a target for regulators. Stability being open source makes it a terror to politicians, who fear such a model being flung around for infinite deepfakes with no way to 'cease-and-desist' them, so they must target Stability directly.
Incidentally, Midjourney can implement prompt-level filtering for NSFW content, so it feels free to train on NSFW data. Stability, being open source, cannot possibly moderate the prompts, so it has to do training-level filtering, which has a much worse impact on end-image quality.
The drama of SD 2.0 is not merely about whether artist data should be included or not. It is also about whether open or closed source models will dominate the market. The previous hope was that everyone could get free access to good open-source models that can compete with closed-source ones.
Now, it appears that closed-source business models will dominate because they are less sensitive to regulatory pressure and censorship.
Emad states that SD 2.0 will serve as a clean base for fine-tuning the model for more specific uses (adding back art and NSFW), but that's an expensive training process that only companies can afford. NovelAI is the most famous example of SD fine-tuning: they are closed source and charge subscriptions to access the model (their version 1 model got leaked, but their version 2 won't).