OpenAI Has Just Killed Prompt Engineering With DALL-E 3
You can now get high-quality images that depict complex scenes by default
Change of plans. I have the third part of the OpenAI series ready to publish, but OpenAI just announced an unexpected release: DALL-E 3.
It’s significant for three reasons.
First, OpenAI has more than enough resources to combat Google and its other competitors on the language front while developing fantastic image models.
Second, the state of the art for image generation has taken another leap forward: if not in quality (we'll have to test how well it stands against Midjourney), then surely in the raw skill of the systems, i.e., what they can do at any given quality level.
Third, if OpenAI’s depiction of the model's capabilities is faithful, prompt engineering for AI art is done. I will comment a bit on the relevance of this one, but first, let's see what OpenAI says about its new release.
What DALL-E 3 can do
“DALL·E 3 understands significantly more nuance and detail than our previous systems, allowing you to easily translate your ideas into exceptionally accurate images.”
This has been one of the main shortcomings of models like Midjourney and Stable Diffusion (and the previous DALL-E versions, too): it was very hard to write a prompt that accurately conveyed your mental image of a scene to the model. OpenAI seems to have solved just that with DALL-E 3.
Here's an example from the blog post:
“DALL·E 3 can accurately represent a scene with specific objects and the relationships between them.”
Neither Midjourney nor Stable Diffusion allows you to do this. Solitary characters and objects are easy, and the quality is high, but scenes where different objects have to follow specific relationships described in the prompt? That was an unsolved challenge.
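To make the contrast concrete: with a model that actually parses relationships, the "prompt" is just a plain description of the scene. Here is a minimal sketch using the OpenAI Python SDK; the dall-e-3 model identifier and API availability are assumptions on my part, since at announcement time OpenAI had only confirmed the ChatGPT integration.

```python
# Minimal sketch: passing a plain-language scene description to an image model.
# Assumes the OpenAI Python SDK (>=1.0) and that the "dall-e-3" model is
# available to your account via the Images API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# No keyword soup, no style tags: just the scene and the relationships in it.
prompt = (
    "An oil painting of a basketball player mid-dunk, the ball still in his "
    "right hand, a referee watching from the left, and the crowd blurred in "
    "the background."
)

response = client.images.generate(
    model="dall-e-3",   # assumed model identifier
    prompt=prompt,
    size="1024x1024",
    n=1,
)

print(response.data[0].url)  # URL of the generated image
```

The point is what is missing: no negative prompts, no weighting syntax, no style boilerplate; the spatial relationships are stated once, in plain English.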
Sam Altman predicted a while ago that prompt engineering was a temporary phase of generative AI. I agreed back then but argued that it could take a lot of time to get the models to the point where we wouldn't need to translate our ideas into a language they could understand. It seems that milestone, at least for image generation models, has been achieved.
This means that the entry barriers that somewhat “gatekept” the ability to create amazing images with AI are being demolished fast. Visual creativity is being democratized.
What this means for traditional artists and the creative community is a question we should discuss. On the one hand, it's wonderful to be able to create great art without deep prompt engineering expertise (for instance, the Stable Diffusion + ControlNet workflow currently trending on social media isn't straightforward at all); on the other, I can't help but feel that we, humanity, are losing something every time we take a step in this direction.
What do you think?
Some other details about DALL-E 3
On top of that, text in images is no longer an issue (although other models had already solved that one), and ditto for six- and seven-fingered hands.
DALL-E 3 is in research preview, but Plus and Enterprise users will get access in October. It will later be available to everyone else through OpenAI Labs. As with DALL-E 2, the images are the property of the creator and can be printed and commercialized.
OpenAI has gone a step further to connect DALL-E 3 with ChatGPT so that the latter can act as a creative partner. Many people have already experimented with this, so it's not a new feature, but OpenAI has significantly reduced the friction in going from an idea to an image.
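The announced integration lives inside ChatGPT itself, but for readers who want a feel for that idea-to-image loop, here is a rough approximation of it via the API; the model identifiers ("gpt-4", "dall-e-3") and API availability are assumptions, not something OpenAI has confirmed for this release.

```python
# Rough approximation of the ChatGPT -> DALL-E 3 creative-partner loop.
# The real feature runs inside the ChatGPT app; model names and API access
# here are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()

idea = "a hedgehog who dreams of becoming an astronaut"

# Step 1: let a chat model expand the rough idea into a detailed scene prompt.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Expand the user's idea into one vivid, detailed image prompt."},
        {"role": "user", "content": idea},
    ],
)
detailed_prompt = chat.choices[0].message.content

# Step 2: hand the expanded prompt to the image model.
image = client.images.generate(model="dall-e-3", prompt=detailed_prompt, n=1)
print(image.data[0].url)
```

That two-step loop is what the native integration collapses into a single conversation: you describe the idea, ChatGPT does the prompt-writing for you.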
OpenAI has also taken an important step to protect the livelihood of living artists, which helps it find common ground with them (and avoid future lawsuits): DALL·E 3 will decline requests to copy their styles (style mimicry is arguably the creative community's strongest criticism), and artists will be able to “opt their images out from training of our future image generation models.”
There you have it, DALL-E 3—a nice surprise for the middle of the week.
This genuinely feels like a major paradigm shift. Even if the image quality isn't quite at the same level as Midjourney, opening the floodgates so that any average person can conjure up whatever they can think of is massive.
I'm also happy to see that OpenAI are taking steps to address some of the ethical issues with having these models trained on the work of artists who aren't compensated. (It's in the "Creative Control" section of the "A focus on safety" chapter: they now claim ChatGPT will refuse to generate images in the style of a living artist, and they let artists proactively opt out of models training on their work.) Whether this goes far enough is of course another discussion.
But I'm probably not as fatalistic as you seem to be about us losing humanity when it comes to this specific development. I see many wonderful use cases where passionate authors can play around with visualizing the scenes they describe in an extremely nuanced way. I see kids exploring magical worlds of their imagination (the DALL-E 3 demo video about the hedgehog is along those lines). I see the average person, no longer constrained by their technical/artistic ability, giving outlet to amazing creations lying dormant in their minds. And so on.
Sure, as with most generative AI, we'll see battles over ethics, copyright, etc., and we'll want some regulations in place to prevent the worst abuses. But I definitely see the potential for this to be a huge booster of creativity, where AI and people work in tandem to create something new.
I doubt anyone outside of OpenAI knows the answer to this, but what’s your take on the phrase “available to GPT Plus consumers”? Does that mean it's included in the subscription price, or that the feature can be added on for a surcharge?