27 Comments

This genuinely feels like a major paradigm shift. Even if the image quality isn't quite at the same level as Midjourney, opening the flood gates to any average person being able to conjure up whatever they can think of is massive.

I'm also happy to see that OpenAI are taking steps to address some of the ethical issues with having these models trained on the work of artists who aren't compensated. (It's in the "Creative Control" section of the "A focus on safety" chapter.) They now claim ChatGPT will refuse to generate images in the style of a living artist, and they let artists proactively opt out of models training on their work. Whether this goes far enough is of course another discussion.

But I'm probably not as fatalistic as you seem to be about us losing humanity when it comes to this specific development. I see many wonderful use cases where passionate authors can play around with visualizing the scenes they describe in an extremely nuanced way. I see kids exploring magical worlds of their imagination (the DALL-E 3 demo video about the hedgehog is along those lines). I see the average person no longer constrained by their technical / artistic ability giving outlet to amazing creations lying dormant in their minds. And so on.

Sure, as with most generative AI, we'll see battles over ethics, copyright, etc., and we'll want some regulations in place to prevent the worst abuses. But I definitely see the potential for this to be a huge booster of creativity, where AI and people work in tandem to create something new.


As the resident Doomerist :-) here's a perspective on the "losing humanity" issue.

You correctly point to "wonderful use cases". Ok, so where do such wonderful uses lead, what comes next, where is this headed?

We can imagine this kind of image control precision making its way into video at some point. From there such control may leap off the screen into 3D space, much as we see Apple trying to do with its Vision Pro system. Over time the imagery in 3D space becomes more realistic, more controllable, more immersive, more interactive, etc. That is, we're on a path towards the creation of ever more compelling imaginary realities.

Should I obtain the ability to create compelling imaginary 3D realities to my own personal preference, the question would seem to become... What need do I have of you?

The characters in the imaginary reality give me exactly whatever I want, all day long, every day. They're always willing, they never get tired, they are my happy slaves.

With you my fellow human I have to negotiate and compromise, and even then I rarely if ever get exactly what I want, but rather some fraction of what I want. You're not always willing, you do get tired, and you certainly have no interest in being my happy slave.

We're celebrating each new step in this direction, as we march steadily towards a future where we will increasingly lose interest in each other. As I've written here and elsewhere a number of times, this process of dehumanizing detachment is already well underway, as so many of us (me included) choose to spend so much of our time with disembodied strangers on the Net instead of face-to-face, flesh-and-blood humans in the real world.

Today I choose you disembodied strangers over my old friends because you will feed my interest in AI, and they will not. Tomorrow I'll choose the machines over you, because the machines will do whatever I tell them, and you will not.

Power corrupts, and absolute power corrupts absolutely.


So well said Phil. So well said.


Thank you. But too wordy, as usual. So we could try this...

What bonds human beings together is mutual need.


Phil "Gloom-n-Doom" Tanny strikes again!

I think that's a fair alternative take in terms of where society might broadly be headed.

I'm not convinced that "having a text-to-image model that understands words better" is necessarily the most blatant example of it, though.

We already have the technology for people who might wish to isolate themselves from the rest of humanity and live inside a sick fantasy world. Deepfakes, text-to-image models without NSFW filters, VR, etc. So there's plenty to satisfy any "happy slave" tendencies already.

But I'm not as cynical about humanity as a whole. There'll always be those who use technology to satisfy base urges (Stable Diffusion porn, ChatGPT scams and spam, etc.), and there will be others who use it as an outlet for creativity and finding a sense of belonging (I encourage you to visit some of the Midjourney forums and hear stories of people sharing the power of this tech to heal and create genuine connections with others).

Technology in itself is neutral. It's us who imbue it with either negative or positive traits, depending on who we are.

If I'm to buy the argument that all of us are slowly heading to a future where we'll want nothing to do with each other and live in make-believe worlds of our own that satisfy our most depraved urges, then we're doomed either way. AI or not.


Hi Daniel!

Your honor, speaking in defense of the Gloomy Doomerist, I'm only gloomy, negative, a downer, etc. to the degree my analysis is inaccurate. And that degree is unknown at this time, given that none of us can know with any certainty what the future will bring.

I agree that a search for a sense of belonging is likely an eternal human desire. The question may be, to belong with what? Our fellow humans are not the only thing we can bond with.

Today we choose our friends based on how well they meet our needs. People who are inconvenient to our needs tend to get left behind, and those that best meet our needs are more closely embraced. Point being, we don't bond with people just because they are human, but because of how well they serve our interests. All I'm doing is projecting this well-established principle into a future when there are additional sources of satisfaction competing for our attention.

Isolating oneself from other humans is not automatically "living in a sick fantasy world". As an example, I'll be spending almost all of the next six months in the North Florida woods, by myself, and I won't miss you guys at all :-). Over the last 20+ years I've learned how to bond with the woods, and it feels way healthier than being on the Internet. It's the bonding itself that matters more than what one bonds with.

Point being, some people will use the digital fantasy machine to have uplifting positive experiences. But they may be experiences that are less human-centric than is commonly held to be "normal".

Are we doomed? The answer to such a question may depend on at what level we are asking. We're all going to die, so at that level we are doomed. Every human civilization ever created has eventually collapsed, so we are probably doomed at that level too.

There is yet another level to consider. Where does being a human alive on this Earth for a period of time fit in to the much larger picture of reality? We know nothing of where we came from, or where we are going. This is not really the place for that conversation, but if it were, you would find me way more optimistic and positive than the typical group consensus assumption that death=doom.


If we look at it from that perspective, then one can argue that there's nothing fundamentally different between enjoying alone time in the woods and enjoying a virtual life inside a fantasy world of your own creation. As long as both give you an uplifting experience that doesn't negatively affect others.

Personally, I'd choose the woods. But if someone prefers the alternative? I'm just happy that AI might give them that option.

Have a fantastic time in North Florida. It actually sounds incredible.


To continue to serve as attorney for the doomer case...

The movement into the AI fantasy world negatively affects others in that it weakens the bonds between human beings, thus undermining marriages, friendships, society at large, etc. You know, in a couple of years your kids won't talk to you at the dinner table because they'll be too busy looking at their phones. Future AI-based fantasy realities will just amplify and extend what's already happening. What's been happening for thousands of years, really.

It's a very complicated picture obviously, and not all one thing or another.

Sep 20, 2023 · Liked by Alberto Romero

I doubt anyone outside of OpenAI knows the answer to this, but what’s your take on the phrase “available to GPT Plus consumers”? Does that mean included for their subscription price or does it mean the ability to add on this feature for a surcharge?

author

I interpret it as meaning "included in the price for ChatGPT Plus." And I hope that's the case. If DALL-E 3 is as good as they say I'll probably subscribe to Plus again (I tested GPT-4 for a while but after realizing I had no use case worth $20/month, I unsubscribed).


For what it's worth, prior communication about "ChatGPT Plus customers" meant exactly this: included in the price, like with the Code Interpreter (now "Advanced Data Analysis").


Opting out is a cop-out. If OpenAI really wants to be ethical and do the right thing by artists, they would use the opt-in model. In other words, no scraping the Internet for images of art or photography without explicit permission of the artists.

author

Right. The thing is, they want to be ethical as long as it doesn't clash with the business. Which can be criticized but what do we expect exactly? That cliche "blame the game, not the player" applies here (FWIW I think we should all strive to be slightly more ethical than the minimum that "the game" forces us to be - the world would be a better place).


Very interesting. Do you know if it allows training on your own photos?

author

I don't think OpenAI has said anything about fine-tuning options yet. If they allow that with DALL-E 2 (I don't know), they will likely allow it with DALL-E 3, and vice versa.


Except we haven't had a chance to actually try DALL-E 3 yet. I'll judge it when I can actually run tests.

author

Yep, we'll have to check its limitations - where it fails and where it succeeds. Even if it has increased language capabilities those capabilities will have limits somewhere. Our job will be finding them.

Sep 21, 2023 · Liked by Alberto Romero

> ... I can't help but feel that we—we, humanity—are losing something every time we take a step in this direction.
>
> What do you think?

Since you ask, my own feeling is that we would be losing something way more important if we *didn't* "take a step in this direction"!

author

Also true, definitely. These things tend to be more complex than black and white.


I'm kind of shocked. It happened pretty fast.

author

Actually, DALL-E 2 came out almost a year and a half ago, a rather long time by AI standards. I think this is a sign that OpenAI is definitely going all in on multimodality (probably to fight Google's Gemini).


I'll definitely dive in as soon as I can and let you know what I think.


Creating scenes and images to very exacting specifications is something that we should have been conditioned to thinking about ever since seeing it fictionally accomplished on the holodeck as it was seen in the late 1980s on Star Trek: TNG. I'm not altogether sure that I can manufacture indignation for having "lost something" for being able to do it now and not having been able to do it before in reality. Perhaps I'm missing something. Could someone explain what the big deal is?


I think artists would benefit from including their art in the training dataset if they made some royalty agreements.


OpenAI is planning to release DALL-E 3 with ChatGPT in early October. I feel this is going to be similar to the image generation within the Bing AI chat, but better. I observed that text got better in DALL-E 3 (for instance, the 21 in your cover image and of course this -> https://images.openai.com/blob/0303dc78-1b1c-4bbe-a24f-cb5f0ac95565/avocado-square.png?trim=0,0,0,0&width=2000).

That's all cool, but one thing puzzles me. On their website, they said: "As with DALL·E 2, the images you create with DALL·E 3 ... don't need our permission to reprint, sell or merchandise them."

The monetization part of AI-generated images, which by nature are made by drawing creative elements from thousands of other images, is still complicated.

Sure, you can disallow GPTBot from crawling your work. However, that only works when the bot hasn't already crawled your work. Even the site says, "Creators can now also opt their images out from training of our *future image generation* models."
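For reference, the opt-out OpenAI documents for GPTBot is an ordinary robots.txt rule placed at the site root. As noted, this only blocks future crawling; it does nothing about images already collected:

```
User-agent: GPTBot
Disallow: /
```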

What should be the way of managing the monetization of AI-generated artworks? In the worst cases, it could feel like reselling Picasso's work by distorting the canvas and sprinkling it with filters and Van Gogh's colors.


Well, currently there's nothing stopping you from selling, but I believe the copyright office denied copyright for a particular image that was generated just from prompting. So that is already definitely a restriction compared to creating an image yourself.

I feel like it is hard to get a handle on what this kind of training really means though in terms of the original creators. Obviously it couldn't do what it does without training on the original works, and in some sense is competing with those same people, but I also feel like people tend to get the wrong idea of how it works. Like saying it collages a bunch of images together or filters/distorts images. While we can't say it is learning in a fully human sense, to me it feels more like that than it does a mashup.

No idea how DALL-E 3 works internally, but a common way image generators are trained (from my understanding) is a two-step process. First an image classifier is trained (or an existing one like CLIP is used). When you give that model an image, it spits out a bunch of labels with confidence percentages. So if you gave it an oil painting of a red car, maybe it outputs "painting: 85%, red: 95%, car: 90%, truck: 40%, lizard: 0.1%, etc."

Now the image model is trained. An image is given to the classifier, which spits out the labels and confidence levels. Then some amount of noise is added to the image. Now that noisy image and the labels are given to the image model. It then tries to reduce the noise according to the labels. So in our above example, it tries to make it more "painting" and more "red", but not more "lizard". To do this, it needs to look for those concepts in the existing image (like how a person sees animals in clouds) and add additional details. A resulting image is created and now that new image is given to the classifier. The labels are compared to the original labels to see how good of a job it did. If the new image is now 60% "red", then it gets feedback that it went in the wrong direction.

You can see from this process that the AI never actually sees the original image before it was made noisy, and its result isn't compared directly to the original either, but to the combination of labels involved. And it is forced to really lean into visual concepts in general, because there is no room to store all the input images.
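To make the feedback idea above concrete, here's a toy Python sketch. Everything in it is made up for illustration: the "classifier" just reads the mean of the red channel, and the "denoiser" simply blends the noisy image back toward the clean one by a strength factor. It's not how DALL-E 3 is actually trained, but it shows the loop: label the clean image, add noise, denoise, re-label, and score how far the new labels drifted from the targets.

```python
import numpy as np

rng = np.random.default_rng(0)

def classify(image):
    # Toy stand-in for a CLIP-like classifier: the "red" confidence is
    # just the mean of the red channel. A real classifier is a trained network.
    return {"red": float(image[..., 0].mean())}

def label_error(image, noise, denoiser_strength):
    """One pass of the label-guided denoising loop described above."""
    target = classify(image)                # 1. label the clean image
    noisy = image + noise                   # 2. add noise
    # 3. the "model" removes some of the noise; here we fake learning by
    #    blending back toward the clean image by denoiser_strength
    denoised = noisy + denoiser_strength * (image - noisy)
    # 4. feedback: re-classify and compare against the target labels
    return abs(classify(denoised)["red"] - target["red"])

image = rng.random((8, 8, 3))               # an 8x8 RGB "image"
noise = rng.normal(0.0, 0.2, image.shape)   # the same noise for both passes

weak = label_error(image, noise, denoiser_strength=0.1)
strong = label_error(image, noise, denoiser_strength=0.9)

# A denoiser that removes more noise leaves labels closer to the targets,
# so it receives a smaller error signal
assert strong < weak
```

The key point the sketch captures is the last one in the paragraph above: the error is computed from labels, never from a pixel-by-pixel comparison with the original image.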

So internally it is going to end up with some model of what "banana" means in terms of shape, texture, colors. And a style like "drawing" is likely built up in terms of what that means when applied to various kinds of low level shapes and lighting. There wouldn't be space for every permutation of every object (all angles, all photo/oil/watercolor/drawing/pastel styles, all variations of size and color) to be stored separately. Similarly, it is much smaller and more general to define "spiral" as a particular mathematical pattern and then apply colors and textures (or even objects) to that. Custom Stable Diffusion models on places like Civitai are only 2GB in size and produce very high quality. Think about how small that actually is, considering the flexibility.

And these abstractions are what let it combine different concepts together. Like Midjourney can make a vacuum cleaner with a combination of the styles of HR Giger and Lisa Frank. As far as I know, neither actually painted a vacuum, and there wouldn't be a lot of references of their styles combined. But if it is trying to progressively find shapes in noise and add different kinds of details at the same time, it works out.
