I’ve been experimenting with AI image generation and learning how it works under the hood, and it strikes me as being worthy of an OP. Writing about it will solidify my understanding, and I figure I might as well share what I’ve learned with others. I’ll start with a high-level view and then fill in the details as the thread progresses.
Intuition would suggest that image generation begins with a blank canvas upon which the image is constructed. The truth is much stranger: we begin with an image that is pure noise, like TV static, but then we selectively subtract noise from it, leading in the end to a clean image.
I think of it as being analogous to the creation of a marble statue. A sculptor doesn’t create a marble statue by forming a bunch of marble into the desired shape, the way they would with a clay sculpture. They do it by chipping away at the marble until what’s left has the desired shape. In a sense, the statue was there all along; the sculptor just removed the marble surrounding it.
Diffusion models are like that. You start with pure noise, but you assume (or pretend) that there’s an image hidden underneath the noise, the way you might assume (or pretend) that there’s a statue hidden within the marble block. The trick for the model is to remove (by subtraction) precisely the right noise so that what’s left is the image, just as the sculptor must remove precisely the right marble so that what’s left is the statue.
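To make the chipping-away idea concrete, here's a toy Python sketch of the sampling loop. Everything in it is illustrative: `predict_noise` is a stand-in for the trained network, and the update rule is drastically simplified (real samplers such as DDPM use carefully derived per-step coefficients).

```python
import numpy as np

def predict_noise(noisy_image, step):
    # Placeholder for the trained neural network. The real model
    # would predict the noise component present at this step.
    return np.zeros_like(noisy_image)

image = np.random.randn(64, 64, 3)  # start from pure noise (TV static)
num_steps = 50
for step in reversed(range(num_steps)):
    noise_estimate = predict_noise(image, step)
    image -= noise_estimate / num_steps  # chip away a little of the noise
# After the final step, `image` is (ideally) a clean picture.
```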
How does the model determine the right noise to subtract? I’ll explain in detail later, but for now, suffice it to say that by being presented with tons of images during training, the model learns how to predict the noise that must be removed in order for the final image to be a good statistical match to the images in the training dataset.
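As a preview of those details, here's a hedged sketch of the training objective. The single mixing weight `t` is my simplification; real models use a tuned noise schedule. The idea is just: corrupt a real image with a known dose of noise, then score the model on how well it recovers that noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def training_loss(model, clean_image):
    t = rng.uniform(0.01, 0.99)                     # random noise level
    noise = rng.standard_normal(clean_image.shape)  # the noise we mix in
    noisy = np.sqrt(1 - t) * clean_image + np.sqrt(t) * noise
    predicted = model(noisy, t)                     # model's guess at the noise
    return np.mean((predicted - noise) ** 2)        # mean-squared error

# Dummy "model" so the sketch runs end to end:
loss = training_loss(lambda img, t: np.zeros_like(img),
                     clean_image=rng.standard_normal((64, 64, 3)))
```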
Details to follow in the comments.
A bit off topic, but this has a close parallel with one of the creation accounts in Genesis. In that version, God didn’t start with a void and create matter and energy to form into our world. Instead, God started with chaos – that is, all the necessary matter and energy was already present, and what God did was organize it. He didn’t create the stuff; He created the forms out of the stuff.
I’m curious whether the model starts with a target image and compares each iteration against it to see whether it’s getting closer to a match.
Flint:
Yeah, and evangelicals don’t like it when you point that out to them. I speak from experience.
No, because if you already had a target image, there would be little point in generating a new image that was a clone or an approximation of it. Plus you can generate images that aren’t plausibly an approximation of anything that appears in the training dataset. The only real target is the text of the image prompt, and there’s a lot of leeway in generating an image that fits the prompt.
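To sketch how the prompt steers generation (my own illustration, not Nano Banana’s actual architecture): the noise predictor gets an embedding of the prompt text as an extra input, so the noise it subtracts, and hence the final image, depends on the text rather than on any stored target image.

```python
import numpy as np

def embed_prompt(prompt):
    # Placeholder text encoder; real systems learn one (CLIP-style).
    seed = sum(prompt.encode())  # crude deterministic seed from the text
    return np.random.default_rng(seed).standard_normal(512)

def predict_noise(noisy_image, step, prompt_embedding):
    # Placeholder network. Because the prompt embedding is an input,
    # the predicted noise (and thus the final image) depends on the text.
    return np.zeros_like(noisy_image)

cond = embed_prompt("a frog in a study with a grandfather clock")
image = np.random.randn(64, 64, 3)
for step in reversed(range(50)):
    image -= predict_noise(image, step, cond) / 50
```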
An example:
I’m pretty sure there’s nothing even approximating that image in the training data.
I generated it with Google’s Nano Banana, using the following prompts:
Then:
Then:
Then:
After that last prompt, it produced the image you see above.
ETA: I love the weird little quirks that diffusion models exhibit, like putting a plaque with today’s date on the wall. I suppose the frog has to mount a new plaque every day.
ETA2: Also note that the grandfather clock reads 10:10. I discussed the 10:10 phenomenon in a few comments starting here.
Are you sceptical about the progress of “science”?
The pivotal promoter of “science”, Anthony Fauci, needed a redemption from a mentally unstable president to live?