AI image generation via diffusion models

I’ve been experimenting with AI image generation and learning how it works under the hood, and it strikes me as being worthy of an OP. Writing about it will solidify my understanding, and I figure I might as well share what I’ve learned with others. I’ll start with a high-level view and then fill in the details as the thread progresses.

Intuition would suggest that image generation begins with a blank canvas upon which the image is constructed. The truth is much stranger: we begin with an image that is pure noise, like TV static, but then we selectively subtract noise from it, leading in the end to a clean image.

I think of it as being analogous to the creation of a marble statue. A sculptor doesn’t create a marble statue by forming a bunch of marble into the desired shape, the way they would with a clay sculpture. They do it by chipping away at the marble until what’s left has the desired shape. In a sense, the statue was there all along; the sculptor just removed the marble surrounding it.

Diffusion models are like that. You start with pure noise, but you assume (or pretend) that there’s an image hidden underneath the noise, the way you might assume (or pretend) that there’s a statue hidden within the marble block. The trick for the model is to remove (by subtraction) precisely the right noise so that what’s left is the image, just as the sculptor must remove precisely the right marble so that what’s left is the statue.

How does the model determine the right noise to subtract? I’ll explain in detail later, but for now, suffice it to say that by being presented with tons of images during training, the model learns how to predict the noise that must be removed in order for the final image to be a good statistical match to the images in the training dataset.

Details to follow in the comments.

 

31 thoughts on “AI image generation via diffusion models”

  1. Intuition would suggest that image generation begins with a blank canvas upon which the image is constructed. The truth is much stranger: we begin with an image that is pure noise, like TV static, but then we selectively subtract noise from it, leading in the end to a clean image.

    A bit off topic, but this has a close parallel with one of the books of Genesis. In that version, God didn’t start with a void and create matter and energy to form into our world. Instead, God started with chaos – that is, all the necessary matter and energy was already present, and what God did was organize it. He didn’t create the stuff, He created the forms out of the stuff.

    I’ll be curious whether the model starts with a target image and can compare each iteration to see if it is closer to a match.

  2. Flint:

    A bit off topic, but this has a close parallel with one of the books of Genesis. In that version, God didn’t start with a void and create matter and energy to form into our world. Instead, God started with chaos – that is, all the necessary matter and energy was already present, and what God did was organize it. He didn’t create the stuff, He created the forms out of the stuff.

    Yeah, and evangelicals don’t like it when you point that out to them. I speak from experience.

    I’ll be curious whether the model starts with a target image and can compare each iteration to see if it is closer to a match.

    No, because if you already had a target image, there would be little point in generating a new image that was a clone or an approximation of it. Plus you can generate images that aren’t plausibly an approximation of anything that appears in the training dataset. The only real target is the text of the image prompt, and there’s a lot of leeway in generating an image that fits the prompt.

  3. An example:

    Gemini Generated Image jw2s5wjw2s5wjw2s (Custom)

    I’m pretty sure there’s nothing even approximating that image in the training data.

    I generated it with Google’s Nano Banana, using the following prompts:

    Image of a cyborg mouse, reclining with a lit cigar in one hand and a glass of cognac in the other, as a frog servant approaches bearing a hunk of swiss cheese on a plate.

    Then:

    Make the frog taller and have the frog carry the plate with the cheese on it.

    Then:

    Eliminate the second plate and cheese on the right. The only plate with cheese should be the one that the frog is carrying.

    Then:

    I still want the frog to carry a plate with cheese, but there should be no other plates with cheese in the image.

    After that last prompt, it produced the image you see above.

    ETA: I love the weird little quirks that diffusion models exhibit, like creating a plaque with today’s date on the wall. I suppose the frog has to mount a new plaque every day.

    ETA2: Also note that the grandfather clock reads 10:10. I discussed the 10:10 phenomenon in a few comments starting here.

  4. Are you sceptical about the progress of “science”?
    The pivotal promoter of “science”, Anthony Fauci, needed a redemption from a mentally unstable president to live?

  5. From the OP:

    The trick for the model is to remove (by subtraction) precisely the right noise so that what’s left is the image, just as the sculptor must remove precisely the right marble so that what’s left is the statue.

    How does the model predict* the noise that it should remove/subtract? Here’s a very rough analogy: We humans can look at a noisy image and figure out what the original, pristine image probably looked like. For instance, it isn’t difficult to look at this noisy image:

    teacache 00026

    …and infer that the original image looked something like this:

    teacache 00035

    The noise is whatever we have to subtract from the noisy image to get the original image:

    original_image = noisy_image - noise

    which is equivalent to

    noisy_image = original_image + noise

    But here’s the twist. It’s also possible that the original image looked like this:

    tarantula red knee (Custom)

    Intuitively, that seems impossible, but it’s actually true. Here’s a simplified explanation in terms of black-and-white images.

    A black-and-white image is an array of pixels, and each pixel can range from perfectly black at one end of the spectrum to perfectly white at the other end, with shades of gray in between. We can represent each pixel with a number that specifies where it falls on the spectrum. Let’s stipulate that the range of the numbers is from 0 to 10. In that case 0 corresponds to pure black, 10 corresponds to pure white, and the numbers 1-9 correspond to shades of gray, with smaller numbers corresponding to darker shades and larger numbers corresponding to lighter shades.

    Now consider a particular pixel in the original image. It has a value of 6, say. When noise is added to the original image, one of two things can happen to our pixel. It can either remain at 6, or it can change to another number from 0 to 10. Let’s say it changes to 8. In that case the noise is 2 for that particular pixel, since 6 + 2 = 8. Another pixel might start out at 5 and end up at 2, in which case the noise is -3, since 5 - 3 = 2.** A third pixel might start out at 7 and remain at 7, in which case the noise is 0, since 7 + 0 = 7.

    Note that no matter which starting and ending values you pick, there’s always a way to get to the ending value by adding or subtracting the right amount of noise. That is, the original pixel can be any number from 0-10, and the ending number can also be any number from 0-10.

    Think about what that means for the image as a whole. The original image is an array of values, one for each pixel, and the noise is also an array of values, one for each pixel. That means you can transform any image into any other image simply by picking the right noise values. So the tarantula above can be transformed into the noisy girl by adding the right noise pattern.
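    To make that concrete, here's a tiny Python sketch with made-up pixel values, showing that the right noise pattern turns any image into any other:

```python
import numpy as np

# Two arbitrary 2x2 "images" with pixel intensities 0-10 (made-up values)
girl = np.array([[6, 5], [7, 2]])
tarantula = np.array([[0, 9], [3, 10]])

# The noise that turns the tarantula into the girl is simply the
# pixel-by-pixel difference between the two images
noise = girl - tarantula

# Adding that noise to the tarantula reproduces the girl exactly
print(np.array_equal(tarantula + noise, girl))  # True
```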

    But wait — if any image can be transformed into any other image by adding the right noise pattern, how do we know that the first image is a noisy version of the girl and not of the tarantula? The answer is: we don’t. The reason we’re confident that the original image is of a girl is that the probability of getting a random noise pattern that transforms “girl” into “noisy girl” is far higher than the probability of getting a pattern that transforms “tarantula” into “noisy girl”.

    When we look at the noisy image and identify it as a girl, we are therefore making a statistical inference about the likely noise pattern. Image generation models do something similar, and that’s how they decide what noise to predict — that is, what noise they’ll subtract.

    The analogy isn’t perfect, but it gives you a rough idea of what the model is trying to accomplish. More on this later.

    * ‘Predict’ is the standard term, but it’s really more of a retrodiction. It’s the model’s best guess as to what the noise was. I think they use the word ‘predict’ because it’s analogous with next-token prediction in LLMs, although that isn’t really prediction either.

    ** For the technically-minded, what’s actually happening is that the noise is always added to the pixel. It’s just that we’re using modular arithmetic, so if the sum is greater than 10, it “wraps around” to 0 and increases from there. In the case of our pixel that starts out at 5 and ends up at 2, the noise is actually +8, because 5 + 8 = 13, which is greater than 10, so the sum wraps around to 0 and lands at 2. Doing it this way means that both the original pixel and the noise can be represented by numbers from 0 to 10, and the only operation needed is modular addition.
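    The footnote's wraparound arithmetic, sketched in Python (mod 11, since the pixel values run from 0 to 10):

```python
MOD = 11  # pixel values run 0-10, so addition wraps around modulo 11

def add_noise(pixel, noise):
    """Add noise to a pixel, wrapping past 10 back around to 0."""
    return (pixel + noise) % MOD

print(add_noise(5, 8))  # 2: 5 + 8 = 13, which wraps around to 2
print(add_noise(6, 2))  # 8: no wraparound needed
print(add_noise(7, 0))  # 7: zero noise leaves the pixel unchanged
```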

  6. keiths:

    ** For the technically-minded, what’s actually happening is that the noise is always added to the pixel. It’s just that we’re using modular arithmetic, so if the sum is greater than 10, it “wraps around” to 0 and increases from there. In the case of our pixel that starts out at 5 and ends up at 2, the noise is actually +8, because 5 + 8 = 13, which is greater than 10, so the sum wraps around to 0 and lands at 2.

    Those of us not so technically minded might need a bit more explanation. To me, if we’re using base 10 and we add 8 to 5, the addition sequence is 6,7,8,9,0,1,2,3. So we end up at 3, not 2. 13 in modular base 10 has one 10-chunk which cancels out by the wrap, plus 3. Where did I go wrong?

  7. Flint:

    Those of us not so technically minded might need a bit more explanation. To me, if we’re using base 10 and we add 8 to 5, the addition sequence is 6,7,8,9,0,1,2,3. So we end up at 3, not 2. 13 in modular base 10 has one 10-chunk which cancels out by the wrap, plus 3. Where did I go wrong?

    You’re missing the fact that I specified the range as 0-10, not 0-9:

    Let’s stipulate that the range of the numbers is from 0 to 10. In that case 0 corresponds to pure black, 10 corresponds to pure white, and the numbers 1-9 correspond to shades of gray, with smaller numbers corresponding to darker shades and larger numbers corresponding to lighter shades.

    So where you say the addition sequence for 5 + 8 is

    6,7,8,9,0,1,2,3

    …the actual sequence is

    6,7,8,9,10,0,1,2

    In other words, it’s modulo 11, not modulo 10.

    It looks like you’re confusing the modulus with the base. The modulus determines how many numbers there are before wraparound — in this case 11 — but the base determines how large a single-digit number gets before it changes to a two-digit number. So in my example the base is 10, but the modulus is 11.

  8. keiths:
    Flint:

    You’re missing the fact that I specified the range as 0-10, not 0-9:

    Yep, I missed that detail. When I was writing checksum code, the range was 00 to FF (where it wrapped), but the base was hex. (That was for a byte checksum. Just keep adding to the byte and ignore the carry.)

    Does the diffusion model you’re using use modulus 11?

  9. Flint:

    Does the diffusion model you’re using use modulus 11?

    No, I just picked mod 11 for my example because I thought it would be the most intuitive for readers. It’s analogous to typical volume knobs where 0 means no sound and 10 is full blast.

  10. Above, I use the multi-colored girl to show that even though we don’t realize it, we’re making a statistical inference about the noise in the noisy image when we infer that it is a picture of a girl. Diffusion models are also making such an inference.

    In that example, there’s a strong signal underneath the noise, and it’s the signal that enables us to infer what the original image most likely looks like. What about AI image generation, where we’re starting with random noise — TV static, effectively? There’s no signal there. It’s all noise.

    The answer is that the model doesn’t care. It just assumes that there’s signal present and proceeds as if it were. Even though we humans can’t see it, that random noise bears more of a resemblance to some possible images than others, and the model will make its predictions in a way that favors those more likely images.

    The model repeats the prediction process multiple times. The flow is

    1. Look at the current noisy image.
    2. “Predict” the most likely noise pattern.
    3. Subtract that noise from the image.
    4. Repeat steps 1-3 for a predetermined number of iterations.

    The initial prediction will nudge the image in a certain direction, and subsequent predictions will amplify that and keep it going in that general direction. Eventually the predictions get you to a recognizable and clear final image. Here’s how the process unfolds as the model generates the multicolored girl:

    Comfy-UI-00470-Phone
    Comfy-UI-00465-Phone
    Comfy-UI-00460-Phone
    Comfy-UI-00455-Phone
    Comfy-UI-00450-Phone
    Comfy-UI-00445-Phone

    The first image is pure noise, but each subsequent image is the result of subtracting the predicted noise from the previous image. Thus the gradual convergence.

    ETA: I should note that I’m actually cheating a little bit. There were more than five steps in the generation process (can’t remember how many), so there are some intermediate images that I left out above for the sake of brevity. In real life, for instance, the model wouldn’t be able to get from the second-to-last image to the last image in a single leap. But the principle is exactly how I described it.
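    The predict-and-subtract loop above can be sketched in toy Python. The predict_noise function here is a made-up stand-in for the trained network (it simply pretends the hidden image is flat mid-gray), so this shows the shape of the loop rather than real model behavior:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_noise(image):
    # Stand-in for the trained network's prediction. Here we pretend
    # that the image hidden under the noise is flat mid-gray
    # (intensity 5), so the "noise" is everything else.
    return image - 5.0

steps = 8
image = rng.uniform(0, 10, size=(4, 4))   # 1. start from pure noise

for _ in range(steps):
    predicted = predict_noise(image)      # 2. "predict" the noise
    image = image - predicted / steps     # 3. subtract a portion of it

# The image has drifted toward the "hidden" gray image
print(abs(image - 5.0).max())
```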

  11. keiths:

    In that example, there’s a strong signal underneath the noise, and it’s the signal that enables us to infer what the original image most likely looks like. What about AI image generation, where we’re starting with random noise — TV static, effectively? There’s no signal there. It’s all noise.

    If you start with pure noise, would the image converge the same way every time, or would the final image vary? Would it depend on the exact distribution of the noise, so a given noise sample will produce a given image?

  12. Flint:

    If you start with pure noise, would the image converge the same way every time, or would the final image vary? Would it depend on the exact distribution of the noise, so a given noise sample will produce a given image?

    Different starting noise patterns lead to different final images. That’s because the model assumes that there’s signal buried in the noise even though there really isn’t, so it’s gravitating toward an image that best matches the noise, and that will vary.

    Here’s an example. In the case of the multicolored girl, there’s a dark patch in the starting noise pattern. I’ve circled it below:

    Comfy-UI-00470-dark-spot-Phone

    If you look at the series of images in the earlier comment, you can see that the dark spot morphs into her eye. I believe the model saw that dark spot, effectively thought “that isn’t a coincidence — I’ll bet there was a feature there in the original image that the noise wasn’t able to obscure.” So it heads in a direction where that dark spot becomes a feature. It would be interesting to “seed” a bunch of dark patches into random noise patterns to see if they would consistently coalesce into features, but with all the other experiments I’m running, I’m not sure I’ll ever get around to it.

    The model will always make the same prediction when presented with the same starting noise pattern, so the entire process is deterministic. A given noise pattern will always lead to the same final image.

    EXCEPT: you can configure the model to inject some random noise on top of the noise predicted by the model, in which case the final image will vary depending on the injected noise even though the starting noise pattern is the same.

    Even with noise injection, though, the process will be deterministic if you use the same starting noise pattern and the same seed for the pseudorandom number generator that feeds into the noise injector.
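    The role of the seed is easy to demonstrate with a pseudorandom generator (NumPy's here, purely as an illustration):

```python
import numpy as np

# Same seed -> identical "random" starting noise. Feed a deterministic
# model the same starting noise and you get the same final image.
a = np.random.default_rng(seed=42).normal(size=(4, 4))
b = np.random.default_rng(seed=42).normal(size=(4, 4))
c = np.random.default_rng(seed=7).normal(size=(4, 4))

print(np.array_equal(a, b))  # True: same seed, same noise pattern
print(np.array_equal(a, c))  # False: different seed, different pattern
```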

    Also, I haven’t talked about the prompt yet. That’s deliberate — I want to keep things simple for now. But changing the prompt will obviously produce a different final image even if the starting noise pattern is the same. With a different prompt, for instance, that dark patch might coalesce into something other than an eye, or it might not coalesce into a feature at all.

  13. A couple of comments ago, I wrote:

    The initial prediction will nudge the image in a certain direction, and subsequent predictions will amplify that and keep it going in that general direction.

    That’s actually not a metaphor. There’s a sense in which the image really does move around — it’s just that it moves around in image space, not physical space.

    Image space (typically called ‘pixel space’, for reasons I’ll explain later) is an abstract mathematical space containing all possible images, including meaningful ones, nonsensical ones, boring ones like all black or all green, and every possible noise pattern. If it can be displayed on a screen, it lives somewhere in image space.

    There is a single point in image space for every possible image, including one point for each variation of the multicolored girl above. Thus, when the model predicts a noise pattern and subtracts it from the current image, producing a cleaner one, it is moving the image from one point in image space to a different point. It moves the image in a definite direction for a definite distance. It really is motion, albeit motion in an abstract space.

    To get a feel for this space, consider the black-and-white images I described earlier, where each pixel is represented by a number from 0 to 10, indicating the intensity of the pixel: 0 corresponds to pure black, 10 corresponds to pure white, and the numbers 1-9 correspond to shades of gray, with smaller numbers corresponding to darker shades and larger numbers corresponding to lighter shades.

    The simplest possible image is one that has only one pixel. Boring, but technically an image. The second simplest is one with two pixels. We’ll use a two-pixel image for now to make the image space easier to understand.

    How many possible two-pixel images are there in our system? Well, the first pixel can have any intensity value from 0 to 10, giving 11 possibilities. The second pixel also has 11 possibilities. That means that the number of possible images is 11 x 11, or 121. Our image space therefore needs to have 121 points.

    How do we arrange the points? The most elegant way is to arrange them in a square and to plop that square down on the x/y coordinate plane. The x-axis gets numbered from 0 to 10, and the x-coordinate signifies the intensity of the first pixel. The y-axis also gets numbered from 0 to 10, and the y-coordinate signifies the intensity of the second pixel. Now you can plot any image by using the pixel intensity values as x-y coordinates.

    In an all-black image, both pixels will have intensity 0. That corresponds to the point (0,0) on the graph. An all-white image will occupy point (10,10). If the first pixel is black, and the second pixel is white, then the image will occupy the point (0,10). If it’s the other way around, and the first pixel is white while the second pixel is black, then the image maps to the point (10,0). Those are the four corners of the square — the four corners of the image space. Every other possible image falls within that square, at a point (x,y) where x is the intensity of the first pixel and y is the intensity of the second.
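    Sketching the two-pixel image space in Python (ordinary Euclidean geometry, nothing model-specific):

```python
import math

# A two-pixel image is a point (x, y): x is the first pixel's intensity,
# y is the second's, each ranging from 0 to 10
all_black = (0, 0)
all_white = (10, 10)
black_then_white = (0, 10)

# Distances in image space are ordinary Euclidean distances
print(math.dist(all_black, all_white))         # about 14.14: the square's diagonal
print(math.dist(all_black, black_then_white))  # 10.0: one edge of the square
```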

    So for a two-pixel image, the image space is a square. Let’s bump that up to three pixels. Now there are three intensity values to plot, one for each pixel. We’re already using one axis for each of the first two pixels, so now we need an additional axis for the third one — the z-axis. Our image space is now three-dimensional, in the form of a cube, and the coordinates of any particular image are (x,y,z), where each of those is the intensity of the corresponding pixel.

    You can see the pattern. Every pixel we add requires a new axis, and the image space gains a dimension. A 4×4 image has 16 pixels, and therefore requires a 16-dimensional image space. A 512×512 image has 262,144 pixels, so the image space has 262,144 dimensions.

    These hyperdimensional spaces are impossible to visualize, but fortunately we don’t need to. The important point is that they are as legitimate mathematically as two- or three-dimensional space and can be manipulated using the same rules. There are directions and distances in these hyperdimensional spaces, just as there are directions and distances in three-dimensional space.

    The diffusion model pushes the image to a different location in image space with each predict-noise-and-subtract-it step. The goal is to push it to the location of a coherent image that matches the statistical characteristics of the images that the model was trained on.

    During training, the model effectively learns where the good spots are in image space and what spots to avoid, and during image generation, it uses what it has learned to steer the image in the right direction through image space.

  14. I’ve glossed over this so far, but every time the model makes a noise prediction, it’s actually predicting the noise that would need to be subtracted in order to go all the way through image space to what it thinks is the original, uncorrupted image.

    The predictions are imperfect and they can be downright bad, especially toward the beginning of image generation, because there is a lot of noise in the image during that time and relatively little signal. That means that if you take the leap and subtract all of the predicted noise, you’re unlikely to land in a good spot in image space. The model’s “aim” is off. It’s pointing in the right general direction, but that’s all.

    Since the aim is off, we don’t take the entire predicted leap. We scale the noise down so that we only go part of the way. Since we’re only going partway with each step, it takes multiple predict-and-subtract steps to actually reach our final image. But that pays off. The closer we get, the better our predictions are, because there’s more signal and less noise, so we home in on a very good image.

    Here are some examples showing how the final image improves when you use multiple steps. The prompt is

    A clown in a bright multicolored suit, standing on an iceberg and juggling.

    First image is what you get with only one step, where we’re taking the entire leap at once without scaling the noise down:
    x1 00001 (Custom)

    2 steps this time:
    x2 00001 (Custom)

    3 steps:
    x3 00001 (Custom)

    4 steps:
    x4 00001 (Custom)

    6 steps:
    x6 00001 (Custom)

    8 steps:
    x8 00001 (Custom)

    12 steps:
    x12 00001 (Custom)

    16 steps:
    x16 00001 (Custom)

    32 steps:
    x32 00001 (Custom)

  15. A few observations:

    1. All of the clown images above are finished images. That is, we ran a procedure, subtracted all of the predicted noise, and at the end of that procedure we arrived at a final location in image space. The blurry images are just as complete and final as the crisp ones, but since we took big leaps instead of dividing them up into smaller leaps, we missed the desirable portion of the image space. Our aim was off, and we didn’t have enough steps in which to fine-tune it. However, although our final images are blurry, they aren’t noisy, which is interesting.

    2. All of the multicolored girl images are unfinished images except for the last one. The noise hasn’t all been subtracted, so they are grainy.

    3. I can do the same thing with the iceberg clown. This is what the clown looks like when configured for 32 steps but interrupted after 16:
    x16 32 00001 (Custom)

    Lots of noise remaining.

    The upshot: a well-trained model, even when operating on an image with a lot of noise, can do a good job of getting rid of the noise in a single leap. What it can’t do in a single leap is aim precisely enough to get an image that is crisp as well as non-noisy. We get a blurry image.

  16. So far, the image generation flow I’ve described looks like this:

    1. We start with an image containing nothing but pure random noise, like TV static.

    2. We pretend that there’s a signal within the noise. That is, we assume that instead of looking at random noise, we’re looking at a clean original image that has been corrupted by noise:

    current_image = original_image + noise_pattern

    3. We ask: of all the possible noise patterns that could have corrupted the original image, which of those patterns is the most likely, given the noisy image we’re looking at right now? This is known as “predicting” the noise, though it’s really a sort of fictional retrodiction.

    4. We subtract a portion of the predicted noise from the current image. Within image space, that causes us to move in the general direction of the final image, but we don’t make the entire leap because our aim is still poor and jumping all at once would leave us in the wrong part of image space, with a blurry image.

    5. We repeat steps 3 and 4 a predetermined number of times, and then we have our final image, which looks like it could have been part of the training data though it really wasn’t.

    (We’re still neglecting the prompt for now.)

    The magic is all in the noise prediction. How on earth can the model look at a noisy image and predict what the noise most likely was? The answer is training: we have to train the model to associate noisy images with the noise that corrupted them. The noise predictor is a neural network, and we have to teach it to predict noise accurately.

    We do that by giving the model zillions of examples of images that we have deliberately corrupted by adding noise. We feed each noisy image into the model alongside the noise that we used to corrupt it, and we effectively tell the model “If you see a noisy image that looks sort of like this, you should predict a noise pattern that looks sort of like that.”

    The noise predictor is a neural network, and teaching it means adjusting its synaptic weights so that it gives better answers.

    Spelling it out, we:

    1. Randomly select an image from the training dataset.

    2. Generate a random noise pattern.

    3. Corrupt the image by adding the noise to it.

    4. Present the corrupted image to the neural network and ask it to predict the noise pattern that was added.

    5. Compare the predicted noise to the actual noise. The prediction will never be perfect. There will always be an error — a difference between predicted and actual.

    6. Adjust the synaptic weights of the network in a way that reduces the error. By reducing the error, you improve the prediction.

    7. Repeat steps 1-6 a zillion times over the entire collection of training images.
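    Here's the training flow as a toy Python loop. This is not a real diffusion model: the "model" is a single learned number standing in for a network's millions of weights, but the seven steps are the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training dataset": flat gray 4x4 images (made-up intensity values)
dataset = [np.full((4, 4), v) for v in (3.0, 5.0, 7.0)]

# Toy "model": a single learned number. It predicts the noise in a
# corrupted image as (noisy_image - estimate), so the better the estimate
# matches the dataset's typical pixel value, the better its predictions.
estimate = 0.0
lr = 0.01  # learning rate: how much to adjust per example

for _ in range(2000):
    image = dataset[rng.integers(len(dataset))]   # 1. pick a training image
    noise = rng.normal(0, 1, size=image.shape)    # 2. random noise pattern
    noisy = image + noise                         # 3. corrupt the image
    predicted = noisy - estimate                  # 4. predict the noise
    error = predicted - noise                     # 5. compare to the actual
    estimate += lr * error.mean()                 # 6. adjust to reduce error

print(estimate)  # lands near 5.0, the dataset's typical pixel intensity
```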

    At the end of the process, the model will have implicitly learned the statistical characteristics of the training dataset. When it encounters a noisy, corrupted image, it will notice the difference between the statistical characteristics of that image and the overall statistical characteristics of the training data, and it will be able to predict the most likely noise pattern. Subtracting the predicted noise iteratively, we arrive at a final image that shares the statistical characteristics of the training images. In other words, we arrive at an image that wasn’t part of the training dataset but looks statistically like it could have been. It fits right in.

    Simplified concrete example:
    Suppose we have trained our model on 3.7 zillion cat pictures and nothing else. From all those examples, the model has thoroughly learned the statistical characteristics that make an image cattish. When we present it with a random noise pattern at the beginning of image generation, it asks itself “what is the most likely noise pattern that could have turned a cattish image into the image I’m staring at right now?” At each stage of image generation, it subtracts a portion of that noise, thus moving the image in a cattish direction. We arrive in the end at an image that is very cattish and fits right in with the training images, though it wasn’t one of them.

    In a nutshell: we deliberately corrupt a bunch of images in order to teach the model how to un-corrupt them, and then we let it loose on fake corrupted images — pure noise — and ask it to un-corrupt those. We end up with images that look like they could have been part of the training dataset, but weren’t. Instead of creating an image from the ground up, we just assume that the image is already there, but corrupted by noise, and our job is simply to get rid of the noise and reveal the image that was there all along.

    It’s wild. Crazy that it actually works so well.

    More later on how the synaptic weights get adjusted during training in order to improve the model’s noise predictions.

  17. Sounds like directed evolution. Where is the difference?

    Back when we were discussing Weasel, I wrote a little program that converged on words. Not any particular word, but any of the 200,000 or so that I downloaded.

    It was not pretty, but it worked. It sounds like LLMs are like that, only good.

  18. petrushka:

    Sounds like directed evolution. Where is the difference?

    Well, there’s no selection. Each image begets just one offspring — the image minus some of the predicted noise. There are no other images to compete against. It’s guaranteed that each image will reproduce, and it’s guaranteed that its single offspring will survive and reproduce as well.

    In directed evolution, by contrast, the mutations are targeted but selection still operates. There’s competition, and only the fittest survive.

    Back when we were discussing Weasel, I wrote a little program that converged on words. Not any particular word, but any of the 200,000 or so that I downloaded.

    It was not pretty, but it worked. It sounds like LLMs are like that, only good.

    If your program was Weaselish, then you presumably had selection. You’d produce multiple offspring with differing mutations. The ones that best matched one of the words in your 200,000-word corpus would survive and reproduce. The others would die childless. Also, your mutations presumably were random, not directed.

    LLMs, like diffusion models, don’t have selection. Also, the parent/offspring scheme doesn’t hold for LLMs. If an LLM is generating the sentence “The cat sat on the mat”, “mat” isn’t the offspring of “the” — it’s the offspring of “The cat sat on the”. So what you’re producing with each token isn’t the next parent — it’s the piece that gets welded onto the current parent in order to get the next parent:

    “the” begets “cat”
    “the cat” begets “sat”
    “the cat sat” begets “on”
    “the cat sat on” begets “the”
    “the cat sat on the” begets “mat”

    The mutations are very much directed, unlike in Weasel, though LLMs typically add some randomness on top of the directed mutations. And the mutations always add to the parent. They never change it or subtract from it.

  19. As explained earlier, here’s the training flow for the diffusion model:

    1. Randomly select an image from the training dataset.

    2. Generate a random noise pattern.

    3. Corrupt the image by adding the noise to it.

    4. Present the corrupted image to the neural network and ask it to predict the noise pattern that was added.

    5. Compare the predicted noise to the actual noise. The prediction will never be perfect. There will always be an error — a difference between predicted and actual.

    6. Adjust the synaptic weights of the network in a way that reduces the error. By reducing the error, you improve the prediction.

    7. Repeat steps 1-6 a zillion times over the entire collection of training images.
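    Steps 1-7 can be sketched with toy stand-ins (pure Python; the "images" are small random vectors, the "network" is a single weight matrix, and the step-6 adjustment uses a simple error-times-input nudge of the kind derived more carefully in later comments). This is a sketch of the training flow, not a real diffusion model:

    ```python
    import random

    random.seed(0)
    D = 8                                         # "pixels" per toy image
    dataset = [[random.gauss(0, 1) for _ in range(D)] for _ in range(100)]
    W = [[0.0] * D for _ in range(D)]             # the "network": one weight matrix

    lr = 0.01
    for _ in range(4000):
        image = random.choice(dataset)                    # 1. pick a training image
        noise = [random.gauss(0, 1) for _ in range(D)]    # 2. random noise pattern
        noisy = [i + n for i, n in zip(image, noise)]     # 3. corrupt the image
        predicted = [sum(W[r][c] * noisy[c] for c in range(D))
                     for r in range(D)]                   # 4. predict the noise
        error = [p - n for p, n in zip(predicted, noise)] # 5. predicted minus actual
        for r in range(D):                                # 6. nudge weights to reduce error
            for c in range(D):
                W[r][c] -= lr * error[r] * noisy[c]

    # Check: the trained model should predict the noise better than
    # the do-nothing baseline of predicting all zeros.
    model_loss = baseline_loss = 0.0
    trials = 300
    for _ in range(trials):
        image = random.choice(dataset)
        noise = [random.gauss(0, 1) for _ in range(D)]
        noisy = [i + n for i, n in zip(image, noise)]
        pred = [sum(W[r][c] * noisy[c] for c in range(D)) for r in range(D)]
        model_loss += sum((p - n) ** 2 for p, n in zip(pred, noise)) / trials
        baseline_loss += sum(n ** 2 for n in noise) / trials

    print(model_loss, baseline_loss)  # trained loss is well below the baseline
    ```

    Real diffusion models replace the weight matrix with a deep network and also tell the network how much noise was added, but the loop has the same shape.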

    How is #6 accomplished, and how does the system figure out which weights need to be adjusted and by how much?

    Here’s an artificial neuron:
    [diagram of an artificial neuron]

    An artificial neuron isn’t a physical object. It’s a software creation that runs on normal computer hardware. Its design is inspired by the way real neurons work, but there are important differences.

    Here’s a description of how it operates. Multiple inputs come in from the left, get processed by the neuron, and produce a single output to the right. Each input is a numerical value, and the output is also a numerical value.

    Each input (one of the Xs in the diagram) is multiplied by a weight (one of the Ws). The results are added together, and the bias term b is also added. The addition happens in the unit labeled ‘Σ’. The sum is passed through the activation function (labeled ‘σ’) and the output of the activation function is the output of the neuron, labeled ‘Y’.

    In the earliest artificial neurons, the activation function was simple. If the input to the function was greater than a fixed threshold, the neuron would fire. Otherwise it wouldn’t. This is similar to how biological neurons work — they either fire or they don’t. That’s no longer true for artificial neurons. For technical reasons that I’ll describe later, they deviate from this fire-or-don’t-fire scheme. Instead of an all-or-nothing output, you get a number.

    If you think about it, each neuron is really just a mathematical function. It takes a bunch of numbers as inputs and produces a number as its output. That’s reflected in the diagram by the expression

        \[ Y = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right) \]

    …where n is the number of inputs. (There is an error in the diagram. They put a subscript on the b, when there is really just one b value for the neuron.)
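    That formula translates directly into a small Python function. Here I've used a sigmoid for the activation function σ (one common choice, discussed later in the thread); the inputs, weights, and bias are made-up numbers for illustration:

    ```python
    import math

    def neuron(inputs, weights, bias):
        # Weighted sum of the inputs, plus the single bias term,
        # passed through the activation function.
        s = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1.0 / (1.0 + math.exp(-s))   # sigmoid activation

    y = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.1, 0.4, 0.2], bias=0.05)
    print(y)   # a single number between 0 and 1
    ```

    Note that there's exactly one `bias` argument, matching the point about the diagram's error: one b value per neuron, not one per input.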

    Neural networks are just networks formed by hooking a bunch of artificial neurons together. Here’s a common architecture:
    [diagram of a fully connected neural network]

    The information flow is again from left to right, but now the blue circles represent individual neurons. In this network, every neuron in a given layer is connected to every neuron in the next layer, so the interactions can be pretty complicated. In modern state-of-the-art LLMs, there are more than a hundred layers.

    The complexity of a neural network isn’t typically measured in terms of the number of neurons. It’s measured by the number of parameters, where every weight and bias counts as a parameter. This makes sense, because the number of inputs to a neuron can vary widely depending on the kind of network it’s a part of. The number can range from single digits to thousands, and it would be silly to count a 9-input neuron the same way you would a 999-input one. The parameter count for the networks of the most advanced LLMs these days is approaching a trillion.
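    To make the parameter counting concrete, here's how you'd tally the weights and biases in a small fully connected network (the layer sizes are arbitrary, chosen just for illustration):

    ```python
    def count_parameters(layer_sizes):
        # In a fully connected network, each neuron in a layer has one
        # weight per neuron in the previous layer, plus one bias.
        total = 0
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
            total += n_in * n_out   # weights between the two layers
            total += n_out          # one bias per neuron in the next layer
        return total

    # e.g. 784 inputs, hidden layers of 128 and 64 neurons, 10 outputs:
    print(count_parameters([784, 128, 64, 10]))   # 109386
    ```

    The weight terms dominate, which is why a neuron with thousands of inputs contributes so much more to the count than a 9-input one.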

    The weights and biases of all the neurons are modified after each training step in a way that reduces the error — the discrepancy between the output of the network and what we want the output to be. In the case of a diffusion model, the network receives a noisy image as input, and the output is the network’s best guess as to what the noise pattern is.

    Suppose we’re at a given training step. We’ve taken a random image from the training dataset and added a random noise pattern to it. We want to train the network so that when it sees a noisy image that looks sort of like the one we just created, the output of the network will be an accurate prediction of the noise that was added to the original.

    We present the noisy image to the network and observe its output — the predicted noise pattern — and compare that to the actual noise pattern we used. The difference is the error, called the “loss” in the jargon. We want to reduce the loss, so we adjust the weights and biases in a way that will reduce the loss. But how can we do that when the network contains hundreds of billions of parameters, all of them interacting with each other?

    The solution is to change the parameters one by one while leaving everything else alone. One way to do that would be by trial and error: hold all of the parameters constant except for the one you’re adjusting. Try various values and settle on the one that gives you the best results — the least loss.

    Or more likely, just nudge the parameter in the direction of the optimal value rather than changing it wholesale. There are reasons for preferring incremental steps that I’ll discuss later.

    As I just described it, the training process would be horrendously inefficient, because you’d have to re-evaluate the outputs of the network every time you made a change to a parameter. There’s a way to make the process much more efficient which I’ll describe later.

    For now, the idea is this:

    1) present an input to the network;

    2) observe the output;

    3) compare that to the output you wanted; the difference is the error, or “loss”;

    4) one by one, adjust the parameters to reduce the loss while holding everything else constant; and

    5) repeat steps 1-4 zillions of times.

    If you do that, and if the network is large enough and has the right architecture, and if your training dataset is sufficient, the parameters will settle down to values that give consistently good results for a wide range of inputs. Training will have taught the network how to produce good outputs. That’s why they call it “machine learning”.

  20. Training would be horrendously inefficient if we had to tweak parameters one by one and observe the effect on the outputs of the network. Fortunately, there’s a much faster way.

    As explained above, each neuron can be seen as a mathematical function:

        \[ Y = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right) \]

    The whole network is therefore a bunch of these functions chained together. If you chain a bunch of functions together, what you have is still a function. The whole network is a function. It’s a vector-valued function, because there are many outputs, but it’s a function nonetheless. Each element of the vector is a single output, and that output can be described as a function of all the inputs, all the weights, and all the bias terms in the network. It’s a multivariable function. And not just multivariable, but hypervariable, with billions of arguments in the case of LLMs.

    But functions are functions, and there’s a neat property of functions that we can take advantage of. Under the right conditions, if we have a formula for our function, we can derive a formula for how the output changes in response to changes in a given input. (Those of you who’ve taken multivariable calculus will recognize that what I’m talking about is the partial derivative with respect to that input.) Instead of having to plug in different values for the parameter we’re adjusting and then re-evaluate all of the network outputs, we have a formula that will tell us directly how to adjust the parameter to get the desired results.

    Here’s a simple example that shows how this works. Instead of looking at a neural network, let’s look at a much simpler function:

        \[ f(x,y) = x^2 + y^2 \]

    We have two inputs, x and y, and one output. Let’s say that x and y are currently 3 and 4, respectively. Then the value of the function is

        \[ f(x,y) = 3^2 + 4^2 = 9 + 16 = 25 \]

    Suppose we want to make the output smaller by adjusting x, but we don’t want to have to plug in values for x in order to discover which values have the desired effect. (I know, it’s perfectly obvious from looking at the function that decreasing x will decrease the output at this location, but the principle I’m illustrating here will work even when it isn’t obvious how a function will respond to a change in its inputs. That’s what we need, because the functions we’ll be dealing with are super complex.)

    What is the formula that will tell us how the function will respond to changes in x? Those of you who know calculus will be able to figure it out from the function itself:

        \[ f(x,y) = x^2 + y^2 \]

    We differentiate with respect to x, holding y constant, giving

        \[ \frac{\partial f}{\partial x} = 2x \]

    If you don’t know calculus, you’ll have to take my word for it, but the formula is 2x and the value of that formula at x = 3 is 6, so we know that in the immediate vicinity of x = 3, for every tiny increase you make to x, the function will increase by about 6 times as much. This also means that for every tiny decrease in x, the function will decrease by about 6 times as much. Since we want to make the output smaller, we need to make x smaller, and the magnitude of 6 gives us an idea of how large the effect is.
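    You can check that claim numerically, even without calculus: nudge x by a tiny amount and see how much the function changes. That ratio (a finite-difference approximation) should come out very close to the formula's value of 6:

    ```python
    def f(x, y):
        return x**2 + y**2

    h = 1e-6
    x, y = 3.0, 4.0
    numerical = (f(x + h, y) - f(x, y)) / h   # finite-difference estimate of df/dx
    analytic = 2 * x                          # the formula from the calculus
    print(numerical, analytic)                # both very close to 6
    ```

    The point of the partial-derivative formula is that it gives you the 6 directly, without having to re-evaluate the function at nudged inputs, and that's what makes it so valuable when the function is a network with billions of parameters.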

    The equation for a neural network output is vastly more complicated than the example I just used, but the principle is the same. If you can take the partial derivative of a function with respect to the parameter you’re adjusting, you can discover whether you need to increase or decrease the parameter and you can get an idea of how big the effect will be.

    All of this depends on our ability to take partial derivatives. There’s a major hitch, though. I described above how early artificial neurons would compare the weighted sum of the inputs and the bias term against a threshold, and then give an all-or-nothing answer depending on whether the weighted sum was greater than or less than the threshold value. That means the output of such a neuron would be a step function. It would look something like this:

    [plot of a step function]

    The x-axis represents the weighted sum, and the function instantaneously shoots up from 0 to 1 the moment we cross the threshold. (The threshold happens to be zero in this particular case.) That means that the partial derivative doesn’t exist at that point, which means we can’t use the partial derivative trick for adjusting parameters. In mathematical language, we say that the function isn’t differentiable at that point.

    The solution? Substitute a function that is differentiable there but still has an approximately step-like behavior. For example, the function

        \[ \frac{1}{1 + e^{-6x}} \]

    …gives us this:

    [plot of a sigmoid function]

    There’s a variety of functions to choose from, and people will choose different ones depending on the application, but the point is that they are all differentiable and they all have a steplike profile, which means that we can use the partial derivative method for adjusting parameters.

    Some more technical details to follow.

  21. A technical note:

    I got something wrong in the comment above. I thought the reason the step function was replaced with the s-curve (aka the ‘sigmoid’) was that the step function isn’t differentiable. It’s true that it isn’t differentiable at 0, where it shifts instantaneously from 0 to 1, making the derivative undefined there. However, you could easily fix this by “plugging the hole” where the derivative is undefined. On either side of 0, the value of the step function doesn’t change (it’s always 0 to the left and always 1 to the right), so the derivative is 0 everywhere except at that single point.

    If the derivative is 0 on either side with a hole at a single point in the middle, you can just plug that hole with a 0 and you now have a function that is a straight line with a constant value of 0 for all values of x. And that’s the real problem. If the derivative is always 0, then it gives you no information during training about which way to nudge the parameter in order to reduce the error (aka the loss). That makes training impossible.

    You need a function with a derivative that varies. The sigmoid is such a function. It also happens to be differentiable everywhere, meaning that you don’t have to do any hole-plugging, but the differentiability isn’t what makes it suitable for training — it’s the variation in the value of the derivative.

    I figured this out because I learned that the sigmoid was largely replaced by a third function known as the “ReLU” (“Rectified Linear Unit”). It looks like this:

    [plot of the ReLU function]

    It’s defined as f(x) = max(0,x). The sharp corner at 0 means that it isn’t differentiable there. That threw me off at first, because I thought the whole point of getting rid of the step function was to replace it with something that is differentiable everywhere. However, ReLU is like the step function in that you can easily plug the hole in the derivative. But unlike the step function, the derivative actually contains useful information for training because it varies depending on which side of 0 you’re on. It’s 0 to the left of 0, and 1 to the right.

    Even ReLU has largely been replaced by variations, but I won’t describe those here. I just wanted to make the point that the issue isn’t whether the function is differentiable everywhere — it’s whether the derivative varies and is therefore informative for training purposes.
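    The contrast between the three activation functions shows up directly in code. Here are all three with their derivatives, plugging the holes at 0 by convention as described above:

    ```python
    import math

    def step(x):        return 1.0 if x > 0 else 0.0
    def step_deriv(x):  return 0.0    # flat on both sides; hole at 0 plugged with 0

    def sigmoid(x):        return 1.0 / (1.0 + math.exp(-x))
    def sigmoid_deriv(x):  return sigmoid(x) * (1.0 - sigmoid(x))  # varies with x

    def relu(x):        return max(0.0, x)
    def relu_deriv(x):  return 1.0 if x > 0 else 0.0  # hole at 0 plugged with 0

    for x in (-2.0, -0.5, 0.5, 2.0):
        print(x, step_deriv(x), round(sigmoid_deriv(x), 3), relu_deriv(x))
    ```

    The step function's derivative is 0 everywhere, so it carries no training signal. The sigmoid's derivative varies smoothly, and even ReLU's derivative, crude as it is, differs on the two sides of 0, which is enough to be informative.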

  22. In all of the above, I forgot to mention that this training method is known as “backpropagation”, for a reason I’ll now explain.

    Recall that the equation of a single artificial neuron is

        \[ Y = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right) \]

    The xs are the inputs and the ws are the weights. In a multilayer network, the inputs to a given layer are the outputs from the previous layer, and the information flows forward from each layer to the next. So in the expression above, each of the xs can be seen as a function of the previous layer’s inputs. And that layer’s inputs can be seen as a function of its previous layer’s inputs. And so on, all the way back to the original inputs on the left, regardless of how many layers there are:

    [diagram of a multilayer network, with each layer feeding the next]

    So the outputs of the network can be seen as functions of functions of functions… of functions of the network’s inputs. They are chains of functions, in other words. In training, we want to adjust all of the weights in the entire network so as to reduce the error — that is, to cause the output of the network to better match the desired output. We therefore need to know, for each weight, which way to nudge it to reduce the error, and by how much.

    As explained earlier, this means knowing, for each weight, the partial derivative of the output with respect to that particular weight. Since the overall function is a chain of functions of individual neurons, this means that we can apply the chain rule to get the overall derivative from the derivatives of each neuron.

    Suppose we’re trying to adjust the weights in one of the chains, and we’re working on the weights of the sixth-to-last neuron in that chain. Assign the following letters to the outputs of each neuron in the chain:

    sixth-to-last: f
    fifth-to-last: g
    fourth-to-last: h
    third-to-last: i
    second-to-last: j
    last: k

    The particular weight we’re interested in is a weight of the sixth-to-last neuron. Call it w. We want to know what effect adjusting w has on the output at the end of the chain, which is k. In calculus terms, we want to know

        \[ \frac{\partial k}{\partial w} \]

    The chain rule tells us that

        \[ \frac{\partial k}{\partial w} = \frac{\partial k}{\partial j} \cdot \frac{\partial j}{\partial i} \cdot \frac{\partial i}{\partial h} \cdot \frac{\partial h}{\partial g} \cdot \frac{\partial g}{\partial f} \cdot \frac{\partial f}{\partial w} \]

    …which means that by knowing the partial derivative at each link in the chain, we can get the partial derivative of the entire chain with respect to w, and that enables us to adjust w properly.

    In practice, that means that we observe the error at the output of the network and then work backwards through the layers, using the partial derivatives to tell us how to adjust the weights as we go in order to reduce the error. So while information flows forward from left to right during the normal operation of the network, it flows backward during training. Thus the name “backpropagation”.
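    Here's a tiny numerical check of the chain-rule idea, with ordinary one-input functions standing in for the neurons in the chain (much simpler than a real network, but the principle is identical):

    ```python
    # A chain: w -> f -> g -> k, with simple stand-in functions.
    def f(w): return w * w            # df/dw = 2w
    def g(f_val): return 3 * f_val    # dg/df = 3
    def k(g_val): return g_val + 1    # dk/dg = 1

    w = 2.0
    # Chain rule: dk/dw = (dk/dg) * (dg/df) * (df/dw)
    chain_rule = 1 * 3 * (2 * w)

    # Finite-difference check on the whole chain, for comparison
    h = 1e-6
    numerical = (k(g(f(w + h))) - k(g(f(w)))) / h
    print(chain_rule, numerical)   # both very close to 12
    ```

    The product of the easy per-link derivatives matches the derivative of the whole chain, which is exactly what lets backpropagation work backwards one layer at a time instead of re-evaluating the entire network.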

  23. The backpropagation procedure I just described is a form of what is known as “gradient descent”. You’ll hear lots of references to gradient descent in discussions of AI, so I thought it was worth an explanation.

    Earlier in the thread, I explained how an entire image can be represented by a single point in a special mathematical space referred to as “image space” (or more commonly “pixel space”). In my example, each pixel in a black and white image is represented by an intensity value in the range 0-10. Pixel space is a hyperdimensional space in which there is a separate dimension for every pixel. A 512×512 black and white image has 262,144 pixels, so the corresponding image space has 262,144 dimensions.

    In 3D space, a point can be specified using 3 numbers: for example, (3, 0, 19). In a 262,144-dimensional space, we need 262,144 numbers: (Intensity of the first pixel, intensity of the second pixel, intensity of the third pixel… intensity of the 262,144th pixel). That specifies a single point in this weird mathematical space, so a single point is all you need to represent an entire image.

    As with pixels, so with weights (and biases). You can imagine a “weight space” or “parameter space” in which every weight gets its own dimension.* The entire neural network is then representable as a single point in this space, and the coordinates of that point are just the individual weight values: (value of weight 1, value of weight 2, value of weight 3,… value of last weight). There are hundreds of billions of weights in a modern state-of-the-art LLM, so the weight space for that LLM has hundreds of billions of dimensions. It’s impossible to visualize, but that doesn’t matter, because the math works regardless.

    Just as the image generation process changes the pixels in order to steer the image to a good location in image space, training changes the weights in order to steer the neural network to a good location in weight space.

    Now, recall that the purpose of training is to minimize the error (or “loss”) of the network — the discrepancy between the output it produces and the output we want it to produce. That means that the magnitude of the error depends on where we are in weight space. In other words, the magnitude of the loss is a function of the location in weight space. The loss function therefore forms a surface in weight space.* The higher the surface, the greater the loss, so our goal in training is to slide down this surface to the lowest point we can reach.

    [Veterans of the intelligent design wars here and at Uncommon Descent will remember our discussions of the “fitness landscape” and whether evolution would get trapped on “islands of function”. The fitness landscape can be thought of (loosely) as a surface in gene space, and evolution’s “goal” is to climb toward higher points on that surface. In that way it’s similar to our goal in weight space, except that we want to go lower, not higher, in weight space.]

    So the loss function forms a surface in weight space. At any point on that surface, there’s a particular direction in which we could move in order to climb the steepest. That direction, together with the steepness, is a vector, and that vector is called the “gradient”.

    We don’t want to increase the loss — we want to reduce it. If going in the direction of the gradient gives us the steepest ascent, going in the exact opposite direction will give us the steepest descent, and that’s what we want. That procedure — going in the exact opposite direction of the gradient — is known as “gradient descent”. The neural network is a single point on the loss surface in weight space, and our goal is to slide that point down the loss surface to the lowest point we can reasonably get to.

    We need to know the gradient so that we can go in the opposite direction. How do we determine the gradient? Turns out we’ve already done it. In my description of backpropagation above, I talked about finding the partial derivatives of the output with respect to every weight. Each of those derivatives is the component of the gradient along the axis of the corresponding dimension. In other words, if we simply combine all of the partial derivatives into a vector, we have the gradient.

    Going in the opposite direction of the gradient is therefore equivalent to nudging each weight against the sign of its partial derivative, by an amount proportional to that derivative, and we’ve already described how that is done. We just have a new name for it, “gradient descent”, and a new intuition for it, which is taking the steepest path down a surface in a hyperdimensional space.

    * I’m being deliberately sloppy in this description just to keep things simple. The weight space actually includes both weights and biases (which is why it’s often called “parameter space” instead of “weight space”), and the space in which the loss function lives actually has one more dimension than the weight space, because you need an extra dimension in which to represent the value of the loss function. I just didn’t want to clutter the description with those distracting details.
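    To make gradient descent concrete, here it is applied to the simple surface from the earlier partial-derivative example, f(x, y) = x² + y². The gradient is (2x, 2y), and repeatedly stepping in the opposite direction slides the point down to the minimum at the origin:

    ```python
    x, y = 3.0, 4.0        # starting point, as in the earlier example
    lr = 0.1               # step size (the "learning rate")

    for _ in range(100):
        grad_x, grad_y = 2 * x, 2 * y     # the gradient of x^2 + y^2
        x -= lr * grad_x                  # move opposite the gradient
        y -= lr * grad_y

    print(x, y, x**2 + y**2)   # essentially at the minimum (0, 0)
    ```

    Training a neural network is this same loop, except that the surface is the loss function, the coordinates are the billions of weights and biases, and the gradient components come from backpropagation.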

  24. After reading my description of how image generation works, you might be wondering where the term “diffusion model” comes from. Nothing stands out as particularly diffusion-like about the process.

    The answer is that it’s the training, not the generation, that looks (sort of) like a diffusion process.

    Recall that during training, we select an image from the training dataset, add some random noise to it, and then teach the network to associate the noisy image with the random noise pattern that was used to corrupt it. Also recall that an entire image can be represented by a single point in pixel space.

    When we add random noise to an image, it jumps from one point in pixel space to another, and it moves in a random direction. If we do that over and over, the point takes a random walk through pixel space. The path it takes looks like Brownian motion, which is what we see during real-life diffusion. A molecule gets randomly bumped and jostled by the molecules around it, so it follows a Brownian motion path — a random walk.

    Note that the analogy is imperfect, because we don’t actually add noise again and again to the same image during training. We add noise once, teach the network, and then go on to another image. So the point doesn’t actually follow a full Brownian motion path — it just takes one step along a random walk. Still, the math is similar, the people who invented this technique saw the obvious parallel with real-life diffusion, and the name stuck.

    So the “diffusion” happens during training, and what happens during image generation is actually akin to reverse diffusion. Diffusion running backwards in time.
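    The random walk through pixel space is easy to sketch. Here a tiny "image" is just a list of pixel intensities, and each round of noise is one random step; the point drifts farther from where it started (unlike real training, which, as noted above, takes only a single such step per image):

    ```python
    import random

    random.seed(1)
    image = [5.0] * 16   # a tiny uniform "image": one intensity per pixel

    def add_noise(point, scale=1.0):
        # One random step in pixel space: jiggle every coordinate.
        return [p + random.gauss(0.0, scale) for p in point]

    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    point = image
    distances = []
    for step in range(1, 101):
        point = add_noise(point)
        if step in (1, 10, 100):
            distances.append(distance(point, image))

    print(distances)   # the point wanders farther and farther from the original
    ```

    That wandering path is the Brownian-motion-like behavior the name "diffusion" refers to; generation runs the process in the other direction, stepping back toward the image.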

  25. My son just prompted his AI provider to convert photos of our grandchildren to the style of Caravaggio and Gainsborough.

    The results are remarkable. I am reminded of the great caricaturists who worked for Mad Magazine.

  26. petrushka:

    My son just prompted his AI provider to convert photos of our grandchildren to the style of Caravaggio and Gainsborough.

    The results are remarkable. I am reminded of the great caricaturists who worked for Mad Magazine.

    It feels like magic, doesn’t it? That’s the reason I’m so obsessed with understanding how it all works.

  27. keiths: It feels like magic, doesn’t it?

    As someone who grew up in the 50s, it reminds me of all the promising technologies that were emerging.

    /oracle mode

  28. I forgot to mention one thing about AI: Artificial Intelligence locks in ideas. I fed it false information on the causes of a disease, but backed it up with reasonable-sounding data, and it bought it…
    Even my professional profile can be manipulated…
    I laugh at it!

Leave a Reply