Diffusion Model Fundamentals: Generating from Noise

Updated 2026-04-12

Introduction: A New Paradigm in Generative Modeling

In 2020, Ho et al. proposed a counterintuitive generation method in Denoising Diffusion Probabilistic Models (DDPM): gradually corrupt data into pure noise, then train a neural network to reverse this process. This simple idea gave rise to Stable Diffusion, DALL-E, Midjourney, and a wave of products that transformed creative workflows.

The core insight of diffusion models can be stated in one sentence: if you know how to turn an image into noise, you can learn how to recover an image from noise. This “destroy-then-restore” framework gives diffusion models stable training and excellent generation quality, allowing them to rapidly surpass GANs as the dominant approach for image generation.

The Generative Model Family

Before diving into diffusion models, let’s review the major generative paradigms:

| Model | Core Idea | Strengths | Weaknesses |
| --- | --- | --- | --- |
| GAN | Generator vs. discriminator adversarial training | Fast generation, high quality | Unstable training, mode collapse |
| VAE | Encode to latent space, then decode | Stable training, probabilistic | Blurry outputs |
| Flow | Invertible transforms, exact likelihood | Exact density estimation | Architectural constraints, expensive |
| Diffusion | Iterative denoising | Stable training, highest quality | Slow sampling |

Diffusion models have a clean training objective (predict the noise), require no adversarial training, and impose no invertibility constraints on the architecture. This simplicity is a key factor in their success.

Forward Diffusion: Gradual Noise Addition

The forward process is a fixed Markov chain that adds small amounts of Gaussian noise at each step. After T steps, any data distribution converges to a standard Gaussian distribution.

Mathematically, each step is defined as:

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I)

where β_t is the noise schedule parameter. The crucial property is that we can jump directly from x_0 to x_t at any timestep t:

q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) I)

where \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s). This means training doesn't require sequential noising: just sample a random timestep t and directly compute x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, where \epsilon \sim \mathcal{N}(0, I).
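A minimal PyTorch sketch of this one-step noising, assuming a linear β schedule; the helper name q_sample and the tensor names are illustrative, not from a specific library:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear schedule from the DDPM paper
alphas_cumprod = torch.cumprod(1.0 - betas, 0)   # a_bar_t for t = 0..T-1

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in one jump: sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)  # broadcast over a (B, C, H, W) batch
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Usage: pick a random timestep per image and noise it directly, no sequential chain needed.
x0 = torch.randn(8, 3, 64, 64)                   # stand-in for a batch of images in [-1, 1]
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t, torch.randn_like(x0))
```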

[Interactive figure: forward diffusion process. A slider moves t from 0 (original image) to 50 (pure noise), showing the signal retention ᾱ_t and the closed-form q(x_t | x_0).]

Drag the slider to observe: at t = 0 the image is intact (ᾱ_0 ≈ 1); as t increases, the signal is progressively drowned by noise; at t = T the image becomes pure noise (ᾱ_T ≈ 0).

Reverse Denoising: Recovering from Noise

Generation in diffusion models is simply the reverse of the forward process: starting from pure noise x_T ~ N(0, I), iteratively denoise until a clean image x_0 is recovered.

Each reverse step is parameterized as:

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)

DDPM’s key insight: instead of predicting the denoised image directly, have the model predict the noise at the current timestep. The training loss is remarkably simple:

\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]

A straightforward MSE between the model's predicted noise ε_θ and the actual noise ε that was added.
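As a rough sketch, one training step then looks like the following (reusing the q_sample helper and schedule from the forward-process example above; model stands in for any noise-prediction network ε_θ(x_t, t)):

```python
import torch
import torch.nn.functional as F

def training_step(model, x0: torch.Tensor, T: int = 1000) -> torch.Tensor:
    """One DDPM training step: noise a clean batch, predict the noise, take the MSE."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                                  # the noise the model must recover
    xt = q_sample(x0, t, eps)                                   # closed-form jump to x_t
    eps_pred = model(xt, t)                                     # epsilon_theta(x_t, t)
    return F.mse_loss(eps_pred, eps)
```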

[Interactive figure: reverse denoising process, stepping from pure noise to a recovered image. Each step shows the current x_t, the predicted noise ε_θ(x_t, t), and the denoised x_{t-1} = f(x_t, ε_θ(x_t, t)); the model predicts the noise, not the image itself.]

Click “Next Step” to observe: at each step the model predicts the noise component (middle grid), then subtracts it from the current image to produce a cleaner result. After all steps, pure noise is restored into a coherent image.
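For reference, one reverse step can be written directly in terms of the predicted noise. A minimal sketch of DDPM-style ancestral sampling, assuming the betas and alphas_cumprod tensors defined earlier (names are illustrative, and σ_t² = β_t is the simple variance choice from the paper):

```python
import torch

@torch.no_grad()
def p_sample(model, xt: torch.Tensor, t: int) -> torch.Tensor:
    """One reverse step x_t -> x_{t-1}, using the posterior mean implied by the predicted noise."""
    beta_t = betas[t]
    a_bar_t = alphas_cumprod[t]
    eps_pred = model(xt, torch.full((xt.shape[0],), t, device=xt.device))
    # mu_theta(x_t, t) = (x_t - beta_t / sqrt(1 - a_bar_t) * eps_theta) / sqrt(1 - beta_t)
    mean = (xt - beta_t / (1.0 - a_bar_t).sqrt() * eps_pred) / (1.0 - beta_t).sqrt()
    if t == 0:
        return mean                                       # no noise added at the final step
    return mean + beta_t.sqrt() * torch.randn_like(xt)    # sigma_t = sqrt(beta_t)
```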

Noise Schedules

The schedule of β_t significantly impacts generation quality. The original DDPM used a linear schedule (β_t increasing linearly from 0.0001 to 0.02), but subsequent work (Nichol & Dhariwal, 2021) identified problems with this approach.

The linear schedule's flaw: ᾱ_t decays to near zero well before t = T, so a large fraction of the later timesteps sits in a regime where noise already dominates and the signal barely changes; those sampling steps are largely wasted.

The cosine schedule adjusts the decay curve of ᾱ_t to distribute signal-to-noise ratio changes more uniformly across the timeline, making every step count:

\bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos\left( \frac{t/T + s}{1 + s} \cdot \frac{\pi}{2} \right)^2

[Interactive figure: noise schedule comparison, plotting ᾱ_t against timestep t (0 to 1000) for the linear and cosine schedules.]

Hover over the curves to see exact values. The linear schedule (blue) drives ᾱ_t to near zero with hundreds of steps still remaining, while the cosine schedule (green) decays more gradually, making effective use of every sampling step.
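A quick numerical sketch of the two schedules (the small offset s = 0.008 follows Nichol & Dhariwal; the helper names are illustrative):

```python
import math
import torch

def linear_alpha_bar(T: int = 1000) -> torch.Tensor:
    betas = torch.linspace(1e-4, 0.02, T)
    return torch.cumprod(1.0 - betas, 0)

def cosine_alpha_bar(T: int = 1000, s: float = 0.008) -> torch.Tensor:
    t = torch.arange(T + 1, dtype=torch.float32)
    f = torch.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return (f / f[0])[1:]                      # a_bar_t for t = 1..T

# Compare how much signal remains at a few timesteps under each schedule.
for name, a_bar in [("linear", linear_alpha_bar()), ("cosine", cosine_alpha_bar())]:
    print(name, [round(float(a_bar[i]), 4) for i in (99, 499, 699, 899)])
```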

U-Net Backbone

The denoising network ε_θ in diffusion models typically uses a U-Net architecture. This encoder-decoder structure passes information across resolutions via skip connections, making it well-suited for extracting signal from noisy images.

Key design elements of the U-Net:

  • Encoder: Progressive downsampling (64 to 32 to 16 to 8), increasing channel count, extracting high-level semantic features
  • Decoder: Progressive upsampling, fusing encoder detail via skip connections
  • Timestep embedding: t is converted to a vector via sinusoidal positional encoding and injected into each ResBlock (analogous to Transformer positional encoding); a minimal sketch follows this list
  • Attention layers: Self-Attention at low-resolution layers (16x16, 8x8) captures global dependencies
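
A minimal sketch of the sinusoidal timestep embedding (the same construction as Transformer positional encoding; the dimension and frequency base below are common defaults, not values mandated by DDPM):

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 256, max_period: float = 10000.0) -> torch.Tensor:
    """Map integer timesteps of shape (B,) to sinusoidal vectors of shape (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                    # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (B, dim)

# The embedding is usually passed through a small MLP and added to each ResBlock's feature maps.
emb = timestep_embedding(torch.tensor([0, 10, 500]))
print(emb.shape)  # torch.Size([3, 256])
```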
[Interactive diagram: U-Net denoising architecture. Encoder blocks (64×64/64ch → 32×32/128ch → 16×16/256ch → 8×8/512ch), a middle block at 8×8/512ch, and mirrored decoder blocks back up to 64×64/64ch, with skip connections between matching resolutions. Inputs: noisy image x_t and timestep embedding t; output: predicted noise ε_θ.]

Hover over each block for details. Notice how skip connections (orange dashed lines) link encoder and decoder at matching resolutions — these connections allow the decoder to directly access high-resolution features preserved by the encoder during reconstruction.

Conditional Generation and Classifier-Free Guidance

An unconditional diffusion model can only generate random images. To achieve text-guided generation (e.g., “a cat on the moon”), we need conditional generation.

Classifier-Free Guidance (CFG) is the dominant conditioning method today (Ho & Salimans, 2022). The key idea is to jointly train conditional and unconditional denoising: during training, the condition c is randomly replaced with a null condition ∅ (e.g., 10% of the time). At inference, the difference between the conditional and unconditional predictions is amplified to strengthen guidance:

\tilde{\epsilon}_\theta = \epsilon_\theta(x_t, \varnothing) + s \cdot \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right)

where s is the guidance scale:

  • s = 1: equivalent to standard conditional generation — high diversity but may drift from the prompt
  • s = 7–8: Stable Diffusion's typical setting — balanced quality and diversity
  • s > 10: closely follows the prompt but outputs become repetitive and oversaturated
[Interactive figure: classifier-free guidance effect on the prompt "colorful geometric shapes". A slider over the guidance scale s (default 7.5) trades diversity against prompt fidelity across four samples, with s ≈ 7 giving balanced quality.]

Drag the slider to feel the effect of different guidance scales. At low guidance, shapes are diverse but unclear; at moderate guidance, quality peaks; at high guidance, results converge and become over-sharpened.
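At sampling time the guidance rule is one line. A hedged sketch, assuming the denoiser accepts an optional condition (None standing in for ∅); running both predictions in a single batched forward pass is a common optimization omitted here:

```python
import torch

def guided_eps(model, xt: torch.Tensor, t: torch.Tensor, cond, scale: float = 7.5) -> torch.Tensor:
    """Classifier-free guidance: push the conditional prediction away from the unconditional one."""
    eps_uncond = model(xt, t, cond=None)   # epsilon_theta(x_t, null condition)
    eps_cond = model(xt, t, cond=cond)     # epsilon_theta(x_t, c)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```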

Accelerated Sampling: DDIM

DDPM's main drawback is slow sampling: it requires 1000 sequential denoising steps. DDIM (Song et al., 2020) addresses this by reformulating the denoising process as a non-Markovian deterministic mapping, enabling a dramatic reduction in the number of sampling steps.

DDIM’s core modification uses a deterministic update rule:

x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1}} \, \epsilon_\theta(x_t, t)

Because the denoising process becomes deterministic, DDIM can use arbitrary subsequences of timesteps (e.g., only 50 or 20 steps) instead of traversing all 1000 steps. In practice, DDIM with 50 steps approaches DDPM 1000-step quality with a 20x speedup.
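A minimal sketch of one deterministic DDIM step between two (possibly non-adjacent) timesteps, reusing the alphas_cumprod tensor from the earlier examples (an illustrative helper, not the original authors' code):

```python
import torch

@torch.no_grad()
def ddim_step(model, xt: torch.Tensor, t: int, t_prev: int) -> torch.Tensor:
    """Deterministic DDIM update from timestep t down to t_prev (the eta = 0 case)."""
    a_bar_t = alphas_cumprod[t]
    a_bar_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    eps = model(xt, torch.full((xt.shape[0],), t, device=xt.device))
    x0_pred = (xt - (1.0 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()       # predicted clean image
    return a_bar_prev.sqrt() * x0_pred + (1.0 - a_bar_prev).sqrt() * eps

# Usage: walk a short subsequence of timesteps, e.g. 50 evenly spaced values out of 1000,
# calling ddim_step for each consecutive pair instead of traversing all 1000 steps.
```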

Subsequent ODE solvers like DPM-Solver and DPM-Solver++ further compress the step count to 10-25, dramatically improving diffusion model practicality.

Latent Diffusion: The Core of Stable Diffusion

Running diffusion directly in pixel space (e.g., 512x512x3) is computationally prohibitive. Latent Diffusion Models (LDM) (Rombach et al., 2022) innovate by running the diffusion process in a low-dimensional latent space.

LDM’s two-stage design:

  1. Stage one: Train a VAE (Variational Autoencoder) to compress images into low-dimensional latent representations. For example, a 512x512x3 image is encoded to a 64x64x4 latent code — a 64x spatial reduction
  2. Stage two: Train the diffusion model in this 64x64x4 latent space. All noising and denoising occurs in the latent space

The advantages are significant:

  • Computational efficiency: 64x64 is 64x smaller than 512x512; attention computation drops from O(5124)O(512^4) to O(644)O(64^4)
  • Semantic quality: The VAE latent space already encodes semantic information, letting the diffusion model focus on semantic-level generation
  • Modularity: The VAE and diffusion model can be trained and upgraded independently

Stable Diffusion is the canonical LDM implementation: a CLIP text encoder encodes text prompts, which are injected into the U-Net denoising process via cross-attention, and a VAE decoder converts the final latent back into an image.
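Putting the pieces together, an LDM-style text-to-image loop looks roughly like the sketch below. It reuses the alphas_cumprod schedule from earlier; text_encoder, unet, and vae_decoder stand in for the CLIP text encoder, latent-space U-Net, and VAE decoder. This is a schematic under those assumptions, not a specific library's API:

```python
import torch

@torch.no_grad()
def generate(prompt, text_encoder, unet, vae_decoder, steps: int = 50, scale: float = 7.5):
    """LDM-style generation: denoise in the 64x64x4 latent space, decode once at the end."""
    cond = text_encoder(prompt)                   # text embedding, injected via cross-attention
    z = torch.randn(1, 4, 64, 64)                 # start from pure noise in latent space
    ts = torch.linspace(999, 0, steps).long()     # a short DDIM subsequence of the 1000 timesteps
    for i, t in enumerate(ts):
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        eps_u = unet(z, t, cond=None)             # unconditional prediction
        eps_c = unet(z, t, cond=cond)             # conditional prediction
        eps = eps_u + scale * (eps_c - eps_u)     # classifier-free guidance
        z0 = (z - (1.0 - a_bar).sqrt() * eps) / a_bar.sqrt()
        z = a_bar_prev.sqrt() * z0 + (1.0 - a_bar_prev).sqrt() * eps   # deterministic DDIM step
    return vae_decoder(z)                         # map the final latent back to a 512x512 image
```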

Summary

Diffusion models decompose complex generation into a series of simple denoising steps through a “gradual noising then learning to denoise” framework. Key takeaways:

  • Forward process: Fixed noising Markov chain; any timestep can be sampled in one step
  • Training objective: Predict noise (simple MSE loss); no adversarial training needed
  • Noise schedules: Cosine schedule is more efficient than linear, distributing SNR changes uniformly
  • U-Net architecture: Encoder-decoder + skip connections + timestep injection
  • CFG: Amplify conditional/unconditional difference for controllable generation
  • DDIM: Deterministic sampling, compressing steps from 1000 down to 20-50
  • Latent Diffusion: Run diffusion in latent space for dramatically improved efficiency

These fundamentals are prerequisites for understanding advanced topics — including the Diffusion Transformer (DiT) that replaces U-Net with Transformers, and extensions to video generation.