Diffusion Model Fundamentals: Generating from Noise
Updated 2026-04-12
Introduction: A New Paradigm in Generative Modeling
In 2020, Ho et al. proposed a counterintuitive generation method in Denoising Diffusion Probabilistic Models (DDPM): gradually corrupt data into pure noise, then train a neural network to reverse this process. This simple idea gave rise to Stable Diffusion, DALL-E, Midjourney, and a wave of products that transformed creative workflows.
The core insight of diffusion models can be stated in one sentence: if you know how to turn an image into noise, you can learn how to recover an image from noise. This “destroy-then-restore” framework gives diffusion models stable training and excellent generation quality, allowing them to rapidly surpass GANs as the dominant approach for image generation.
The Generative Model Family
Before diving into diffusion models, let’s review the major generative paradigms:
| Model | Core Idea | Strengths | Weaknesses |
|---|---|---|---|
| GAN | Generator vs. discriminator adversarial training | Fast generation, high quality | Unstable training, mode collapse |
| VAE | Encode to latent space, then decode | Stable training, probabilistic | Blurry outputs |
| Flow | Invertible transforms, exact likelihood | Exact density estimation | Architectural constraints, expensive |
| Diffusion | Iterative denoising | Stable training, highest quality | Slow sampling |
Diffusion models have a clean training objective (predict the noise), require no adversarial training, and impose no invertibility constraints on the architecture. This simplicity is a key factor in their success.
Forward Diffusion: Gradual Noise Addition
The forward process is a fixed Markov chain that adds small amounts of Gaussian noise at each step. After $T$ steps, any data distribution converges to a standard Gaussian distribution.
Mathematically, each step is defined as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)$$

where $\beta_t$ is the noise schedule parameter. The crucial property is that we can jump directly from $x_0$ to any timestep $t$:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, \mathbf{I}\right)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$. This means training doesn't require sequential noising: just randomly sample a timestep $t$ and directly compute $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.
At $t = 0$ the image is intact ($\bar{\alpha}_t = 1$); as $t$ increases, the signal is progressively drowned by noise; at $t = T$ the image becomes pure noise ($\bar{\alpha}_t \approx 0$).
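The closed-form jump above is easy to sketch in code. The snippet below is a minimal illustration using NumPy and the original DDPM linear schedule values; function names are my own, not from any library:

```python
import numpy as np

def make_linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule from the original DDPM paper."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)  # \bar{alpha}_t = prod_s (1 - beta_s)
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Jump directly from x0 to x_t: sqrt(ab)*x0 + sqrt(1-ab)*eps."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bar[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps

rng = np.random.default_rng(0)
betas, alpha_bar = make_linear_schedule()
x0 = rng.standard_normal((8, 8))           # toy "image"
x_mid, _ = q_sample(x0, 500, alpha_bar, rng)
print(alpha_bar[0], alpha_bar[-1])         # near 1 at t=0, near 0 at t=T
```

Note that no network is involved here: the forward process is fixed, which is what lets us generate training pairs $(x_t, \epsilon)$ in a single step.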
Reverse Denoising: Recovering from Noise
Generation in diffusion models is simply the reverse of the forward process: starting from pure noise $x_T \sim \mathcal{N}(0, \mathbf{I})$, iteratively denoise until a clean image is recovered.
Each reverse step is parameterized as:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\right)$$
DDPM’s key insight: instead of predicting the denoised image directly, have the model predict the noise $\epsilon$ that was added at the current timestep. The training loss is remarkably simple:

$$L_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon,\, t}\!\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$

A straightforward MSE between the model’s predicted noise $\epsilon_\theta(x_t, t)$ and the actual noise $\epsilon$ that was added.
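A single training step then boils down to: sample a timestep, noise the image in closed form, and regress the model's output against the noise. The sketch below uses a zero-output placeholder where a real U-Net $\epsilon_\theta$ would go:

```python
import numpy as np

def ddpm_loss(eps_pred, eps_true):
    """L_simple: mean squared error between predicted and true noise."""
    return np.mean((eps_pred - eps_true) ** 2)

rng = np.random.default_rng(1)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

# One hypothetical training step on a toy batch:
x0 = rng.standard_normal((4, 8, 8))        # batch of "images"
t = rng.integers(0, 1000, size=4)          # random timestep per sample
eps = rng.standard_normal(x0.shape)        # the noise we add
ab = alpha_bar[t][:, None, None]
x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps

eps_pred = np.zeros_like(eps)              # placeholder for eps_theta(x_t, t)
loss = ddpm_loss(eps_pred, eps)
```

With the zero placeholder, the loss is simply the mean squared magnitude of the Gaussian noise (close to 1); training a real network drives this below the trivial baseline.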
At each step the model predicts the noise component, then subtracts it from the current image to produce a cleaner result. After all $T$ steps, pure noise is restored into a coherent image.
Noise Schedules
The schedule of $\beta_t$ significantly impacts generation quality. The original DDPM used a linear schedule ($\beta_t$ increasing linearly from 0.0001 to 0.02), but subsequent work (Nichol & Dhariwal, 2021) identified problems with this approach.
The linear schedule flaw: $\bar{\alpha}_t$ drops to near zero well before $t = T$, meaning many of the later sampling steps are “wasted” in a regime where noise is already dominant and the signal changes minimally.
The cosine schedule adjusts the decay curve of $\bar{\alpha}_t$ to distribute signal-to-noise ratio changes more uniformly across the timeline, making every step count:

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$$

where $s$ is a small offset (0.008 in the original paper) that keeps $\beta_t$ from being too small near $t = 0$.
The linear schedule causes $\bar{\alpha}_t$ to decay to near-zero long before the final timestep, while the cosine schedule decays more gradually, utilizing every sampling step effectively.
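The contrast is easy to verify numerically. A minimal sketch comparing $\bar{\alpha}_t$ under the two schedules (function names are my own; the cosine formula follows Nichol & Dhariwal, 2021):

```python
import numpy as np

def linear_alpha_bar(T=1000):
    """alpha_bar under the original DDPM linear beta schedule."""
    betas = np.linspace(1e-4, 0.02, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar under the cosine schedule, alpha_bar(t) = f(t)/f(0)."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return (f / f[0])[1:]

lin, cos = linear_alpha_bar(), cosine_alpha_bar()
# Midway through the trajectory, the linear schedule has already
# collapsed toward zero while the cosine schedule retains signal:
print(lin[600], cos[600])
```

Plotting both curves over $t$ reproduces the figure described above: the blue linear curve hugs zero for a large fraction of the timeline, the green cosine curve does not.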
U-Net Backbone
The denoising network in diffusion models typically uses a U-Net architecture. This encoder-decoder structure passes information across resolutions via skip connections, making it well-suited for extracting signal from noisy images.
Key design elements of the U-Net:
- Encoder: Progressive downsampling (64 to 32 to 16 to 8), increasing channel count, extracting high-level semantic features
- Decoder: Progressive upsampling, fusing encoder detail via skip connections
- Timestep embedding: $t$ is converted to a vector via sinusoidal positional encoding and injected into each ResBlock (analogous to Transformer positional encoding)
- Attention layers: Self-Attention at low-resolution layers (16x16, 8x8) captures global dependencies
Skip connections link encoder and decoder at matching resolutions; these connections allow the decoder to directly access high-resolution features preserved by the encoder during reconstruction.
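The timestep embedding mentioned above is worth seeing concretely. This is a hedged sketch of the standard sinusoidal construction (half sine, half cosine at geometrically spaced frequencies); the function name and dimensions are illustrative:

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000):
    """Sinusoidal embedding of a scalar timestep t, as in Transformer
    positional encoding: frequencies decay geometrically from 1 to
    1/max_period, giving each timestep a unique smooth code."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(500)
print(emb.shape)
```

In a real U-Net this vector is typically passed through a small MLP and added to each ResBlock's activations, so every layer knows how noisy its input is.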
Conditional Generation and Classifier-Free Guidance
An unconditional diffusion model can only generate random images. To achieve text-guided generation (e.g., “a cat on the moon”), we need conditional generation.
Classifier-Free Guidance (CFG) is the dominant conditioning method today (Ho & Salimans, 2022). The key idea is to jointly train conditional and unconditional denoising: during training, the condition is randomly replaced with a null condition (e.g., 10% of the time). At inference, the difference between conditional and unconditional predictions is amplified to strengthen guidance:

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \cdot \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right)$$
where $w$ is the guidance scale:
- $w = 1$: equivalent to standard conditional generation; high diversity but may drift from the prompt
- $w = 7.5$: Stable Diffusion’s typical setting; balanced quality and diversity
- $w > 10$: closely follows the prompt, but outputs become repetitive and oversaturated
At low guidance, shapes are diverse but unclear; at moderate guidance, quality peaks; at high guidance, results converge and become over-sharpened.
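The CFG combination itself is a one-liner. A minimal sketch, with random arrays standing in for the two network predictions (names are illustrative):

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance: start from the unconditional prediction
    and move w times along the direction toward the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(2)
eps_c = rng.standard_normal((8, 8))   # stand-in for eps_theta(x_t, c)
eps_u = rng.standard_normal((8, 8))   # stand-in for eps_theta(x_t, null)

guided = cfg_noise(eps_c, eps_u, 7.5)
# w = 1 recovers plain conditional prediction; w = 0 is unconditional:
assert np.allclose(cfg_noise(eps_c, eps_u, 1.0), eps_c)
assert np.allclose(cfg_noise(eps_c, eps_u, 0.0), eps_u)
```

Note the practical cost: each sampling step now requires two network evaluations (conditional and unconditional), roughly doubling inference compute.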
Accelerated Sampling: DDIM
DDPM’s main drawback is slow sampling — requiring 1000 sequential denoising steps. DDIM (Song et al., 2020) addresses this by reformulating the denoising process as a non-Markovian deterministic mapping, enabling dramatic reduction in sampling steps.
DDIM’s core modification uses a deterministic update rule:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, t)$$

The term in parentheses is the model’s current estimate of $x_0$, which is then re-noised to the previous timestep’s noise level.
Because the denoising process becomes deterministic, DDIM can use arbitrary subsequences of timesteps (e.g., only 50 or 20 steps) instead of traversing all 1000 steps. In practice, DDIM with 50 steps approaches DDPM 1000-step quality with a 20x speedup.
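One appealing property of the deterministic update is that it is exact when the noise prediction is exact, which makes it easy to sanity-check. A minimal sketch of a single DDIM step (eta = 0), with my own function name:

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """Deterministic DDIM update: estimate x0 from the predicted noise,
    then re-noise it to the previous timestep's noise level."""
    x0_pred = (x_t - np.sqrt(1 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1 - ab_prev) * eps_pred

# Sanity check: construct x_t from a known x0 and noise, then step back
# using the *true* noise as a perfect prediction.
rng = np.random.default_rng(3)
x0 = rng.standard_normal((8, 8))
eps = rng.standard_normal((8, 8))
ab_t, ab_prev = 0.5, 0.9
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)
# The step lands exactly on the closed-form x at the earlier noise level:
assert np.allclose(x_prev, np.sqrt(ab_prev) * x0 + np.sqrt(1 - ab_prev) * eps)
```

Because `ab_t` and `ab_prev` are just looked up from the schedule, nothing forces them to be adjacent timesteps; this is what permits skipping, e.g., from step 1000 to 980 to 960.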
Subsequent ODE solvers like DPM-Solver and DPM-Solver++ further compress the step count to 10-25, dramatically improving diffusion model practicality.
Latent Diffusion: The Core of Stable Diffusion
Running diffusion directly in pixel space (e.g., 512x512x3) is computationally prohibitive. Latent Diffusion Models (LDM) (Rombach et al., 2022) innovate by running the diffusion process in a low-dimensional latent space.
LDM’s two-stage design:
- Stage one: Train a VAE (Variational Autoencoder) to compress images into low-dimensional latent representations. For example, a 512x512x3 image is encoded to a 64x64x4 latent code — a 64x spatial reduction
- Stage two: Train the diffusion model in this 64x64x4 latent space. All noising and denoising occurs in the latent space
The advantages are significant:
- Computational efficiency: a 64x64 latent has 64x fewer spatial positions than a 512x512 image, and the cost of self-attention, quadratic in the number of positions, drops from $O\!\left((512^2)^2\right)$ to $O\!\left((64^2)^2\right)$
- Semantic quality: The VAE latent space already encodes semantic information, letting the diffusion model focus on semantic-level generation
- Modularity: The VAE and diffusion model can be trained and upgraded independently
Stable Diffusion is the canonical LDM implementation: a CLIP text encoder encodes text prompts, which are injected into the U-Net denoising process via cross-attention, and a VAE decoder converts the final latent back into an image.
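To make the two-stage data flow concrete, here is a shape-level sketch of such a pipeline. The encoder, U-Net, and decoder are placeholder stubs standing in for real trained models; only the tensor shapes and the loop structure reflect the actual design:

```python
import numpy as np

def vae_encode(image):                 # (512, 512, 3) -> (64, 64, 4)
    """Stub for the VAE encoder (stage one)."""
    return np.zeros((64, 64, 4))

def unet_denoise(z_t, t, text_emb):    # operates entirely in latent space
    """Stub for one conditioned U-Net denoising step (stage two)."""
    return z_t * 0.99                  # placeholder, not a real update

def vae_decode(z):                     # (64, 64, 4) -> (512, 512, 3)
    """Stub for the VAE decoder."""
    return np.zeros((512, 512, 3))

text_emb = np.zeros((77, 768))         # CLIP-style token embeddings (assumed shape)
z = np.random.default_rng(4).standard_normal((64, 64, 4))  # start from latent noise
for t in reversed(range(50)):          # e.g. 50 DDIM steps, all in latent space
    z = unet_denoise(z, t, text_emb)
image = vae_decode(z)                  # single decode at the very end
print(image.shape)
```

The key point the sketch makes: the expensive iterative loop never touches pixel space; the VAE decoder runs exactly once, after denoising finishes.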
Summary
Diffusion models decompose complex generation into a series of simple denoising steps through a “gradual noising then learning to denoise” framework. Key takeaways:
- Forward process: Fixed noising Markov chain; any timestep can be sampled in one step
- Training objective: Predict noise (simple MSE loss); no adversarial training needed
- Noise schedules: Cosine schedule is more efficient than linear, distributing SNR changes uniformly
- U-Net architecture: Encoder-decoder + skip connections + timestep injection
- CFG: Amplify conditional/unconditional difference for controllable generation
- DDIM: Deterministic sampling, compressing steps from 1000 down to 20-50
- Latent Diffusion: Run diffusion in latent space for dramatically improved efficiency
These fundamentals are prerequisites for understanding advanced topics — including the Diffusion Transformer (DiT) that replaces U-Net with Transformers, and extensions to video generation.