
Diffusion Transformer: Image Generation with Transformers

Updated 2026-04-12

Introduction: From U-Net to Transformer

The backbone of diffusion models has long been the U-Net — a CNN encoder-decoder with skip connections. From DDPM to Stable Diffusion v1/v2, U-Net was the standard denoising network.

But experience from NLP tells us that Transformers scale far better than CNNs. In 2023, Peebles and Xie proposed the Diffusion Transformer (DiT) in their paper “Scalable Diffusion Models with Transformers”, replacing the U-Net with a standard Transformer as the diffusion backbone. This seemingly simple substitution revealed an important finding: diffusion model quality scales continuously with Transformer compute, following scaling laws similar to those of LLMs.

DiT’s impact is far-reaching: OpenAI’s Sora video generation model adopts the DiT architecture (explicitly mentioned in OpenAI’s technical report), and Stability AI’s Stable Diffusion 3 uses MM-DiT, a DiT variant. Transformers are unifying generation across text, image, and video modalities.

U-Net Bottlenecks

U-Net achieved great success in diffusion models, but has inherent limitations:

  1. Resolution-bound: U-Net’s downsampling/upsampling paths are tightly coupled to input resolution. Changing generation resolution typically requires architecture modifications.

  2. Limited scaling: Increasing U-Net capacity mainly involves adding channels or ResNet blocks, but this scaling approach has diminishing returns beyond a certain scale.

  3. Insufficient global modeling: While U-Net includes Self-Attention layers (on low-resolution feature maps), CNN’s local inductive bias makes it less effective than a Transformer at modeling long-range dependencies.

  4. Inflexible conditioning: U-Net primarily injects text conditioning through cross-attention, with injection points scattered across different resolution levels.

These limitations prompted researchers to ask: can we directly replace U-Net with a Transformer?

DiT Architecture

DiT’s core idea is remarkably simple: treat the noisy latent as a token sequence and process it with a standard Transformer. This mirrors how ViT handles images — patchify first, then apply Transformer.

[Figure: U-Net vs DiT architecture comparison. U-Net (traditional diffusion): the noisy latent z_t passes through stacked encoders, a bottleneck, and decoders linked by skip connections to produce the predicted noise ε (CNN-based, fixed resolution, limited scaling). DiT (Transformer diffusion): the noisy latent z_t is patchified, processed by N Transformer blocks conditioned on (t, c), and unpatchified to produce ε (Transformer-based, flexible, scales with compute).]

Feature      | U-Net                      | DiT
Backbone     | CNN (convolution)          | Transformer (attention)
Resolution   | Fixed (architecture-bound) | Flexible (adjustable patch size)
Scaling      | Limited (arch bottleneck)  | Continuous (more layers/heads)
Conditioning | Cross-Attention            | adaLN-Zero

Patchify, Process, Unpatchify

The complete DiT pipeline has four steps:

  1. Noisy latent: The input is a latent-space representation z_t from a VAE encoder (32×32×4), with noise added per the diffusion schedule
  2. Patchify: Split the latent into p×p patches (p=2 in the paper), yielding (32/2) × (32/2) = 256 tokens, each linearly projected to dimension D (see the sketch below)
  3. Transformer processing: N DiT Blocks process the token sequence, with timestep t and class label c conditioning injected via adaLN-Zero
  4. Unpatchify: Rearrange processed tokens back to spatial dimensions, producing the predicted noise ε
[Figure: DiT pipeline. The 32×32×4 noisy latent z_t (from the VAE encoder plus the noise schedule) is patchified (2×2) into 256 tokens, processed by N DiT Blocks conditioned on the [t, c] embedding via adaLN-Zero, and unpatchified back to a 32×32×4 output.]

Note that DiT operates in latent space, not pixel space. This is consistent with Latent Diffusion Models (LDM, the foundation of Stable Diffusion): a VAE first compresses images to a low-dimensional latent space, then diffusion happens in that latent space.
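
To make the patchify/unpatchify bookkeeping concrete, here is a minimal illustrative sketch (not the official DiT code) for a 32×32×4 latent with patch size 2; the linear projection to hidden dimension D and the Transformer blocks in between are omitted:

```python
import torch

def patchify(z, p=2):
    """(B, C, H, W) noisy latent -> (B, N, C*p*p) token sequence."""
    B, C, H, W = z.shape
    z = z.reshape(B, C, H // p, p, W // p, p)
    z = z.permute(0, 2, 4, 1, 3, 5)                      # (B, H/p, W/p, C, p, p)
    return z.reshape(B, (H // p) * (W // p), C * p * p)  # e.g. (B, 256, 16)

def unpatchify(tokens, p=2, C=4, H=32, W=32):
    """(B, N, C*p*p) tokens -> (B, C, H, W) spatial output (e.g. predicted noise)."""
    B = tokens.shape[0]
    x = tokens.reshape(B, H // p, W // p, C, p, p)
    x = x.permute(0, 3, 1, 4, 2, 5)                      # (B, C, H/p, p, W/p, p)
    return x.reshape(B, C, H, W)

z_t = torch.randn(1, 4, 32, 32)              # noisy VAE latent
tokens = patchify(z_t)                        # (1, 256, 16); a real DiT projects this to dim D
assert torch.equal(unpatchify(tokens), z_t)   # pure rearrangement, fully invertible
```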

adaLN-Zero: Optimal Conditioning

The DiT paper explored four methods for injecting conditioning (timestep t + class label c) into the Transformer, finding adaLN-Zero to be the best.

What is adaLN-Zero?

Standard LayerNorm has learnable parameters γ and β, but they are the same for every input. Adaptive LayerNorm (adaLN) instead makes these parameters functions of the conditioning:

adaLN(h, y) = γ(y) ⊙ LayerNorm(h) + β(y)

where y is the conditioning embedding (timestep + class label processed by an MLP).

adaLN-Zero adds a gating mechanism: the output of each Transformer block’s attention and FFN sub-layers is scaled by a conditioning-dependent gate α(y), whose regression layer is initialized so that α starts at zero:

h ← h + α(y) ⊙ Attention(adaLN(h, y))

Initializing α to zero means: at the start of training, every DiT block is an identity function (output equals input). This simple initialization trick significantly stabilizes training and is a key factor in DiT’s performance.

A single MLP regresses six modulation parameters (γ₁, β₁, α₁, γ₂, β₂, α₂) from the conditioning embedding y, used for adaLN and gating in the attention and FFN sub-layers respectively.
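
Concretely, a minimal PyTorch sketch of one DiT block with adaLN-Zero might look like the following (an illustration, not the official implementation; it follows the common convention of modulating with (1 + γ) so that the zero-initialized MLP leaves the LayerNorm output unscaled at the start):

```python
import torch
import torch.nn as nn

class DiTBlockAdaLNZero(nn.Module):
    """One DiT block with adaLN-Zero conditioning (simplified sketch)."""
    def __init__(self, dim, num_heads):
        super().__init__()
        # elementwise_affine=False: gamma/beta come from the conditioning, not the LayerNorm
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # a single MLP regresses the 6 modulation vectors from the conditioning embedding y
        self.ada_mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada_mlp[-1].weight)  # zero-init => all gates alpha start at 0,
        nn.init.zeros_(self.ada_mlp[-1].bias)    # so every block begins as the identity

    def forward(self, h, y):
        # h: (B, N, dim) tokens; y: (B, dim) conditioning (timestep + class label)
        g1, b1, a1, g2, b2, a2 = self.ada_mlp(y).chunk(6, dim=-1)
        # attention sub-layer: h <- h + alpha1 * Attn(adaLN(h, y))
        x = (1 + g1[:, None]) * self.norm1(h) + b1[:, None]
        h = h + a1[:, None] * self.attn(x, x, x, need_weights=False)[0]
        # FFN sub-layer: h <- h + alpha2 * FFN(adaLN(h, y))
        x = (1 + g2[:, None]) * self.norm2(h) + b2[:, None]
        h = h + a2[:, None] * self.ffn(x)
        return h
```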

[Figure: adaLN-Zero mechanism. An MLP maps the conditioning (timestep t, class c) to six parameters (γ₁, β₁, α₁, γ₂, β₂, α₂); γ₁/β₁ and γ₂/β₂ modulate the LayerNorms before Self-Attention and the FFN, while the gates α₁/α₂ scale their outputs and are initialized to 0, so each block starts as the identity (α₁ = α₂ = 0 ⇒ h′ = h). Conditioning methods compared in the paper: in-context (concatenate to the sequence), cross-attention (extra attention layer), adaLN (modulate LN), adaLN-Zero (adaLN + zero-initialized gate).]

Scaling Properties

DiT’s most important finding is that diffusion models follow Transformer scaling laws. The authors tested four model sizes on ImageNet 256×256 class-conditional generation:

Model    | Params | Layers | Hidden Dim | Attn Heads | FID↓
DiT-S/2  | 33M    | 12     | 384        | 6          | 68.4
DiT-B/2  | 130M   | 12     | 768        | 12         | 43.5
DiT-L/2  | 458M   | 24     | 1024       | 16         | 23.3
DiT-XL/2 | 675M   | 28     | 1152       | 16         | 9.62 (2.27 w/ CFG)

Two key observations:

  1. Larger models yield lower FID: Performance improves monotonically with compute, showing no saturation
  2. More compute-efficient than U-Net: DiT-XL/2 achieves FID 9.62 at ~119 GFLOPs, while the U-Net baseline ADM requires 1120+ GFLOPs
[Figure: DiT scaling, FID vs compute (GFLOPs, log scale). The DiT family (S/2, B/2, L/2, XL/2) reaches comparable or better FID than the U-Net baselines (LDM-4, ADM, ADM-U) at far less compute, showing that Transformer scaling is more efficient.]

The significance of this scaling property: more compute consistently yields better generation quality. This mirrors the scaling behavior of the GPT series in language modeling, providing a clear path to building larger, more capable generative models.

MM-DiT: Stable Diffusion 3’s Dual-Stream Architecture

In 2024, Esser et al. proposed MM-DiT (Multimodal DiT) in the Stable Diffusion 3 paper “Scaling Rectified Flow Transformers for High-Resolution Image Synthesis”, extending DiT into a dual-stream architecture for multimodal processing.

Core Design

MM-DiT’s key innovation is dual streams + joint attention:

  • Two independent streams: Text tokens (from T5 and CLIP encoders) and image latent tokens (from VAE) each have independent embedding layers and MLPs
  • Joint attention: Within each MM-DiT block, tokens from both streams are concatenated for shared Self-Attention — text and image tokens interact in the same attention space
  • Separate MLPs: Attention is shared, but FFNs remain independent — preserving modality-specific feature transformation capabilities

This design is more powerful than simple cross-attention: text and image interact bidirectionally on equal footing, rather than only having images attend to text as in traditional approaches.

[Figure: MM-DiT dual streams. Text tokens (from the T5 + CLIP encoders) and image latent tokens (from the VAE encoder) enter two independent streams; in each MM-DiT block (× N layers) the streams are concatenated for joint attention in a shared Q·K·V space, then split back into separate text and image MLPs. After the final block the text tokens are discarded and the image tokens yield the predicted noise ε.]
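
A simplified sketch of one MM-DiT block under these assumptions (for illustration only; the real SD3 block additionally gives each stream its own Q/K/V projections, adaLN-style modulation, and QK-normalization, whereas a single shared nn.MultiheadAttention stands in here):

```python
import torch
import torch.nn as nn

class MMDiTBlock(nn.Module):
    """Dual-stream block: joint attention over concatenated tokens, separate MLPs."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)
        self.joint_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # separate per-modality FFNs preserve modality-specific feature transformation
        self.mlp_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_img = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, txt, img):
        # txt: (B, T_txt, dim) text tokens; img: (B, T_img, dim) image latent tokens
        t_len = txt.shape[1]
        # concatenate both streams so text and image attend to each other symmetrically
        joint = torch.cat([self.norm_txt(txt), self.norm_img(img)], dim=1)
        attn_out, _ = self.joint_attn(joint, joint, joint, need_weights=False)
        txt = txt + attn_out[:, :t_len]
        img = img + attn_out[:, t_len:]
        # split back: each modality gets its own FFN
        txt = txt + self.mlp_txt(txt)
        img = img + self.mlp_img(img)
        return txt, img
```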

Additional Improvements

SD3/MM-DiT also introduces several important technical advances:

  • Rectified Flow: Uses straight-line trajectories between data and noise instead of traditional diffusion paths, enabling faster sampling (see the sketch after this list)
  • QK-Normalization: Normalizes attention Q and K vectors for improved training stability
  • Multiple text encoders: Simultaneously uses CLIP-L, CLIP-G, and T5-XXL as text encoders
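
As a rough sketch of the rectified-flow objective in the common velocity-prediction form: the noisy sample is a linear interpolation between data x0 and noise ε, and the network regresses the constant velocity ε − x0 along that straight line. This is an illustration, not SD3’s exact training code, and the model signature here is hypothetical.

```python
import torch

def rectified_flow_loss(model, x0, cond):
    """One rectified-flow training step: straight-line path from data x0 to noise."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)          # uniform time in [0, 1]
    eps = torch.randn_like(x0)                   # Gaussian noise endpoint
    t_ = t.view(b, *([1] * (x0.dim() - 1)))      # broadcast t over latent dims
    z_t = (1 - t_) * x0 + t_ * eps               # linear interpolation between data and noise
    v_target = eps - x0                          # constant velocity along the line
    v_pred = model(z_t, t, cond)                 # hypothetical backbone signature
    return ((v_pred - v_target) ** 2).mean()     # simple MSE on the velocity
```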

Summary

The Diffusion Transformer story can be summarized in one sentence: replace U-Net with Transformer, and diffusion models gain scaling laws.

Key takeaways:

  1. DiT replaces U-Net with Transformer as the denoising network: patchify → Transformer → unpatchify
  2. adaLN-Zero is the optimal conditioning method — adaptive LayerNorm with zero-initialized gating injects timestep and class conditions
  3. Scaling laws hold: Larger models and more compute yield continuously lower FID with no saturation
  4. MM-DiT extends DiT into a multimodal dual-stream architecture supporting deep text-image interaction
  5. DiT architecture has been adopted by Sora, SD3, and other frontier models, becoming the new standard backbone for generative models

From CNN to Transformer, diffusion models have embarked on the same scaling journey as LLMs. This means that in image and video generation, “bigger is better” applies just as well — Transformer once again proves its status as a universal compute engine.