Music Generation: When Transformers Learn to Compose
Updated 2026-04-12
Introduction: The Unique Challenges of Music Generation
In the previous article, we saw how Whisper and VALL-E use Transformers to handle speech. But music generation is a fundamentally different challenge:
- Music spans a far wider frequency range than speech (20Hz-20kHz vs 85Hz-8kHz)
- A typical song lasts 3-5 minutes, far longer than speech segments
- Music contains multiple simultaneous instrument parts, not just a single voice
- Musical structure (beats, chord progressions, sections) is entirely different from linguistic structure
These characteristics mean that directly transferring speech model approaches to music generation is not feasible. This article explores how MusicGen, Jukebox, MusicLM, and others address these challenges.
Music vs Speech: Signal Comparison
The complexity of music signals directly impacts modeling strategies: a wider frequency range demands finer spectral representation; longer duration significantly increases sequence length; multiple sources mean the model must capture harmonic relationships between instruments. These constraints drive specialized architecture designs.
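To make the sequence-length pressure concrete, here is a back-of-the-envelope calculation. The 50 Hz frame rate and 4 codebooks below match the EnCodec configuration MusicGen uses for 32 kHz audio; treat them as representative values, not universal ones:

```python
# Back-of-the-envelope sequence lengths for a 3-minute track, assuming an
# EnCodec-style codec at a 50 Hz frame rate with 4 RVQ codebooks (the
# configuration MusicGen uses for 32 kHz audio).
frame_rate_hz = 50           # codec frames per second
num_codebooks = 4            # parallel residual codebook layers
duration_s = 3 * 60          # a typical song length

frames = frame_rate_hz * duration_s    # 9,000 codec frames
tokens = frames * num_codebooks        # 36,000 discrete tokens to model
print(f"{frames:,} frames -> {tokens:,} tokens")
```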
MusicGen: Efficient Single-Stage Generation
MusicGen (Meta, 2023) is currently the most influential open-source music generation model. Its core innovation: using a single Transformer and a clever Delay Pattern to solve the multi-codebook parallel generation problem.
Delay Pattern: Multi-Codebook Interleaving
After encoding music with EnCodec, we get multiple codebook layers (typically 4). The naive alternatives are costly: a "flattening pattern" serializes every layer's tokens into one sequence, multiplying the sequence length by the number of codebooks, while fully parallel prediction requires a separate model or forward pass per layer.
MusicGen’s Delay Pattern offsets each codebook layer by one timestep: layer 1 starts at timestep t, layer 2 at t+1, and so on. This allows all layers to be interleaved within a single Transformer, forming a diagonal fill pattern.
The elegance of the Delay Pattern lies in this: at any timestep t, the tokens the model predicts for the different codebooks come from different original time positions, avoiding causal dependency conflicts between layers and enabling a single autoregressive model to handle all codebook layers simultaneously.
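A minimal NumPy sketch of that interleaving follows. The function names are illustrative rather than MusicGen's actual code, and the real model fills the delayed positions with a dedicated special token rather than the -1 used here:

```python
import numpy as np

def apply_delay_pattern(codes: np.ndarray, pad_token: int) -> np.ndarray:
    """Shift codebook layer k right by k steps, padding the gaps.

    codes: (K, T) grid of codec tokens, one row per codebook layer.
    Returns a (K, T + K - 1) grid where row k is delayed by k timesteps,
    so a single autoregressive model can predict one column per step.
    """
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad_token, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

def undo_delay_pattern(delayed: np.ndarray) -> np.ndarray:
    """Invert apply_delay_pattern to recover the aligned (K, T) grid."""
    K = delayed.shape[0]
    T = delayed.shape[1] - (K - 1)
    return np.stack([delayed[k, k:k + T] for k in range(K)])

codes = np.arange(12).reshape(4, 3)   # toy grid: 4 codebooks x 3 frames
delayed = apply_delay_pattern(codes, pad_token=-1)
assert (undo_delay_pattern(delayed) == codes).all()
```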
MusicGen Generation Pipeline
MusicGen’s complete pipeline has four stages: text/melody condition input, T5 encoder processing text, Transformer decoder generating interleaved codec tokens, and EnCodec decoder converting tokens back to waveform.
MusicGen supports three conditioning modes (a usage sketch follows this list):
- Text only: T5 encoder converts text descriptions (e.g., “upbeat electronic dance track”) into cross-attention conditions
- Text + melody: Beyond text, a reference melody audio can be provided; the model maintains the melodic structure while matching the style described by text
- Unconditional: Free generation without any conditions
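Here is a usage sketch of the three modes based on Meta's audiocraft library; the calls follow its published README, but check the current API before relying on it, as it may have changed between versions:

```python
# Usage sketch for MusicGen's three conditioning modes via audiocraft.
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=8)  # seconds of audio to generate

# Text only: description -> cross-attention condition via T5
wav = model.generate(['upbeat electronic dance track'])

# Text + melody: keep the reference tune, restyle it per the description
melody, sr = torchaudio.load('reference_melody.wav')
wav = model.generate_with_chroma(['lo-fi hip hop version'], melody[None], sr)

# Unconditional: free generation
wav = model.generate_unconditional(num_samples=1)
```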
Jukebox: Multi-Scale VQ-VAE Approach
Jukebox (OpenAI, 2020) was a pioneering work in music generation. It took a completely different approach:
- Multi-scale VQ-VAE: Encodes raw audio into three discrete representation layers at different compression ratios (top/middle/bottom), with compression factors of 128x, 32x, and 8x respectively
- Hierarchical Transformers: Generates autoregressively starting from the coarsest top level, then progressively upsamples to finer levels
- Direct raw audio modeling: Works directly on 44.1kHz waveforms without Mel spectrograms or pretrained codecs
Jukebox can generate complete songs with lyrics, with impressive quality. But it has a critical weakness: extremely slow generation — producing one minute of audio requires approximately 9 hours of computation. This is because raw audio sequences are enormous (44,100 × 60 ≈ 2.6M samples per minute), remaining long even after VQ-VAE compression.
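The arithmetic is easy to verify; the compression factors listed above give these per-minute token counts:

```python
# Sequence lengths for one minute of 44.1 kHz audio at Jukebox's three
# VQ-VAE levels (compression factors 8x, 32x, 128x).
sample_rate = 44_100
samples = sample_rate * 60                      # 2,646,000 raw samples

for name, factor in [('bottom', 8), ('middle', 32), ('top', 128)]:
    print(f"{name:>6}: {samples // factor:>8,} tokens")
# bottom:  330,750 tokens   (still far beyond typical Transformer context)
# middle:   82,687 tokens
#    top:   20,671 tokens
```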
MusicLM and Stable Audio
MusicLM (Google, 2023)
MusicLM introduced MuLan — a music-text joint embedding model (analogous to CLIP for image-text). Its hierarchical generation strategy (sketched in code below):
- Map the text description to MuLan tokens, then generate high-level semantic tokens conditioned on them
- Generate SoundStream acoustic tokens (low-level detail) from the MuLan and semantic tokens
- The SoundStream decoder reconstructs the waveform
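Since the model was never released, the following is only a structural schematic with placeholder functions, mirroring the paper's cascade rather than any real implementation:

```python
# Schematic of MusicLM's cascade; every function below is a placeholder.
def mulan_text_tokens(text):
    ...  # MuLan text tower: description -> joint-embedding tokens

def semantic_stage(mulan):
    ...  # AR Transformer: MuLan tokens -> w2v-BERT-style semantic tokens

def acoustic_stage(mulan, semantic):
    ...  # AR Transformer: conditioning -> SoundStream acoustic tokens

def soundstream_decode(acoustic):
    ...  # neural codec decoder: tokens -> waveform

def text_to_music(text):
    mulan = mulan_text_tokens(text)
    semantic = semantic_stage(mulan)
    acoustic = acoustic_stage(mulan, semantic)
    return soundstream_decode(acoustic)
```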
MusicLM demonstrated the potential of text-to-music alignment, but was not open-sourced due to training data copyright concerns.
Stable Audio (Stability AI, 2024)
Stable Audio took a fundamentally different route — Latent Diffusion Models (see the timing-conditioning sketch after this list):
- Uses a VAE to compress audio into latent space, performing the diffusion process in that space
- Introduces timing conditioning: the model can precisely control generated audio duration
- Adopts a DiT (Diffusion Transformer) architecture instead of U-Net
- Achieves text conditioning through a CLAP text encoder
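To illustrate the timing-conditioning idea, here is a hypothetical sketch: the start offset and total duration (in seconds) are embedded and appended to the text conditioning the DiT cross-attends to. The class name, sizes, and the discrete embedding are illustrative assumptions, not Stability AI's code:

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Hypothetical Stable Audio-style timing conditioning: embed the
    start offset and total length so the model learns duration control."""

    def __init__(self, dim: int = 768, max_seconds: int = 512):
        super().__init__()
        self.start_emb = nn.Embedding(max_seconds, dim)
        self.total_emb = nn.Embedding(max_seconds, dim)

    def forward(self, seconds_start: torch.Tensor, seconds_total: torch.Tensor):
        # (batch,) int tensors -> (batch, 2, dim) conditioning tokens
        return torch.stack(
            [self.start_emb(seconds_start), self.total_emb(seconds_total)], dim=1
        )

# These two tokens would be concatenated with the CLAP text tokens to
# form the cross-attention context, e.g. for a 47-second clip:
cond = TimingConditioner()(torch.tensor([0]), torch.tensor([47]))
```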
This represents a paradigm shift in music generation from autoregressive models to diffusion models — consistent with the trend in image generation.
Evolution Timeline
From Jukebox’s brute-force modeling to MusicGen’s elegant design to Stable Audio’s diffusion paradigm, the music generation field has evolved rapidly. The two technical approaches (autoregressive vs diffusion) each have strengths: autoregressive models excel at temporal coherence, while diffusion models perform better in audio quality and diversity.
Frontiers and Challenges
Despite rapid progress, music generation still faces key challenges:
- Long-range structure: Current models perform well within 30 seconds, but coherence and development across sections in full 3-5 minute tracks remain inadequate
- Multi-track control: Users want independent control over drums, bass, melody, and other tracks, rather than generating only mixed audio
- Copyright issues: Training data copyright attribution is a core barrier to commercialization
- Evaluation metrics: Music quality assessment still relies heavily on subjective listening, lacking reliable automated metrics
- Real-time generation: Interactive music creation requires low-latency generation, which current models struggle to achieve
Summary
| Model | Approach | Core Tech | Strengths | Limitations |
|---|---|---|---|---|
| Jukebox | VQ-VAE + AR | Multi-scale quantization | Direct raw audio modeling | Extremely slow |
| MusicLM | Hierarchical AR | MuLan alignment | Strong semantic understanding | Not open-source |
| MusicGen | Single-stage AR | Delay Pattern | Efficient, controllable, open | 30s limit |
| Stable Audio | Latent Diffusion | DiT + timing cond. | Good quality, flexible | High compute cost |
Key takeaways:
- Music modeling is an order of magnitude harder than speech: A wider frequency range, longer durations, and multiple superimposed sources give models far more complex signals to handle
- Delay Pattern is an elegant engineering innovation: Transforms the multi-codebook problem into an interleaved sequence solvable by a single model, dramatically improving efficiency
- AR and Diffusion each have strengths: Autoregressive preserves temporal coherence, diffusion delivers quality and diversity — the future may combine both
- From 9 hours to real-time: The efficiency leap from Jukebox to MusicGen demonstrates the enormous impact of architecture design