Music Generation: When Transformers Learn to Compose
Updated 2026-04-12
Introduction: The Unique Challenges of Music Generation
In the previous article, we saw how Whisper and VALL-E use Transformers to handle speech. But music generation is a fundamentally different challenge:
- Music spans a far wider frequency range than speech (20Hz-20kHz vs 85Hz-8kHz)
- A typical song lasts 3-5 minutes, far longer than speech segments
- Music contains multiple simultaneous instrument parts, not just a single voice
- Musical structure (beats, chord progressions, sections) is entirely different from linguistic structure
These characteristics mean that directly transferring speech model approaches to music generation is not feasible. This article explores how MusicGen, Jukebox, MusicLM, and others address these challenges.
Music vs Speech: Signal Comparison
The complexity of music signals directly impacts modeling strategies: a wider frequency range demands finer spectral representation; longer duration significantly increases sequence length; multiple sources mean the model must capture harmonic relationships between instruments. These constraints drive specialized architecture designs.
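To make the sequence-length pressure concrete, here is a back-of-the-envelope calculation. The 50 Hz frame rate and 4 codebooks below match the EnCodec configuration MusicGen uses for 32 kHz audio; treat them as representative values, not universal ones:

```python
# Back-of-the-envelope sequence lengths for a 3-minute track, assuming an
# EnCodec-style codec at a 50 Hz frame rate with 4 RVQ codebooks (the
# configuration MusicGen uses for 32 kHz audio).
frame_rate_hz = 50           # codec frames per second
num_codebooks = 4            # parallel residual codebook layers
duration_s = 3 * 60          # a typical song length

frames = frame_rate_hz * duration_s    # 9,000 codec frames
tokens = frames * num_codebooks        # 36,000 discrete tokens to model
print(f"{frames:,} frames -> {tokens:,} tokens")
```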
MusicGen: Efficient Single-Stage Generation
MusicGen (Meta, 2023) is currently the most influential open-source music generation model. Its core innovation: using a single Transformer and a clever Delay Pattern to solve the multi-codebook parallel generation problem.
Delay Pattern: Multi-Codebook Interleaving
After encoding music with EnCodec, we get multiple codebook layers (typically 4). The naive alternatives are costly: a "flattening pattern" serializes every layer's tokens into one sequence, multiplying the sequence length by the number of codebooks, while fully parallel prediction requires a separate model or forward pass per layer.
MusicGen’s Delay Pattern offsets each codebook layer by one timestep: layer 1 starts at timestep t, layer 2 at t+1, and so on. This allows all layers to be interleaved within a single Transformer, forming a diagonal fill pattern.
The elegance of the Delay Pattern lies in this: at any timestep t, the tokens the model predicts for the different codebooks come from different original time positions, avoiding causal dependency conflicts between layers and enabling a single autoregressive model to handle all codebook layers simultaneously.
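A minimal NumPy sketch of that interleaving follows. The function names are illustrative rather than MusicGen's actual code, and the real model fills the delayed positions with a dedicated special token rather than the -1 used here:

```python
import numpy as np

def apply_delay_pattern(codes: np.ndarray, pad_token: int) -> np.ndarray:
    """Shift codebook layer k right by k steps, padding the gaps.

    codes: (K, T) grid of codec tokens, one row per codebook layer.
    Returns a (K, T + K - 1) grid where row k is delayed by k timesteps,
    so a single autoregressive model can predict one column per step.
    """
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad_token, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

def undo_delay_pattern(delayed: np.ndarray) -> np.ndarray:
    """Invert apply_delay_pattern to recover the aligned (K, T) grid."""
    K = delayed.shape[0]
    T = delayed.shape[1] - (K - 1)
    return np.stack([delayed[k, k:k + T] for k in range(K)])

codes = np.arange(12).reshape(4, 3)   # toy grid: 4 codebooks x 3 frames
delayed = apply_delay_pattern(codes, pad_token=-1)
assert (undo_delay_pattern(delayed) == codes).all()
```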
MusicGen Generation Pipeline
MusicGen’s complete pipeline has four stages: text/melody condition input, T5 encoder processing text, Transformer decoder generating interleaved codec tokens, and EnCodec decoder converting tokens back to waveform.
MusicGen supports three conditioning modes (a usage sketch follows this list):
- Text only: T5 encoder converts text descriptions (e.g., “upbeat electronic dance track”) into cross-attention conditions
- Text + melody: Beyond text, a reference melody audio can be provided; the model maintains the melodic structure while matching the style described by text
- Unconditional: Free generation without any conditions
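Here is a usage sketch of the three modes based on Meta's audiocraft library; the calls follow its published README, but check the current API before relying on it, as it may have changed between versions:

```python
# Usage sketch for MusicGen's three conditioning modes via audiocraft.
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=8)  # seconds of audio to generate

# Text only: description -> cross-attention condition via T5
wav = model.generate(['upbeat electronic dance track'])

# Text + melody: keep the reference tune, restyle it per the description
melody, sr = torchaudio.load('reference_melody.wav')
wav = model.generate_with_chroma(['lo-fi hip hop version'], melody[None], sr)

# Unconditional: free generation
wav = model.generate_unconditional(num_samples=1)
```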
Jukebox: Multi-Scale VQ-VAE Approach
Jukebox (OpenAI, 2020) was a pioneering work in music generation. It took a completely different approach:
- Multi-scale VQ-VAE: Encodes raw audio into three discrete representation layers at different compression ratios (top/middle/bottom), with compression factors of 128x, 32x, and 8x respectively
- Hierarchical Transformers: Generates autoregressively starting from the coarsest top level, then progressively upsamples to finer levels
- Direct raw audio modeling: Works directly on 44.1kHz waveforms without Mel spectrograms or pretrained codecs
Jukebox can generate complete songs with lyrics, with impressive quality. But it has a critical weakness: extremely slow generation — producing one minute of audio requires approximately 9 hours of computation. This is because raw audio sequences are enormous (44,100 × 60 ≈ 2.6M samples per minute), remaining long even after VQ-VAE compression.
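The arithmetic is easy to verify; the compression factors listed above give these per-minute token counts:

```python
# Sequence lengths for one minute of 44.1 kHz audio at Jukebox's three
# VQ-VAE levels (compression factors 8x, 32x, 128x).
sample_rate = 44_100
samples = sample_rate * 60                      # 2,646,000 raw samples

for name, factor in [('bottom', 8), ('middle', 32), ('top', 128)]:
    print(f"{name:>6}: {samples // factor:>8,} tokens")
# bottom:  330,750 tokens   (still far beyond typical Transformer context)
# middle:   82,687 tokens
#    top:   20,671 tokens
```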
MusicLM and Stable Audio
MusicLM (Google, 2023)
MusicLM introduced MuLan — a music-text joint embedding model (analogous to CLIP for image-text). Its hierarchical generation strategy (sketched in code below):
- Map the text description to MuLan tokens, then generate high-level semantic tokens conditioned on them
- Generate SoundStream acoustic tokens (low-level detail) from the MuLan and semantic tokens
- The SoundStream decoder reconstructs the waveform
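Since the model was never released, the following is only a structural schematic with placeholder functions, mirroring the paper's cascade rather than any real implementation:

```python
# Schematic of MusicLM's cascade; every function below is a placeholder.
def mulan_text_tokens(text):
    ...  # MuLan text tower: description -> joint-embedding tokens

def semantic_stage(mulan):
    ...  # AR Transformer: MuLan tokens -> w2v-BERT-style semantic tokens

def acoustic_stage(mulan, semantic):
    ...  # AR Transformer: conditioning -> SoundStream acoustic tokens

def soundstream_decode(acoustic):
    ...  # neural codec decoder: tokens -> waveform

def text_to_music(text):
    mulan = mulan_text_tokens(text)
    semantic = semantic_stage(mulan)
    acoustic = acoustic_stage(mulan, semantic)
    return soundstream_decode(acoustic)
```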
MusicLM demonstrated the potential of text-to-music alignment, but was not open-sourced due to training data copyright concerns.
Stable Audio (Stability AI, 2024)
Stable Audio took a fundamentally different route — Latent Diffusion Models (see the timing-conditioning sketch after this list):
- Uses a VAE to compress audio into latent space, performing the diffusion process in that space
- Introduces timing conditioning: the model can precisely control generated audio duration
- Adopts a DiT (Diffusion Transformer) architecture instead of U-Net
- Achieves text conditioning through a CLAP text encoder
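To illustrate the timing-conditioning idea, here is a hypothetical sketch: the start offset and total duration (in seconds) are embedded and appended to the text conditioning the DiT cross-attends to. The class name, sizes, and the discrete embedding are illustrative assumptions, not Stability AI's code:

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Hypothetical Stable Audio-style timing conditioning: embed the
    start offset and total length so the model learns duration control."""

    def __init__(self, dim: int = 768, max_seconds: int = 512):
        super().__init__()
        self.start_emb = nn.Embedding(max_seconds, dim)
        self.total_emb = nn.Embedding(max_seconds, dim)

    def forward(self, seconds_start: torch.Tensor, seconds_total: torch.Tensor):
        # (batch,) int tensors -> (batch, 2, dim) conditioning tokens
        return torch.stack(
            [self.start_emb(seconds_start), self.total_emb(seconds_total)], dim=1
        )

# These two tokens would be concatenated with the CLAP text tokens to
# form the cross-attention context, e.g. for a 47-second clip:
cond = TimingConditioner()(torch.tensor([0]), torch.tensor([47]))
```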
This represents a paradigm shift in music generation from autoregressive models to diffusion models — consistent with the trend in image generation.
Evolution Timeline
From Jukebox’s brute-force modeling to MusicGen’s elegant design to Stable Audio’s diffusion paradigm, the music generation field has evolved rapidly. The two technical approaches (autoregressive vs diffusion) each have strengths: autoregressive models excel at temporal coherence, while diffusion models perform better in audio quality and diversity.
Frontiers and Challenges
Despite rapid progress, music generation still faces key challenges:
- Long-range structure: Current models perform well within 30 seconds, but coherence and development across sections in full 3-5 minute tracks remain inadequate
- Multi-track control: Users want independent control over drums, bass, melody, and other tracks, rather than generating only mixed audio
- Copyright issues: Training data copyright attribution is a core barrier to commercialization
- Evaluation metrics: Music quality assessment still relies heavily on subjective listening, lacking reliable automated metrics
- Real-time generation: Interactive music creation requires low-latency generation, which current models struggle to achieve
Summary
| Model | Approach | Core Tech | Strengths | Limitations |
|---|---|---|---|---|
| Jukebox | VQ-VAE + AR | Multi-scale quantization | Direct raw audio modeling | Extremely slow |
| MusicLM | Hierarchical AR | MuLan alignment | Strong semantic understanding | Not open-source |
| MusicGen | Single-stage AR | Delay Pattern | Efficient, controllable, open | 30s limit |
| Stable Audio | Latent Diffusion | DiT + timing cond. | Good quality, flexible | High compute cost |
Key takeaways:
- Music modeling is an order of magnitude harder than speech: A wider frequency range, longer durations, and multiple superimposed sources give models far more complex signals to handle
- Delay Pattern is an elegant engineering innovation: Transforms the multi-codebook problem into an interleaved sequence solvable by a single model, dramatically improving efficiency
- AR and Diffusion each have strengths: Autoregressive preserves temporal coherence, diffusion delivers quality and diversity — the future may combine both
- From 9 hours to real-time: The efficiency leap from Jukebox to MusicGen demonstrates the enormous impact of architecture design