Music Generation: When Transformers Learn to Compose

Updated 2026-04-12

Introduction: The Unique Challenges of Music Generation

In the previous article, we saw how Whisper and VALL-E use Transformers to handle speech. But music generation is a fundamentally different challenge:

  • Music spans a far wider frequency range than speech (20Hz-20kHz vs 85Hz-8kHz)
  • A typical song lasts 3-5 minutes, far longer than speech segments
  • Music contains multiple simultaneous instrument parts, not just a single voice
  • Musical structure (beats, chord progressions, sections) is entirely different from linguistic structure

These characteristics mean that directly transferring speech model approaches to music generation is not feasible. This article explores how MusicGen, Jukebox, MusicLM, and others address these challenges.

Music vs Speech: Signal Comparison

Figure: Speech vs Music signal comparison (waveform and spectrogram of each).

| Feature | Speech | Music |
| --- | --- | --- |
| Frequency range | Narrow (85 Hz-8 kHz) | Wide (20 Hz-20 kHz) |
| Typical duration | ~5 s (a sentence) | ~3 min (a song) |
| Sources | Single (voice) | Multiple (instruments + voice) |
| Structure | Linguistic (phoneme/word/sentence) | Musical (beat/chord/section) |
| Modeling difficulty | Moderate | Very high |

The complexity of music signals directly impacts modeling strategies: a wider frequency range demands finer spectral representation; longer duration significantly increases sequence length; multiple sources mean the model must capture harmonic relationships between instruments. These constraints drive specialized architecture designs.
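
To make the sequence-length problem concrete, here is a quick back-of-the-envelope calculation. It assumes an EnCodec-style codec at 50 frames per second with 4 codebooks (the configuration MusicGen uses at 32 kHz); the numbers are only meant to illustrate scale.

```python
# Rough sequence-length comparison: speech clip vs. full song,
# assuming an EnCodec-style codec at 50 frames/s with 4 codebooks.
FRAME_RATE_HZ = 50      # codec frames per second
NUM_CODEBOOKS = 4       # parallel RVQ codebook layers

def codec_tokens(duration_s: float) -> int:
    """Total discrete tokens needed to represent `duration_s` seconds of audio."""
    return int(duration_s * FRAME_RATE_HZ * NUM_CODEBOOKS)

print("5 s speech clip :", codec_tokens(5))     # 1,000 tokens
print("30 s music clip :", codec_tokens(30))    # 6,000 tokens
print("3 min song      :", codec_tokens(180))   # 36,000 tokens
```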

MusicGen: Efficient Single-Stage Generation

MusicGen (Meta, 2023) is one of the most influential open-source music generation models. Its core innovation is using a single Transformer with a clever Delay Pattern to solve the multi-codebook parallel generation problem.

Delay Pattern: Multi-Codebook Interleaving

After encoding music with EnCodec, we get multiple codebook layers (typically K = 4). The traditional approach (“flat pattern”) generates every layer’s token at each timestep sequentially, requiring either K separate models or K forward passes per frame.

MusicGen’s Delay Pattern offsets each codebook layer by one timestep: layer 1 starts at t = 0, layer 2 at t = 1, and so on. This allows all layers to be interleaved within a single Transformer, forming a diagonal fill pattern.

Figure: MusicGen Delay Pattern, multi-codebook interleaving. Each codebook (1-4) is offset by one step across timesteps t = 0 to t = 7, forming a diagonal fill that a single Transformer handles.

The elegance of the Delay Pattern lies in this: at any decoding step t, the codebook tokens the model must predict come from different original time positions, which avoids causal dependency conflicts between layers and lets a single autoregressive model handle all codebook layers simultaneously.
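
Below is a minimal sketch of the interleaving operation itself, assuming a [K, T] matrix of codec token ids. Padding is simplified here; the actual model reserves a special token id for the offset positions.

```python
import numpy as np

# Minimal sketch of MusicGen-style delay-pattern interleaving:
# codebook layer k is shifted right by k steps so that, at any decoding
# step, the K tokens being predicted come from different original frames.
PAD = -1  # placeholder for positions that hold no real token

def apply_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """codes: [K, T] array of codec token ids -> [K, T + K - 1] delayed view."""
    K, T = codes.shape
    delayed = np.full((K, T + K - 1), PAD, dtype=codes.dtype)
    for k in range(K):
        delayed[k, k:k + T] = codes[k]
    return delayed

def revert_delay_pattern(delayed: np.ndarray) -> np.ndarray:
    """Inverse operation: recover the original [K, T] alignment."""
    K, total = delayed.shape
    T = total - (K - 1)
    return np.stack([delayed[k, k:k + T] for k in range(K)])

codes = np.arange(4 * 8).reshape(4, 8)          # 4 codebooks, 8 frames
delayed = apply_delay_pattern(codes)
assert (revert_delay_pattern(delayed) == codes).all()
print(delayed)                                   # diagonal fill pattern
```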

MusicGen Generation Pipeline

MusicGen’s complete pipeline has four stages: text/melody conditioning as input, a T5 encoder that processes the text, a Transformer decoder that generates interleaved codec tokens, and an EnCodec decoder that converts tokens back to a waveform.

Figure: MusicGen generation pipeline. A text description (e.g., “Upbeat guitar melody”) and optional melody audio feed a T5 encoder (text to embeddings), a single autoregressive Transformer with the Delay Pattern, and an EnCodec decoder (tokens to waveform). Key point: single Transformer + Delay Pattern = efficient multi-codebook generation.

MusicGen supports three conditioning modes (a usage sketch follows the list):

  • Text only: T5 encoder converts text descriptions (e.g., “upbeat electronic dance track”) into cross-attention conditions
  • Text + melody: Beyond text, a reference melody audio can be provided; the model maintains the melodic structure while matching the style described by text
  • Unconditional: Free generation without any conditions
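
As a concrete illustration, here is a minimal usage sketch based on Meta's audiocraft library. Model names and API details may differ between versions, so treat this as a sketch rather than a reference.

```python
# Minimal text-to-music sketch using Meta's audiocraft library.
# Assumes `pip install audiocraft`.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=10)  # seconds of audio to generate

# 1) Text-only conditioning
wavs = model.generate(["upbeat electronic dance track with a driving bassline"])

# 2) Text + melody conditioning (melody-capable checkpoints only), e.g.:
#    melody, sr = torchaudio.load("reference_melody.wav")
#    wavs = model.generate_with_chroma(["lo-fi piano cover"], melody[None], sr)

# 3) Unconditional generation:
#    wavs = model.generate_unconditional(num_samples=1)

for i, wav in enumerate(wavs):
    audio_write(f"musicgen_sample_{i}", wav.cpu(), model.sample_rate)
```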

Jukebox: Multi-Scale VQ-VAE Approach

Jukebox (OpenAI, 2020) was a pioneering work in music generation. It took a completely different approach:

  1. Multi-scale VQ-VAE: Encodes raw audio into three discrete representation layers at different compression ratios (top/middle/bottom), with compression factors of 128x, 32x, and 8x respectively
  2. Hierarchical Transformers: Generates autoregressively starting from the coarsest top level, then progressively upsamples to finer levels
  3. Direct raw audio modeling: Works directly on 44.1kHz waveforms without Mel spectrograms or pretrained codecs

Jukebox can generate complete songs with lyrics, with impressive quality. But it has a critical weakness: extremely slow generation — producing one minute of audio requires approximately 9 hours of computation. This is because raw audio sequences are enormous (44100 x 60 = 2.6M samples/minute), remaining long even after VQ-VAE compression.
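
Using the sample rate and compression factors above, a quick calculation shows why even the compressed token sequences stay very long:

```python
# Rough token counts for one minute of 44.1 kHz audio at Jukebox's
# three VQ-VAE levels (hop lengths 8x, 32x, 128x).
SAMPLE_RATE = 44_100
SECONDS = 60
samples = SAMPLE_RATE * SECONDS          # 2,646,000 raw samples per minute

for level, hop in [("bottom", 8), ("middle", 32), ("top", 128)]:
    print(f"{level:>6} level: {samples // hop:,} tokens/minute")
# bottom level: 330,750 tokens/minute
# middle level: 82,687 tokens/minute
#    top level: 20,671 tokens/minute
```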

MusicLM and Stable Audio

MusicLM (Google, 2023)

MusicLM builds on MuLan, a music-text joint embedding model (analogous to CLIP for image-text). Its hierarchical generation strategy (sketched after the list):

  1. Map the text description to MuLan tokens and generate semantic tokens from them (high-level semantics)
  2. Generate SoundStream acoustic tokens from semantic tokens (low-level details)
  3. SoundStream decoder reconstructs the waveform
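
Since MusicLM is closed source, the following is only a hypothetical sketch of the cascade; every function name is made up purely to show the data flow, not a real API.

```python
# Hypothetical sketch of MusicLM's cascaded stages (names are illustrative only).

def text_to_mulan_tokens(prompt: str) -> list[int]:
    """Embed the prompt with MuLan and quantize it into conditioning tokens."""
    ...

def generate_semantic_tokens(mulan_tokens: list[int]) -> list[int]:
    """Stage 1: autoregressively generate high-level semantic tokens."""
    ...

def generate_acoustic_tokens(semantic_tokens: list[int]) -> list[int]:
    """Stage 2: autoregressively generate SoundStream acoustic tokens."""
    ...

def decode_waveform(acoustic_tokens: list[int]):
    """Stage 3: the SoundStream decoder reconstructs the waveform."""
    ...

def text_to_music(prompt: str):
    return decode_waveform(
        generate_acoustic_tokens(
            generate_semantic_tokens(text_to_mulan_tokens(prompt))))
```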

MusicLM demonstrated the potential of text-to-music alignment, but was not open-sourced due to training data copyright concerns.

Stable Audio (Stability AI, 2024)

Stable Audio took a fundamentally different route — Latent Diffusion Models:

  • Uses a VAE to compress audio into latent space, performing the diffusion process in that space
  • Introduces timing conditioning: the model can precisely control generated audio duration
  • Adopts a DiT (Diffusion Transformer) architecture instead of U-Net
  • Achieves text conditioning through a CLAP text encoder

This represents a paradigm shift in music generation from autoregressive models to diffusion models — consistent with the trend in image generation.
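
For a hands-on impression, here is a minimal sketch using the diffusers StableAudioPipeline with the stable-audio-open-1.0 checkpoint (an open sibling of the commercial model that uses a T5 text encoder rather than CLAP). Argument names may vary between diffusers versions, so treat this as a sketch.

```python
# Minimal text-to-audio sketch with Stable Audio Open via diffusers.
# Assumes `pip install diffusers soundfile` and a CUDA GPU.
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

audio = pipe(
    prompt="warm analog synth arpeggio, 120 BPM",
    negative_prompt="low quality, distortion",
    num_inference_steps=100,
    audio_end_in_s=15.0,          # timing conditioning: exact output duration
    generator=torch.Generator("cuda").manual_seed(0),
).audios

wav = audio[0].T.float().cpu().numpy()
sf.write("stable_audio_sample.wav", wav, pipe.vae.sampling_rate)
```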

Evolution Timeline

  • 2020-04: Jukebox (OpenAI), autoregressive
  • 2023-01: MusicLM (Google), autoregressive
  • 2023-06: MusicGen (Meta), autoregressive
  • 2024-01: Stable Audio (Stability AI), diffusion
  • 2024+: Udio / Suno, commercial services

From Jukebox’s brute-force modeling to MusicGen’s elegant design to Stable Audio’s diffusion paradigm, the music generation field has evolved rapidly. The two technical approaches (autoregressive vs diffusion) each have strengths: autoregressive models excel at temporal coherence, while diffusion models perform better in audio quality and diversity.

Frontiers and Challenges

Despite rapid progress, music generation still faces key challenges:

  1. Long-range structure: Current models perform well within 30 seconds, but coherence and development across sections in full 3-5 minute tracks remain inadequate
  2. Multi-track control: Users want independent control over drums, bass, melody, and other tracks, rather than generating only mixed audio
  3. Copyright issues: Training data copyright attribution is a core barrier to commercialization
  4. Evaluation metrics: Music quality assessment still relies heavily on subjective listening, lacking reliable automated metrics
  5. Real-time generation: Interactive music creation requires low-latency generation, which current models struggle to achieve

Summary

| Model | Approach | Core Tech | Strengths | Limitations |
| --- | --- | --- | --- | --- |
| Jukebox | VQ-VAE + AR | Multi-scale quantization | Direct raw audio modeling | Extremely slow |
| MusicLM | Hierarchical AR | MuLan alignment | Strong semantic understanding | Not open-source |
| MusicGen | Single-stage AR | Delay Pattern | Efficient, controllable, open | 30 s limit |
| Stable Audio | Latent Diffusion | DiT + timing conditioning | Good quality, flexible | High compute cost |

Key takeaways:

  1. Music modeling is an order of magnitude harder than speech: Wider frequency, longer duration, multi-source superposition require models to handle far more complex signals
  2. Delay Pattern is an elegant engineering innovation: Transforms the multi-codebook problem into an interleaved sequence solvable by a single model, dramatically improving efficiency
  3. AR and Diffusion each have strengths: Autoregressive preserves temporal coherence, diffusion delivers quality and diversity — the future may combine both
  4. From 9 hours to real-time: The efficiency leap from Jukebox to MusicGen demonstrates the enormous impact of architecture design