Transformer Across Modalities
From text representation to multimodal generation: understand how Transformers adapt to text, image, audio, and video modalities. Recommended: complete the Transformer Core Mechanisms path first.
1. From Text to Vectors: Tokenization and Word Embeddings
   Beginner · #tokenization #embedding #word2vec #nlp
2. BERT and GPT: Two Paths - Understanding vs. Generation
   Intermediate · #bert #gpt #pretraining #nlp #nlu #classification #generation
3. Sentence Embeddings: From Token-Level to Semantic Retrieval
   Intermediate · #sentence-embeddings #contrastive-learning #rag #retrieval #sbert
4. Vision Transformer: When Images Become Token Sequences
   Intermediate · #vision-transformer #vit #image-recognition #computer-vision
5. Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
   Intermediate · #clip #multimodal #contrastive-learning #zero-shot #vision-language
6. Diffusion Model Fundamentals: Generating from Noise
   Intermediate · #diffusion #ddpm #generative-model #image-generation
7. Diffusion Transformer: Image Generation with Transformers
   Advanced · #dit #diffusion #transformer #image-generation #stable-diffusion
8. Video Generation: Spatiotemporal Attention and the Sora Architecture
   Advanced · #video-generation #sora #spatiotemporal-attention #dit #diffusion
9. Speech and Transformers: From Whisper to VALL-E
   Advanced · #audio #speech #whisper #vall-e #tts #transformer
10. Music Generation: When Transformers Learn to Compose
    Advanced · #music-generation #musicgen #jukebox #transformer #audio