Transformer Across Modalities
From text representation to multimodal generation: understand how Transformers adapt to text, image, audio, and video modalities. Recommended: complete the Transformer Core Mechanisms path first.
1. From Text to Vectors: Tokenization and Word Embeddings
   Beginner · #tokenization #embedding #word2vec #nlp
2. BERT and GPT: Two Paths - Understanding vs. Generation
   Intermediate · #bert #gpt #pretraining #nlp #nlu #classification #generation
3. Sentence Embeddings: From Token-Level to Semantic Retrieval
   Intermediate · #sentence-embeddings #contrastive-learning #rag #retrieval #sbert
4. Vision Transformer: When Images Become Token Sequences
   Intermediate · #vision-transformer #vit #image-recognition #computer-vision
5. Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces
   Intermediate · #clip #multimodal #contrastive-learning #zero-shot #vision-language
6. Diffusion Model Fundamentals: Generating from Noise
   Intermediate · #diffusion #ddpm #generative-model #image-generation
7. Diffusion Transformer: Image Generation with Transformers
   Advanced · #dit #diffusion #transformer #image-generation #stable-diffusion
8. Video Generation: Spatiotemporal Attention and the Sora Architecture
   Advanced · #video-generation #sora #spatiotemporal-attention #dit #diffusion
9. Speech and Transformers: From Whisper to VALL-E
   Advanced · #audio #speech #whisper #vall-e #tts #transformer
10. Music Generation: When Transformers Learn to Compose
    Advanced · #music-generation #musicgen #jukebox #transformer #audio