
Transformer Across Modalities

From text representation to multimodal generation: understand how Transformers adapt to text, image, audio, and video modalities. Recommended: complete the Transformer Core Mechanisms path first.

  1. From Text to Vectors: Tokenization and Word Embeddings

    Beginner
    #tokenization#embedding#word2vec#nlp
  2. BERT and GPT: Two Paths - Understanding vs Generation

    Intermediate
    #bert#gpt#pretraining#nlp#nlu#classification#generation
  3. Sentence Embeddings: From Token-Level to Semantic Retrieval

    Intermediate
    #sentence-embeddings#contrastive-learning#rag#retrieval#sbert
  4. Vision Transformer: When Images Become Token Sequences

    Intermediate
    #vision-transformer#vit#image-recognition#computer-vision
  5. Multimodal Alignment: CLIP and Cross-Modal Embedding Spaces

    Intermediate
    #clip#multimodal#contrastive-learning#zero-shot#vision-language
  6. Diffusion Model Fundamentals: Generating from Noise

    Intermediate
    #diffusion#ddpm#generative-model#image-generation
  7. Diffusion Transformer: Image Generation with Transformers

    Advanced
    #dit#diffusion#transformer#image-generation#stable-diffusion
  8. Video Generation: Spatiotemporal Attention and the Sora Architecture

    Advanced
    #video-generation#sora#spatiotemporal-attention#dit#diffusion
  9. Speech and Transformers: From Whisper to VALL-E

    Advanced
    #audio#speech#whisper#vall-e#tts#transformer
  10. Music Generation: When Transformers Learn to Compose

    Advanced
    #music-generation#musicgen#jukebox#transformer#audio