transformers

Mastering Transformers

Mastering Transformers in 60 days and becoming an AI Researcher.

Goal and resources.

Each word here was thought, not generated.

Table of Contents

Basics


Transformers vs LSTMs



Encoder Architecture

encoder

Each Section Briefly Explained



Attention Details

This is the most relevant piece.

$ \text{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V $

Prerequisites:


Basics of Q,K,V Matrices

Q, K, V = X @ W_Q, X @ W_K, X @ W_V. Here n @ m means n projected through m, i.e. n transformed by m (a matrix multiplication). The W prefix refers to weights (trainable parameters). A minimal code sketch follows the definitions below.

X: the matrix of input token embeddings, one row per token. Every projection below starts from X.

Q: W_Q learns what each token should search for. The X @ W_Q projection is Q; q_$word then says “these are the things I’m looking for in the other tokens”, e.g. from the walkthrough below, “as ‘car’, I need to find my determiners, attributes, and related predicates”.

K: W_K holds identity templates. The X @ W_K projection is K, which uses W_K’s learned ability to determine searchable params/features and make them specific to X. k_$word then says “these are the things I can be matched against / searched for”, which based on the example above could be “I can be searched for when someone is looking for nouns, adjectives, connectors”, etc.

V: “As V_$word, this is the information I’ll contribute when someone pays attention to me.” e.g. W_V learns to extract the most useful information from each token: semantic meaning, grammatical role, contextual features, etc.
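A minimal sketch of these three projections (assuming PyTorch; the toy sizes and variable names here are illustrative, not from the notes above):

import torch

torch.manual_seed(0)
seq_len, d_model, d_k = 4, 8, 8           # toy sizes: 4 tokens, e.g. "The car is red"
X = torch.randn(seq_len, d_model)         # input embeddings, one row per token

# trainable projection weights, learned during training
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q = X @ W_Q   # "what each token is looking for"
K = X @ W_K   # "what each token can be matched against"
V = X @ W_V   # "what each token contributes when attended to"

print(Q.shape, K.shape, V.shape)   # torch.Size([4, 8]) for each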


Formula Part 1

$(Q K^\top)$ - Q @ K^T

During training, the model discovers relationships like the ones below:

Walkthrough:

_input = "The car is red"

# Q_the = "As 'the', I need to find: the noun I'm modifying"
# Q_car = "As 'car', I need to find: my attributes, determiners, and related predicates" 

# K_the = "I can be matched as: determiner/article pattern"
# K_car = "I can be matched as: noun/subject/entity pattern"

# (used later)
# V_the = "I provide: definiteness, specificity signals"
# V_car = "I provide: vehicle semantics, subject-entity information"

similarity(Q_car, K_the) = HIGH   # car needs its determiner
similarity(Q_car, K_is)  = HIGH   # car needs its predicate  
similarity(Q_car, K_red) = HIGH   # car needs its attribute
similarity(Q_car, K_car) = LOW    # most tokens have low attention to themselves


attention_scores = Q @ K.T  # Shape: (4, 4)

attention_scores = [
  [0.1, 0.8, 0.3, 0.2],  # How much "The" wants to attend to [The, car, is, red]
  [0.2, 0.9, 0.7, 0.6],  # How much "car" wants to attend to [The, car, is, red]  
  [0.1, 0.8, 0.4, 0.5],  # How much "is" wants to attend to [The, car, is, red]
  [0.3, 0.9, 0.2, 0.1]   # How much "red" wants to attend to [The, car, is, red]
]
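Continuing the projection sketch from above, the raw scores are a single matrix multiplication; the (seq_len, seq_len) shape is the key point (the actual values will differ from the illustrative numbers above):

attention_scores = Q @ K.transpose(-2, -1)   # (4, 8) @ (8, 4) -> (4, 4)
print(attention_scores.shape)                # row i = how much token i wants to attend to each token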


Formula Part 2

$ \mathrm{softmax}\left(\frac{\text{Part 1}}{\sqrt{d_k}}\right) V $


2.1. What does the /√dk mean?
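Dividing by √dk keeps the dot products from growing with the key dimension: if the components of Q and K are roughly unit-variance, a dot product over d_k terms has variance ≈ d_k, and very large scores push softmax toward a near one-hot distribution where gradients vanish. Scaling by 1/√dk brings the variance back to ≈ 1. A small demo of the effect (assuming PyTorch; the numbers are illustrative):

import torch

torch.manual_seed(0)
d_k = 512
q, k = torch.randn(1000, d_k), torch.randn(1000, d_k)

raw = (q * k).sum(dim=-1)        # dot products without scaling
scaled = raw / d_k ** 0.5        # with the 1/sqrt(d_k) scaling

print(raw.std(), scaled.std())   # ~sqrt(512) ≈ 22.6 vs ~1.0
print(torch.softmax(torch.tensor([22.6, 0.0, -22.6]), dim=-1))  # nearly one-hot
print(torch.softmax(torch.tensor([1.0, 0.0, -1.0]), dim=-1))    # much smoother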


2.2. Softmax turns a bunch of values into percentages that sum to 1 (a probability distribution). See the PyTorch documentation.

Once we have Part 1 / √dk, we apply a softmax to get the percentage of attention each token should pay to every other token; these are the attention_weights.

attention_weights meaning: as the token “The”: “I want 40% of car info, 25% of red info, 35% of is info.”
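For example, applying softmax to the first row of the illustrative scores from Part 1 (assuming PyTorch; the hand-written weights in the walkthrough below are illustrative, so they won’t match these outputs exactly):

import torch

row = torch.tensor([0.1, 0.8, 0.3, 0.2])   # raw scores for "The" vs [The, car, is, red]
weights = torch.softmax(row, dim=-1)
print(weights)        # ≈ [0.19, 0.38, 0.23, 0.21]
print(weights.sum())  # tensor(1.)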


2.3. attention_weights @ V


Walkthrough:

attention_weights = softmax(attention_scores)  # Each row sums to 1.0

attention_weights = [
  [0.1, 0.5, 0.2, 0.2],  # "The" attention distribution  
  [0.1, 0.4, 0.3, 0.2],  # "car" attention distribution
  [0.1, 0.4, 0.2, 0.3],  # "is" attention distribution  
  [0.2, 0.6, 0.1, 0.1]   # "red" attention distribution
]

# skipping /square root of dk

# attention_weights @ V

V = [
  [0.1, 0.2, 0.3, 0.4],  # V_the: determiner information
  [1.0, 0.8, 0.5, 0.2],  # V_car: vehicle/noun information  
  [0.3, 0.9, 0.1, 0.3],  # V_is: linking-verb information
  [0.7, 0.2, 0.8, 1.0]   # V_red: color/adjective information
]

# softmax(Q * K^T) * V

The final result: each token is moved around the semantic space based on the other tokens it attends to, updating its representation into a deeper, context-aware meaning within the whole input sequence.
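Putting the whole formula together, a minimal single-head scaled dot-product attention sketch (assuming PyTorch, no masking; names and sizes are illustrative):

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # Part 1, with the /sqrt(d_k) scaling
    weights = torch.softmax(scores, dim=-1)             # Part 2: each row sums to 1
    return weights @ V, weights                         # Part 3: mix the value vectors

torch.manual_seed(0)
seq_len, d_model = 4, 8                                 # 4 tokens: "The car is red"
X = torch.randn(seq_len, d_model)
W_Q, W_K, W_V = (torch.randn(d_model, d_model) for _ in range(3))

out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)        # torch.Size([4, 8]): one updated, context-aware vector per token
print(weights.sum(-1))  # tensor([1., 1., 1., 1.])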



Residual Layers (Add & Norm)

residual

Add

Norm
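A minimal sketch of how Add (the residual connection) and Norm (LayerNorm) wrap a sub-layer, assuming the post-norm arrangement of the original Transformer (PyTorch; the sub-layer here is just a stand-in):

import torch
import torch.nn as nn

d_model = 8
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for attention or the FFN

x = torch.randn(4, d_model)              # 4 token representations
out = norm(x + sublayer(x))              # Add (residual) then Norm
print(out.shape)                         # torch.Size([4, 8])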


Position Wise FFN

ffn_pos_wise
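A minimal sketch of the position-wise feed-forward network, assuming the original Transformer’s two linear layers with a ReLU in between, applied identically at every token position (PyTorch; the sizes are illustrative):

import torch
import torch.nn as nn

d_model, d_ff = 8, 32                    # the original paper uses 512 and 2048

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),            # expand
    nn.ReLU(),                           # the non-linearity (see "So, wtf is Linearity" below)
    nn.Linear(d_ff, d_model),            # project back down
)

x = torch.randn(4, d_model)              # 4 token representations
print(ffn(x).shape)                      # torch.Size([4, 8]): same FFN applied to each position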

So, wtf is Linearity



FAQ


Decoder Architecture


Positional Encoding & Embeddings

positional

so confusing


Quoting my explanation above:

Provide the attention layer with the position of each token (in LSTMs this information is known by default because the process is sequential; not so in Transformer attention).

It’s not that row order can’t be traced through the operations; rather, attention is computed in parallel and only from the embedding data, so we need to insert the position into the embedding itself for attention to understand it, e.g. that the noun comes before the subject, or things like that.
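A minimal sketch of the sinusoidal positional encodings from the original paper being added to the embeddings before attention ever sees them (assuming PyTorch; sizes are illustrative):

import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

seq_len, d_model = 4, 8
embeddings = torch.randn(seq_len, d_model)                          # "The car is red" embeddings
x = embeddings + sinusoidal_positional_encoding(seq_len, d_model)   # position is now inside the vectors
print(x.shape)  # torch.Size([4, 8])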



Training


Inference



Interpretability and Visualizing


Arguing Architectural Decisions

Tasks