Understanding Transformers with GPT-2 Code
import math
from dataclasses import dataclass
import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class GPTConfig:
    vocab_size: int
    block_size: int
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.1

class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head, self.n_embd = config.n_head, config.n_embd
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.resid_drop = nn.Dropout(config.dropout)
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size)).view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        head_dim = C // self.n_head
        q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(head_dim))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.resid_drop(self.c_proj(y))

class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.proj = nn.Linear(4 * config.n_embd, config.n_embd)
        self.drop = nn.Dropout(config.dropout)

    def forward(self, x):
        return self.drop(self.proj(F.gelu(self.fc(x))))

class Block(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

class GPT2(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)
        self.wpe = nn.Embedding(config.block_size, config.n_embd)
        self.drop = nn.Dropout(config.dropout)
        self.h = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight

    def forward(self, idx, targets=None):
        B, T = idx.size()
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device).unsqueeze(0)
        x = self.wte(idx) + self.wpe(pos)
        x = self.drop(x)
        for block in self.h:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1)) if targets is not None else None
        return logits, loss

Let's break down the code segment by segment and understand transformers
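Before we dissect each piece, here is a minimal sanity-check sketch of how the model above can be instantiated and called. The tiny sizes are illustrative assumptions, not GPT-2's real configuration:

```python
# Toy-sized instantiation of the GPT2 class defined above (sizes are assumptions for illustration)
config = GPTConfig(vocab_size=100, block_size=16, n_layer=2, n_head=2, n_embd=32)
model = GPT2(config)

idx = torch.randint(0, config.vocab_size, (1, 8))       # (B=1, T=8) random token IDs
targets = torch.randint(0, config.vocab_size, (1, 8))   # next-token targets, same shape

logits, loss = model(idx, targets)
print(logits.shape)  # torch.Size([1, 8, 100]) -> one score per vocabulary entry, per position
print(loss)          # a scalar cross-entropy loss
```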
1. Data Config
These are the knobs we turn to make the model bigger or smaller
@dataclass
class GPTConfig:
    vocab_size: int
    block_size: int       # max sequence length (context window)
    n_layer: int = 12     # Number of Transformer Blocks to stack
    n_head: int = 12      # Number of attention "heads"
    n_embd: int = 768     # The dimensionality of our vectors
    dropout: float = 0.1

| Field | Role | Meaning | GPT-2 value |
|---|---|---|---|
| `vocab_size` | Vocabulary size | How many unique words the model knows | 50,257 |
| `block_size` | Context window | How far back the model can "see" at once | 1,024 |
| `n_layer` | Model depth | How many blocks are stacked; more layers → more powerful | 12 |
| `n_head` | Model width | How many parallel "conversations" attention can have | 12 |
| `n_embd` | Embedding dimension | The "size" of the vectors representing each token | 768 |
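As a quick illustration, the GPT-2 small values from the table can be plugged straight into the dataclass (the two required fields have no defaults, so they must be passed explicitly):

```python
# GPT-2 small, expressed with the dataclass above
config = GPTConfig(vocab_size=50257, block_size=1024)
print(config)
# GPTConfig(vocab_size=50257, block_size=1024, n_layer=12, n_head=12, n_embd=768, dropout=0.1)
```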
2. Word-Vector Dictionary (Token Embeddings)
We convert the raw input (an array of token IDs) into vectors that can be processed by neural networks
class GPT2(nn.Module):
    def __init__(self, config):
        # We are building THIS line now.
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)  # Word Token Embedding
        self.wpe = nn.Embedding(...)  # (Next chapter)
        # The rest of the model
        self.h = nn.ModuleList(...)
        self.ln_f = nn.LayerNorm(...)
        self.lm_head = nn.Linear(...)
- The input to our model is a tensor of token IDs, like `torch.tensor([[5, 21]])`. These are categorical numbers: the ID 21 doesn't have 4.2 times the "value" of ID 5. The numerical distance between them is arbitrary and meaningless. A neural network, which relies on matrix multiplication and gradient descent, cannot learn from these raw IDs; they are just pointers
- Thus we need to convert each ID into a vector and map it into a vector space (in GPT-2's case, a 768-dimensional space), so the model has something it can actually compute with and learn from
- This conversion is just a lookup into a single weight matrix with one row per token. PyTorch exposes it as the built-in `nn.Embedding(vocab_size, n_embd)` layer
- The embedding matrix is an ordinary learnable parameter (`requires_grad=True`), so its vectors are updated during training
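A minimal sketch of that lookup, with toy sizes assumed (a 30-token vocabulary and 4-dimensional vectors rather than GPT-2's 50,257 and 768):

```python
import torch
import torch.nn as nn

wte = nn.Embedding(30, 4)            # 30 rows (one per token ID), each a learnable 4-d vector
print(wte.weight.shape)              # torch.Size([30, 4])
print(wte.weight.requires_grad)      # True -> the vectors are updated during training

idx = torch.tensor([[5, 21]])        # (B=1, T=2) token IDs
vectors = wte(idx)                   # simply looks up rows 5 and 21 of the weight matrix
print(vectors.shape)                 # torch.Size([1, 2, 4])
```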
One question still remains: how can a token's vector representation be adjusted based on its context? The meaning of the word "bank" is different in different sentences. Before the model can resolve that, it first needs a sense of order, and that is what positional embeddings provide
3. Positional Embeddings
Positional embeddings give the vectors a sense of order, so the model can tell sentences like "Dog bites man" and "Man bites dog" apart
class GPT2(nn.Module):
    def __init__(self, config):
        self.wte = nn.Embedding(...)  # (Done)
        # We are building THIS line now.
        self.wpe = nn.Embedding(config.block_size, config.n_embd)  # Positional Embedding
        self.h = nn.ModuleList(...)
        self.ln_f = nn.LayerNorm(...)
        self.lm_head = nn.Linear(...)

- Our current output is a tensor of shape `(B, T, C)`, where `C` is `n_embd`. For a sequence like `["Man", "bites", "dog"]`, the model receives a set of three vectors: `{vector("Man"), vector("bites"), vector("dog")}`. If we shuffled the input, the model would receive the exact same set of vectors, just in a different order along the `T` dimension. The core processing layers (the Transformer blocks) are designed to be order-invariant, so without modification, they would produce the same result. We need to explicitly "stamp" each token's vector with its position
- The solution used in GPT is wonderfully simple. Just as there is a unique vector for each word, we will also learn a unique vector for each position. To do this we just add one more `nn.Embedding` layer
  - We'll have a vector that means "I am at the 1st position"
  - We'll have another vector that means "I am at the 2nd position"
  - ...and so on, up to the maximum sequence length (`block_size`)
- From the combined vector, the model can still recover which word it is looking at (from the token embedding part) and can figure out the order with the help of the positional embeddings
- It works because the token and positional embeddings exist in the same high-dimensional space, and the model can learn to interpret the combined vector. During training, it learns to create positional vectors such that adding `vector(pos=N)` to `vector(word=W)` produces a unique representation that distinguishes it from the same word at a different position. The network learns to "understand" this composition
Q What if the input sequence is shorter than block_size?
A This is the normal case! For example, if `block_size` is 8 but the input sequence length `T` is only 5, the code `torch.arange(0, T, ...)` handles this perfectly. We only generate and look up the positional embeddings for the sequence length we are currently processing. We never use the full `block_size` unless our input is that long
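A small sketch of how the two embedding tables combine inside the forward pass (toy sizes assumed):

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 30, 8, 4   # toy values
wte = nn.Embedding(vocab_size, n_embd)      # token embeddings
wpe = nn.Embedding(block_size, n_embd)      # positional embeddings

idx = torch.tensor([[5, 21, 7, 3, 9]])      # (B=1, T=5), shorter than block_size
T = idx.size(1)
pos = torch.arange(0, T).unsqueeze(0)       # tensor([[0, 1, 2, 3, 4]])

x = wte(idx) + wpe(pos)                     # the same token at a different position now gets a different vector
print(x.shape)                              # torch.Size([1, 5, 4])
```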
4. Self Attention
We established that our model gives the same starting vector to a word regardless of its context. This is a problem for ambiguous words. Consider the word "crane": its meaning is different in "Crane ate a fish" and "Crane lifted steel", yet the starting vector is identical for both sentences. We need to update this vector based on its neighbors

The most important component in the Transformer is causal self-attention, and it is all governed by this one formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

This entire process happens in three steps: Scoring, Normalizing and Aggregating

- The variables `Q`, `K` and `V` are three distinct vectors for each single word, created by projecting the input vector `x`:
  - Query (Q): The word's "search query". What it's looking for
  - Key (K): The word's "label" or "keyword". What it is
  - Value (V): The word's "payload". The information it offers. We use this instead of `x` itself because the raw information is not always the best information to share; the `V` vector is a transformed version of `x`, specifically packaged for other tokens to consume
Let’s run the self attention on the above “Crane” example. For now let’s consider 2D space
- Dimension 1: Represents “Is it an Animal?”
- Dimension 2: Represents “Is it a Machine?”
The ambiguous word “crane” will have vectors balanced between these possibilities
| Token | Q - “I’m looking for…” | K - “I am…” | V - “I offer this info…“ |
|---|---|---|---|
| ate | … | [0.9, 0.1] (High Animal) | [0.9, 0.1] |
| fish | … | [0.9, 0.1] (High Animal) | [0.8, 0.2] |
| lifted | … | [0.1, 0.9] (High Machine) | [0.1, 0.9] |
| steel | … | [0.1, 0.9] (High Machine) | [0.2, 0.8] |
| crane | [0.7, 0.7] | [0.7, 0.7] | [0.5, 0.5] (Ambiguous) |
Sentence 1: "Crane ate fish"
- Scoring (`QK^T`): The `crane` token uses its query `[0.7, 0.7]` to probe all keys in the sentence.
  - Score(crane → crane): `[0.7, 0.7] ⋅ [0.7, 0.7]` = 0.49 + 0.49 = 0.98
  - Score(crane → ate): `[0.7, 0.7] ⋅ [0.9, 0.1]` = 0.63 + 0.07 = 0.70
  - Score(crane → fish): `[0.7, 0.7] ⋅ [0.9, 0.1]` = 0.63 + 0.07 = 0.70
- Normalizing (`softmax`): The raw scores `[0.98, 0.70, 0.70]` are converted to percentages.
  - Attention weights: `[0.4, 0.3, 0.3]`. This means `crane` will construct its new self by listening 40% to its original self, 30% to `ate`, and 30% to `fish`.
- Aggregating (`...V`): The new vector for `crane` is a weighted sum of the Values.
  - New_Vector(crane) = 0.4*V(crane) + 0.3*V(ate) + 0.3*V(fish)
  - New_Vector(crane) = 0.4*[0.5, 0.5] + 0.3*[0.9, 0.1] + 0.3*[0.8, 0.2]
  - New_Vector(crane) = [0.20, 0.20] + [0.27, 0.03] + [0.24, 0.06] = [0.71, 0.29]
The result is a new “crane” vector that is heavily skewed towards Dimension 1 (Animal). The context from ate and fish has resolved the ambiguity. It’s a bird
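If you want to check the Sentence 1 arithmetic yourself, here is a small sketch using the hand-picked toy vectors from the table (assumed values, not learned ones):

```python
import torch
import torch.nn.functional as F

q_crane = torch.tensor([0.7, 0.7])       # crane's query
K = torch.tensor([[0.7, 0.7],            # key of "crane"
                  [0.9, 0.1],            # key of "ate"
                  [0.9, 0.1]])           # key of "fish"
V = torch.tensor([[0.5, 0.5],            # value of "crane"
                  [0.9, 0.1],            # value of "ate"
                  [0.8, 0.2]])           # value of "fish"

scores = K @ q_crane                     # tensor([0.9800, 0.7000, 0.7000])
weights = F.softmax(scores, dim=0)       # ≈ tensor([0.40, 0.30, 0.30])
new_crane = weights @ V                  # ≈ tensor([0.71, 0.29]) -> skewed towards "Animal"
print(scores, weights, new_crane)
```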
Sentence 2: “Crane lifted steel”
- Scoring (`QK^T`): `crane` uses the exact same query `[0.7, 0.7]` on its new neighbors.
  - Score(crane → crane): `[0.7, 0.7] ⋅ [0.7, 0.7]` = 0.98
  - Score(crane → lifted): `[0.7, 0.7] ⋅ [0.1, 0.9]` = 0.07 + 0.63 = 0.70
  - Score(crane → steel): `[0.7, 0.7] ⋅ [0.1, 0.9]` = 0.07 + 0.63 = 0.70
- Normalizing (`softmax`): The raw scores `[0.98, 0.70, 0.70]` are identical to before.
  - Attention weights: `[0.4, 0.3, 0.3]`. The percentages are the same, but they now apply to a different set of tokens!
- Aggregating (`...V`):
  - New_Vector(crane) = 0.4*V(crane) + 0.3*V(lifted) + 0.3*V(steel)
  - New_Vector(crane) = 0.4*[0.5, 0.5] + 0.3*[0.1, 0.9] + 0.3*[0.2, 0.8]
  - New_Vector(crane) = [0.20, 0.20] + [0.03, 0.27] + [0.06, 0.24] = [0.29, 0.71]
The result is a vector now heavily skewed towards Dimension 2 (Machine). The exact same initial "crane" vector has been transformed into a completely different, context-aware vector because it listened to different dominant neighbors. Now that the intuition is solid, we can finally implement it with matrices
5. Scaled Dot-Product Attention
We have the vectors and the intuition for how context gives them meaning; now we need to translate this into the language of linear algebra for efficient matrix operations in PyTorch, by implementing the core attention formula and encapsulating it into a reusable nn.Module. We do it in two parts
1. Building raw tensors to see every number
Take the sentence "A crane ate fish". We now have 4 tokens (T=4) and our toy embedding dimension is 2 (C=2). We'll process one sentence at a time (B=1)
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass
B, T, C = 1, 4, 2 # Batch, Time (sequence length), Channels (embedding dim)
x = torch.tensor([
    [[0.1, 0.1],   # A
     [1.0, 0.2],   # crane (mostly object, slightly action)
     [0.1, 0.9],   # ate (mostly action)
     [0.8, 0.0]]   # fish (purely object)
]).float()

The input holds raw, context-free embeddings; this is the tensor that comes out of our embedding layers. Dim 1 = "Object-like", Dim 2 = "Action-like"
- Projecting `x` into Q, K, and V
  - To get our Query, Key, and Value vectors, we use learnable linear transformations. These `nn.Linear` layers are the "brains" of the operation; their weights are updated during training
# The learnable components
q_proj = nn.Linear(C, C, bias=False)
k_proj = nn.Linear(C, C, bias=False)
v_proj = nn.Linear(C, C, bias=False)
# Manually set weights for this tutorial
torch.manual_seed(42)
q_proj.weight.data = torch.randn(C, C)
k_proj.weight.data = torch.randn(C, C)
v_proj.weight.data = torch.randn(C, C)
# --- Perform the projections ---
q = q_proj(x)
k = k_proj(x)
v = v_proj(x)

Tracking tensor shapes and their meaning:
| Variable | Shape (B, T, C) | Meaning |
|---|---|---|
| `x` | (1, 4, 2) | The batch of raw input vectors. |
| `q` | (1, 4, 2) | The "Query" vector for each of the 4 tokens. |
| `k` | (1, 4, 2) | The "Key" vector for each of the 4 tokens. |
| `v` | (1, 4, 2) | The "Value" vector for each of the 4 tokens. |
- Calculate Attention Scores (`q @ k.transpose`)
  - This is the core of the communication. We need to compute the dot product of every token's query with every other token's key. We can do this with a single, efficient matrix multiplication
  - `q` has shape `(1, 4, 2)`; `k` has shape `(1, 4, 2)`
  - To multiply them, we need to make their inner dimensions match. We use `.transpose(-2, -1)` to swap the last two dimensions of `k`
  - `k.transpose(-2, -1)` results in a shape of `(1, 2, 4)`
  - The multiplication is `(1, 4, 2) @ (1, 2, 4)`, which results in a `(1, 4, 4)` matrix
# --- Score Calculation ---
scores = q @ k.transpose(-2, -1)
print("--- Raw Scores (Attention Matrix) ---")
print(scores.shape)
print(scores)

Output:
--- Raw Scores (Attention Matrix) ---
torch.Size([1, 4, 4])
tensor([[[ 0.0531, 0.4137, 0.1802, 0.2721], # "A" scores for (A, crane, ate, fish)
[ 0.1782, 1.3888, 0.6053, 0.9101], # "crane" scores for (A, crane, ate, fish)
[ 0.0618, 0.4815, 0.2098, 0.3151], # "ate" scores for (A, crane, ate, fish)
[ 0.1260, 0.9822, 0.4280, 0.6433]]]) # "fish" scores for (A, crane, ate, fish)

This (4, 4) matrix holds the raw compatibility scores. For example, the query for "crane" (row 1) has the highest compatibility with the key for "crane" (column 1), which is 1.3888
- Scale and SoftMax
  - We scale the scores for stability, then use `softmax` to turn them into attention weights that sum to 1 for each row
d_k = k.size(-1)
scaled_scores = scores / math.sqrt(d_k)
attention_weights = F.softmax(scaled_scores, dim=-1)  # Softmax along the rows

- Aggregate the Values (`attention_weights @ v`)
  - Now we use our weights to create a weighted average of the `Value` vectors
  - `attention_weights` has shape `(1, 4, 4)`; `v` has shape `(1, 4, 2)`
  - The multiplication `(1, 4, 4) @ (1, 4, 2)` produces a final tensor of shape `(1, 4, 2)`
# --- Value Aggregation ---
output = attention_weights @ v
print("\n--- Final Output (Context-Aware Vectors) ---")
print(output.shape)
print(output)

Output:
--- Final Output (Context-Aware Vectors) ---
torch.Size([1, 4, 2])
tensor([[[ 0.0652, -0.1691],
[ 0.1147, -0.2974],
[ 0.0768, -0.1991],
[ 0.1005, -0.2607]]])

Gist of the tensor transformations done above:
| Step | Operation | Input Shapes | Output Shape (B, T, ...) | Meaning |
|---|---|---|---|---|
| 1 | q_proj(x) etc. | (1, 4, 2) | (1, 4, 2) | Create Q, K, V for each token |
| 2 | q @ k.T | (1, 4, 2) & (1, 2, 4) | (1, 4, 4) | Raw compatibility scores |
| 3 | / sqrt(d_k) | (1, 4, 4) | (1, 4, 4) | Stabilized scores |
| 4 | softmax | (1, 4, 4) | (1, 4, 4) | Attention probabilities |
| 5 | att @ v | (1, 4, 4) & (1, 4, 2) | (1, 4, 2) | Context-aware output vectors |
Now we have taken our raw input x and produced a new tensor output of the exact same shape, where each token's vector has been updated with information from its neighbors
2. Encapsulating the Logic in an nn.Module
Here is the complete, encapsulated code for a single attention head
class SingleHeadSelfAttention(nn.Module):
    def __init__(self, config):
        """
        Initializes the layers needed for self-attention.
        """
        super().__init__()
        # The single, fused linear layer for Q, K, V
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)

    def forward(self, x):
        """
        Defines the data flow through the module.
        Input x shape: (B, T, C)
        """
        B, T, C = x.size()
        # 1. Get Q, K, V from a single projection and split them
        qkv = self.c_attn(x)
        q, k, v = qkv.split(C, dim=2)
        # 2. Calculate attention weights
        # (B, T, C) @ (B, C, T) -> (B, T, T)
        scaled_scores = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        attention_weights = F.softmax(scaled_scores, dim=-1)
        # 3. Aggregate values
        # (B, T, T) @ (B, T, C) -> (B, T, C)
        output = attention_weights @ v
        return output

The __init__ method sets up the building blocks. Here, we only need one:
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)

In the manual walkthrough we used three separate nn.Linear layers. This single fused line is a common and highly efficient optimization that achieves the same goal
| Our Manual Walkthrough (Conceptually Clear) | Fused Layer (Computationally Efficient) |
|---|---|
| `q_proj = nn.Linear(C, C)` | |
| `k_proj = nn.Linear(C, C)` | `c_attn = nn.Linear(C, 3*C)` |
| `v_proj = nn.Linear(C, C)` | |
Instead of three smaller matrix multiplications, the GPU can perform one larger, faster matrix multiplication. The bias=False argument is a common simplification used in minimal implementations like NanoGPT. Note that the original GPT-2 implementation does include biases in its linear projections
The Forward Method
- Projection and Splitting
qkv = self.c_attn(x)
q, k, v = qkv.split(C, dim=2)

  - `self.c_attn(x)`: We pass our input `x` (shape `B, T, C`) through the fused layer, resulting in a `qkv` tensor of shape `(B, T, 3*C)`
  - `qkv.split(C, dim=2)`: This is the clever part. The `.split()` function carves up the tensor. We tell it: "Along dimension 2 (the last dimension), create chunks of size C." Since the total dimension is `3*C`, this gives us exactly three tensors, each with the desired shape of `(B, T, C)`, which we assign to `q`, `k`, and `v`
- Calculating Attention Weights
scaled_scores = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
attention_weights = F.softmax(scaled_scores, dim=-1)

This is a direct, one-to-one implementation of the mathematical formula.
  - `k.transpose(-2, -1)` swaps the `T` and `C` dimensions of the Key tensor to prepare for matrix multiplication
  - `q @ ...` performs the dot product, resulting in the raw score matrix of shape `(B, T, T)`
  - `/ math.sqrt(k.size(-1))` performs the scaling for stability
  - `F.softmax(...)` converts the raw scores into a probability distribution along each row
- Aggregating Values
output = attention_weights @ v

Finally, we perform the last matrix multiplication. The attention weights `(B, T, T)` are multiplied with the Value vectors `(B, T, C)`, resulting in our final output tensor of shape `(B, T, C)`
Proof of Equivalence:
To prove this class is identical to our manual work, we can instantiate it and manually load the weights from our q_proj, k_proj, and v_proj layers into the single c_attn layer
@dataclass
class GPTConfig: n_embd: int
model = SingleHeadSelfAttention(GPTConfig(n_embd=C))
# The c_attn layer's weight matrix is shape (3*C, C). Our separate weights
# are each (C, C). We concatenate them along dim=0 to get (3*C, C).
model.c_attn.weight.data = torch.cat(
    [q_proj.weight.data, k_proj.weight.data, v_proj.weight.data], dim=0
)
# Run the model
model_output = model(x)
# 'output' is the tensor from our manual walkthrough in Part 1
print("Are the outputs the same?", torch.allclose(output, model_output))

Output:
Are the outputs the same? True
However, our model has a flaw for language generation: tokens can see into the future. Our current attention matrix allows this. We can fix this by adding a causal mask
6. Causal Masking
At this point we have an attention mechanism that allows tokens to communicate, but there is a flaw: a token can also communicate with tokens that will be generated in the future. We are building an autoregressive model, which generates text one token at a time, so the next output token should depend only on the tokens that have already been generated, never on the tokens that come after the current one

For example, in the previous output
tensor([[[0.37, 0.32, 0.31, ...], # "A" attends to all 4 tokens
[0.31, 0.37, 0.32, ...], # "crane" attends to all 4 tokens
[0.36, 0.31, 0.33, ...], # "ate" attends to all 4 tokens
... # "fish" attends to all 4 tokens
]]])

we can see that the token "A" is gathering information from "crane", "ate" and "fish", which should not happen. To solve this problem we use a causal mask: we modify the attention score matrix before applying the softmax, "masking out" all the future positions by setting their scores to negative infinity (-inf)
We use -inf because the softmax function involves an exponential, e^x / sum(e^x), and the exponential of negative infinity, e^-inf, is effectively zero. This forces the attention weights for all future tokens to become 0, preventing any information flow
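A tiny sketch (toy scores assumed) of why setting masked positions to -inf works:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([0.98, 0.70, float("-inf"), float("-inf")])  # last two positions are "in the future"
weights = F.softmax(scores, dim=-1)
print(weights)  # ≈ tensor([0.57, 0.43, 0.00, 0.00]) -> masked positions get exactly zero weight
```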
1. Applying Operations on Raw Tensors
Previously we got vectors like this
# This is the scaled_scores tensor from the end of the last chapter
# Shape (B, T, T) -> (1, 4, 4)
scaled_scores = torch.tensor([[
    [ 0.0375, 0.2925, 0.1274, 0.1924],
    [ 0.1260, 0.9822, 0.4280, 0.6433],
    [ 0.0437, 0.3405, 0.1484, 0.2228],
    [ 0.0891, 0.6945, 0.3023, 0.4549]
]])

- Creating the Mask
  - We need a mask that allows a token to see itself and the past, but not the future. A lower-triangular matrix is perfect for this. We can create one easily with `torch.tril`
# T=4 for our sentence "A crane ate fish"
T = 4
mask = torch.tril(torch.ones(T, T))
print("--- The Mask ---")
print(mask)

Output:
--- The Mask ---
tensor([[1., 0., 0., 0.],
[1., 1., 0., 0.],
[1., 1., 1., 0.],
[1., 1., 1., 1.]])

- Row 0 ("A") can only see column 0 ("A")
- Row 1 (“crane”) can see column 0 (“A”) and 1 (“crane”)
- And so on. The zeros in the upper-right triangle represent the “future” connections that we must block
- Applying the Mask
  - We use the PyTorch function `masked_fill` to apply our mask. It replaces all values in `scaled_scores` with `-inf` wherever the corresponding position in our `mask` is `0`
masked_scores = scaled_scores.masked_fill(mask == 0, float('-inf'))
print("\n--- Scores After Masking ---")
print(masked_scores)

Output:
--- Scores After Masking ---
tensor([[[ 0.0375, -inf, -inf, -inf],
[ 0.1260, 0.9822, -inf, -inf],
[ 0.0437, 0.3405, 0.1484, -inf],
[ 0.0891, 0.6945, 0.3023, 0.4549]]])

- Running the softmax again
attention_weights = F.softmax(masked_scores, dim=-1)
print("\n--- Final Causal Attention Weights ---")
print(attention_weights.data.round(decimals=2))

Output:
--- Final Causal Attention Weights ---
tensor([[[1.0000, 0.0000, 0.0000, 0.0000],
[0.2995, 0.7005, 0.0000, 0.0000],
[0.3129, 0.3807, 0.3064, 0.0000],
[0.2186, 0.3999, 0.2445, 0.1370]]])

From this output we can derive that:
- “A” can only attend to itself (100%)
- “crane” attends to “A” (30%) and “crane” (70%)
- “ate” attends to “A”, “crane”, and “ate”. Information can now only flow from the past to the present
| Attention Type | ”crane” attends to “fish”? | “ate” attends to “fish”? |
|---|---|---|
| Unmasked (Ch 5) | Yes | Yes |
| Causal (Ch 6) | No (0%) | No (0%) |
2. Encapsulating in the nn.Module
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        # ... (c_attn layer from before)
        # We register the mask as a "buffer"
        self.register_buffer(
            "bias",  # name of the buffer
            torch.tril(torch.ones(config.block_size, config.block_size))
            .view(1, 1, config.block_size, config.block_size)
        )

In the __init__ method we need to store our mask as part of the module, and we use register_buffer for this
Q Why register_buffer?
A A buffer is a tensor that is part of the model’s state (like weights), so it gets moved to the GPU with .to(device). However, it is not a parameter that gets updated by the optimizer during training
The .view(1, 1, ...) part is to add extra dimensions for broadcasting, which will be essential for Multi-Head Attention
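A short sketch (toy sizes assumed) of that broadcasting: the sliced (1, 1, T, T) mask lines up against a (B, n_head, T, T) score tensor, so one stored mask serves every batch element and every head:

```python
import torch

block_size, B, n_head, T = 8, 2, 3, 5   # toy sizes
bias = torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)

att = torch.randn(B, n_head, T, T)      # pretend these are raw attention scores for all heads
masked = att.masked_fill(bias[:, :, :T, :T] == 0, float("-inf"))
print(masked.shape)                     # torch.Size([2, 3, 5, 5])
print(masked[0, 0])                     # upper triangle is -inf, for every batch element and head
```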
We add this masking step to the forward() function
def forward(self, x):
    B, T, C = x.size()
    # ... (get q, k, v as before)
    scaled_scores = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
    # --- THE NEW LINE ---
    # We slice the stored mask to match the sequence length T of our input
    scaled_scores = scaled_scores.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
    attention_weights = F.softmax(scaled_scores, dim=-1)
    output = attention_weights @ v
    return output

So far so good: we have context-aware vectors whose meaning depends only on the tokens generated so far. One drawback remains, though: only one "conversation" can be held at a time, because a single attention mechanism has to do all of the essential work. To solve this we use Multi-Head Attention
7. Multi-Head Attention
What if, instead of one overworked attention mechanism, we could have several working in parallel? This is the core idea of Multi-Head Attention. We will split our embedding dimension C into smaller chunks, called “heads”. Each head will be its own independent attention mechanism, complete with its own Q, K, and V projections
- Head 1 might learn to focus on verb-object relationships
- Head 2 might learn to focus on which pronouns refer to which nouns
- Head 3 might learn to track long-range dependencies in the text
- …and so on
Each head conducts its own “conversation” and produces its own context-aware output vector. At the end, we simply concatenate the results from all the heads and pass them through a final linear layer to combine the insights
1. Applying Operations on Raw Tensor
Let’s start with our q, k, and v tensors from Step 5
- Shape: `(B, T, C)` → `(1, 4, 768)` (let's use a more realistic `C` for this example)
- `n_head`: Let's say we want `12` attention heads
- `head_dim`: The dimension of each head will be `C / n_head`, which is `768 / 12 = 64`
- Splitting `C` into `n_head` and `head_dim`: Our current `q` tensor has shape `(1, 4, 768)`. We need to reshape it so that the 12 heads are explicit. The target shape is `(B, n_head, T, head_dim)`, or `(1, 12, 4, 64)`. This is done with a sequence of `view()` and `transpose()` operations
# --- Configuration ---
B, T, C = 1, 4, 768
n_head = 12
head_dim = C // n_head # 768 // 12 = 64
# --- Dummy Q, K, V tensors with realistic shapes ---
q = torch.randn(B, T, C)
k = torch.randn(B, T, C)
v = torch.randn(B, T, C)
# --- Reshaping Q ---
# 1. Start with q: (B, T, C) -> (1, 4, 768)
# 2. Reshape to add the n_head dimension
q_reshaped = q.view(B, T, n_head, head_dim) # (1, 4, 12, 64)
# 3. Transpose to bring n_head to the front
q_final = q_reshaped.transpose(1, 2) # (1, 12, 4, 64)
print("Original Q shape:", q.shape)
print("Final reshaped Q shape:", q_final.shape)Output:
Original Q shape: torch.Size([1, 4, 768])
Final reshaped Q shape: torch.Size([1, 12, 4, 64])
We do the exact same reshaping for k and v. Now, PyTorch’s broadcasting capabilities will treat the n_head dimension as a new “batch” dimension. All our subsequent attention calculations will be performed independently for all 12 heads at once
- Run Attention in Parallel: Our attention formula remains the same, but now it operates on tensors with an extra `n_head` dimension
# Reshape k and v as well
k_final = k.view(B, T, n_head, head_dim).transpose(1, 2) # (1, 12, 4, 64)
v_final = v.view(B, T, n_head, head_dim).transpose(1, 2) # (1, 12, 4, 64)
# --- Attention Calculation ---
# (B, nh, T, hd) @ (B, nh, hd, T) -> (B, nh, T, T)
scaled_scores = (q_final @ k_final.transpose(-2, -1)) / math.sqrt(head_dim)
# (We would apply the causal mask here)
attention_weights = F.softmax(scaled_scores, dim=-1)
# (B, nh, T, T) @ (B, nh, T, hd) -> (B, nh, T, hd)
output_per_head = attention_weights @ v_final
print("Shape of output from each head:", output_per_head.shape)Output:
Shape of output from each head: torch.Size([1, 12, 4, 64])
We now have a (64-dimensional) output vector for each of our 4 tokens, from each of our 12 heads
- Merging the Heads: The last step is to combine the insights from all 12 heads. We do this by reversing the reshape operation: we concatenate the heads back together into a single `C`-dimensional vector and then pass it through a final linear projection layer (`c_proj`)
# 1. Transpose and reshape to merge the heads back together
# (B, nh, T, hd) -> (B, T, nh, hd)
merged_output = output_per_head.transpose(1, 2).contiguous()
# The .contiguous() is needed because transpose can mess with memory layout.
# It creates a new tensor with the elements in the correct memory order.
# (B, T, nh, hd) -> (B, T, C)
merged_output = merged_output.view(B, T, C)
print("Shape of merged output:", merged_output.shape)
# 2. Pass through the final projection layer
c_proj = nn.Linear(C, C)
final_output = c_proj(merged_output)
print("Shape of final output:", final_output.shape)Output:
Shape of merged output: torch.Size([1, 4, 768])
Shape of final output: torch.Size([1, 4, 768])
We have successfully returned to our original (B, T, C) shape. Each token’s vector now contains the combined, context-aware information from all 12 attention heads
| Component | Shape Transformation | Purpose |
|---|---|---|
| Split Heads | (B, T, C) -> (B, nh, T, hd) | Prepare for parallel computation |
| Attention | (B, nh, T, hd) -> (B, nh, T, hd) | Each head computes context independently |
| Merge Heads | (B, nh, T, hd) -> (B, T, C) | Combine the insights from all heads |
| Final Projection | (B, T, C) -> (B, T, C) | Mix the combined information |
2. Encapsulating in the nn.Module
Looking back at the CausalSelfAttention class from the full GPT-2 code, we can see the pieces this requires.

In the __init__ method, we add the c_proj layer and an assertion to ensure the dimensions are compatible
class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # ... (c_attn and bias buffer from before)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=True)

The forward method contains the full implementation:
def forward(self, x):
    B, T, C = x.size()
    # 1. Get QKV and split into heads
    qkv = self.c_attn(x)
    q, k, v = qkv.split(self.n_embd, dim=2)
    head_dim = C // self.n_head
    q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
    k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
    v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
    # 2. Run causal self-attention on each head
    att = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)
    att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
    att = F.softmax(att, dim=-1)
    y = att @ v
    # 3. Merge heads and project
    y = y.transpose(1, 2).contiguous().view(B, T, C)
    y = self.c_proj(y)
    return y

Almost everything is in place now. We just need to add the "thinking" layer (the MLP) and then stack these blocks together
8. MLP(Multi-Layer Perceptron) or FFN(Position-wise Feed-Forward Network) - Thinking Layer
While the attention layer is the "communication" layer of the Transformer, letting tokens gather and aggregate information from their context, the MLP is the "thinking" layer. After each token has collected the context it needs, it needs time to "think" about it and process this new, context-rich information. Here is the MLP code
class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.proj = nn.Linear(4 * config.n_embd, config.n_embd)
        self.drop = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.fc(x)
        x = F.gelu(x)  # GPT-2 uses GELU
        x = self.drop(self.proj(x))
        return x

While the attention layer allows tokens to interact with each other, the MLP processes the information for each token independently
The MLP in a Transformer has a standard two-layer architecture:
- Expansion Layer (`fc`): The first linear layer takes the input vector of size `n_embd` and projects it up to a much larger, intermediate dimension, typically `4 * n_embd`
- Non-Linearity (`gelu`): An activation function is applied. GPT-2 uses GELU (Gaussian Error Linear Unit), which is a smooth alternative to the more common ReLU. This is what allows the network to learn complex, non-linear functions
- Contraction Layer (`proj`): The second linear layer projects the large intermediate vector back down to the original `n_embd` dimension
- Dropout (`drop`): A dropout layer is applied for regularization to prevent overfitting
1. Understanding nn.Linear
It’s just a matrix multiplication followed by the addition of a bias vector
The Math: output = input @ W^T + b. For each output element, the layer calculates a weighted sum of all input elements and adds a bias. For example,
import torch
import torch.nn as nn
C_in = 2
C_out = 4
linear_layer = nn.Linear(C_in, C_out)

What are the learnable parameters? This layer has two sets of learnable parameters that are updated during training:
- Weights (`.weight`): A matrix of shape `(C_out, C_in)`. For us, this is `(4, 2)`. Total weights: `4 * 2 = 8`
- Biases (`.bias`): A vector of shape `(C_out)`. For us, this is `(4)`. Total biases: `4`
Let’s take a hard-coded example to understand
# Manually set the weights
linear_layer.weight.data = torch.tensor([
[1., 0.], # Weights for output element 0
[-1., 0.], # Weights for output element 1
[0., 2.], # Weights for output element 2
[0., -2.] # Weights for output element 3
])
# Manually set the biases
linear_layer.bias.data = torch.tensor([1., 1., -1., -1.])

Now let's pass a single vector through it
# Our input vector
input_vector = torch.tensor([0.5, -0.5])
# The forward pass
output_vector = linear_layer(input_vector)

- output[0] = (input[0] * weight[0,0]) + (input[1] * weight[0,1]) + bias[0]
- output[0] = (0.5 * 1.0) + (-0.5 * 0.0) + 1.0
- output[0] = 0.5 + 0.0 + 1.0 = 1.5
The result would be
print("Input vector:", input_vector)
print("Output vector:", output_vector)Output:
Input vector: tensor([ 0.5000, -0.5000])
Output vector: tensor([ 1.5000, 0.5000, -2.0000, 0.0000], grad_fn=<AddBackward0>)

The output matches our manual calculation for the first element. The nn.Linear layer simply performs this weighted sum for each of the 4 output elements
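The same result can be reproduced by writing the formula output = input @ W^T + b directly (a quick check, reusing the layer and vector defined above):

```python
# Manual matrix form of the same computation
manual_output = input_vector @ linear_layer.weight.T + linear_layer.bias
print(manual_output)                                 # tensor([ 1.5000,  0.5000, -2.0000,  0.0000], ...)
print(torch.allclose(manual_output, output_vector))  # True
```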
2. Full MLP Walkthrough
Let’s trace a single token’s vector through the entire MLP forward pass. The MLP acts on each token independently, so we only need to look at one vector to understand the whole process
- We'll use a tiny embedding dimension `C=2`
- The MLP will expand this to an intermediate dimension of `4*C = 8`
- Our input `x` will be the vector for a single token (T=1), in a batch of one (B=1)
# Our input vector for one token. Shape (B, T, C) -> (1, 1, 2)
x = torch.tensor([[[0.5, -0.5]]])

- The Expansion Layer (`fc`): This is an `nn.Linear` layer that projects from `C=2` to `4*C=8`
# Create the layer
fc = nn.Linear(2, 8)
# Manually set its weights and biases for a clear example
fc.weight.data = torch.randn(8, 2) * 2 # Scale up for more interesting GELU results
fc.bias.data = torch.ones(8) # Set all biases to 1
# --- Pass the input through the layer ---
x_expanded = fc(x)
print("--- After Expansion Layer ---")
print("Shape:", x_expanded.shape)
print("Values:\n", x_expanded.data.round(decimals=2))Output:
--- After Expansion Layer ---
Shape: torch.Size([1, 1, 8])
Values:
tensor([[[ 2.4000, -0.5000, 1.8800, -1.9100, 2.0800, 1.1600, 0.4100, -2.1200]]])

The 2-dimensional vector has been successfully expanded to an 8-dimensional one
- The GELU Activation: Next, we apply the non-linear GELU activation function. Intuitively, GELU is a smoother version of ReLU. It squashes negative values towards zero but allows a small amount of negative signal to pass through. Positive values are largely left unchanged
| Input | GELU(Input) |
|---|---|
| 2.4 | ~2.39 |
| 1.0 | ~0.84 |
| 0.0 | 0.0 |
| -0.5 | ~ -0.15 |
| -2.0 | ~ -0.00 |
Applying the same to expanded vector
import torch.nn.functional as F
# --- Apply GELU ---
x_activated = F.gelu(x_expanded)
print("\n--- After GELU Activation ---")
print("Shape:", x_activated.shape)
print("Values:\n", x_activated.data.round(decimals=2))Output:
--- After GELU Activation ---
Shape: torch.Size([1, 1, 8])
Values:
tensor([[[ 2.3900, -0.1500, 1.8700, -0.0100, 2.0600, 1.0300, 0.3100, -0.0000]]])

As expected, the large positive values (2.40, 1.88) are almost untouched, while the large negative values (-1.91, -2.12) are squashed to nearly zero. This non-linear step is essential for the model to learn complex patterns
3. The Contraction Layer (proj): Now, we project the 8-dimensional activated vector back down to our original C=2 dimension.
# Create the layer
proj = nn.Linear(8, 2)
# Manually set its weights and biases
proj.weight.data = torch.randn(2, 8)
proj.bias.data = torch.zeros(2) # No bias for simplicity
# --- Pass the activated vector through the layer ---
x_projected = proj(x_activated)
print("\n--- After Contraction Layer ---")
print("Shape:", x_projected.shape)
print("Values:\n", x_projected.data.round(decimals=2))Output:
--- After Contraction Layer ---
Shape: torch.Size([1, 1, 2])
Values:
tensor([[[ 1.0900, -1.3800]]])

We are back to our original shape of (1, 1, 2)
4. Dropout: The final step in the MLP is dropout
drop = nn.Dropout(0.1)
final_output = drop(x_projected)

During training, this layer would randomly set 10% of the elements in x_projected to zero. This is a regularization technique that helps prevent the model from becoming too reliant on any single feature. During inference/evaluation (when we call model.eval()), the dropout layer does nothing and simply passes the data through unchanged. For our numerical example, we can assume it does nothing
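A small sketch of that train/eval difference (illustrative only; which elements get zeroed is random):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(0.5)   # exaggerated rate so the effect is visible
x = torch.ones(1, 8)

drop.train()             # training mode: elements are randomly zeroed, survivors scaled by 1/(1-p)
print(drop(x))           # e.g. tensor([[2., 0., 2., 2., 0., ...]]) -- pattern depends on the seed

drop.eval()              # evaluation mode: dropout is a no-op
print(drop(x))           # tensor([[1., 1., 1., 1., 1., 1., 1., 1.]])
```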
The Final Result Our initial input vector [[[0.5, -0.5]]] has been transformed by the MLP into [[[ 1.09, -1.38]]]. This new vector, which has undergone a non-linear “thinking” process, is now ready for the next stage
The key takeaway is that the MLP transforms the input vector while preserving its shape (B, T, C). This is critical, as it allows us to add this output back to the original input (a “residual connection”) and to stack multiple Transformer Blocks on top of each other
We have now built both major components of our Transformer block: CausalSelfAttention (communication) and MLP (thinking). The final step is to assemble them into a complete Block
9. Residual Connections
We now have the "communication" layer, where tokens exchange information, and the "thinking" layer, where each token processes the information it has gathered. Next we need to connect the two into a repeatable Block, which introduces the architectural glue that makes deep learning possible: the residual connection. Here is the code
class Block(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        # We will discuss these LayerNorm layers in the next chapter
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        """
        The forward pass of a single Transformer Block.
        """
        # --- This is our focus: the addition operation ---
        # The output of the attention layer is ADDED to the original input 'x'.
        x = x + self.attn(self.ln_1(x))
        # --- And this one too ---
        # The output of the MLP is ADDED to the result of the first step.
        x = x + self.mlp(self.ln_2(x))
        return x

Q Why can't we just stack the layers sequentially? Why do we need to add their outputs back into the input?
A A natural first instinct when building a deep model is to just stack layers sequentially: x -> layer1 -> layer2 -> layer3 -> .... However, when networks get very deep (e.g., more than a dozen layers), this simple approach often fails
The reason is a phenomenon called the vanishing gradient problem. During training, the learning signal (the gradient) must travel backward from the final output all the way to the first layer’s weights. With each step backward through a layer, this signal is multiplied by the layer’s weights. In many cases, this causes the signal to shrink exponentially. By the time it reaches the early layers, it’s so vanishingly small that those layers barely learn at all
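A back-of-the-envelope illustration of that shrinkage (assumed numbers, not a real gradient computation): if each of 12 layers scales the backward signal by a factor of about 0.5, the early layers receive almost nothing:

```python
# Toy illustration of exponential shrinkage across layers
signal = 1.0
per_layer_factor = 0.5            # assumed average scaling applied at each layer during backprop
for layer in range(12):
    signal *= per_layer_factor
print(signal)                     # 0.000244... -> the learning signal is ~4000x smaller after 12 layers
```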
The Solution: The Residual “Express Lane”
The residual connection provides an elegant solution by creating a “shortcut” or an “express lane” for the data and, more importantly, for the gradient
graph TD
    subgraph Attention Sub-Layer
        B(LayerNorm) --> C(CausalSelfAttention)
    end
    A[Input x] --> B
    C --> D["(+)"]
    A --"Residual Connection (Express Lane)"--> D
    D --> E[Output]
By adding the original input x directly to the output of the sub-layer (self.attn(...)), we create an uninterrupted highway. During backpropagation, the gradient can flow directly through this addition operator, completely bypassing the complex transformations inside the attn layer
This changes the learning objective. The network no longer needs to learn the entire, complex transformation from scratch. Instead, the attn layer only needs to learn the residual—the difference, or “delta,” that should be applied to the input
Intuition: Imagine you’re teaching a painter
- Without Residuals (Hard): “Here is a blank canvas. Paint a masterpiece.”
- With Residuals (Easy): "Here is the current painting (`x`). Just make these small, incremental adjustments (`attn(self.ln_1(x))`)"
The final result is x + attn(self.ln_1(x)). It is much easier for a network to learn how to make small, iterative adjustments than it is to learn the entire transformation at every single layer
Walkthrough with Numbers
The operation is a simple element-wise addition. Let’s focus on a single token for clarity (B=1, T=1) with an embedding dimension of C=4
import torch
# Our input vector for a single token, 'x' at the start of the forward pass
x_initial = torch.tensor([[[0.2, 0.1, 0.3, 0.4]]])
print("Original input x:\n", x_initial)
# Let's pretend this is the output of `self.attn(self.ln_1(x))`.
# It represents the "change" or "adjustment" to be made.
attention_output = torch.tensor([[[0.1, -0.1, 0.2, -0.3]]])
print("\nOutput from the Attention sub-layer (the 'adjustment'):\n", attention_output)
# The residual connection is the first line of the forward pass: x = x + ...
x_after_attn = x_initial + attention_output
print("\nValue of x after the first residual connection:\n", x_after_attn)Output:
Original input x:
tensor([[[0.2000, 0.1000, 0.3000, 0.4000]]])
Output from the Attention sub-layer (the 'adjustment'):
tensor([[[ 0.1000, -0.1000, 0.2000, -0.3000]]])
Value of x after the first residual connection:
tensor([[[0.3000, 0.0000, 0.5000, 0.1000]]])

The output of the attention sub-layer is just an update to the original vector. The shape of the tensor remains unchanged, which is a critical property
| Step in forward | Operation | Input Shape | Output Shape | Meaning |
|---|---|---|---|---|
| 1 | self.attn(self.ln_1(x)) | (B, T, C) | (B, T, C) | Calculate the update/residual |
| 2 | x + ... | (B, T, C) | (B, T, C) | Apply the update to the original input |
We have now added the first piece of “glue” to our block. This express lane allows us to build much deeper and more powerful models. The next piece of glue we need is a stabilizer to keep the data flowing smoothly on this highway: Layer Normalization
10. Layer Normalization
We need a “stabilizer” to ensure the data flowing through our network remains well-behaved thus we use Layer Normalization. Here is the code
class Block(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        # --- We define the LayerNorm layers here ---
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        """
        The forward pass of a single Transformer Block.
        """
        # --- LayerNorm is applied BEFORE the sub-layer ---
        x = x + self.attn(self.ln_1(x))
        # --- And here again ---
        x = x + self.mlp(self.ln_2(x))
        return x

As data flows through a deep network, the distribution of the activations at each layer is constantly changing during training. The mean and variance of the inputs to a given layer can shift wildly from one training batch to the next. This phenomenon is called Internal covariate shift
This makes training very difficult. It’s like trying to hit a moving target. Each layer has to constantly adapt to a new distribution of inputs from the layer before it, which can make the training process unstable and slow
The solution is Layer Normalization, a technique that forces the inputs to each sub-layer to have a consistent distribution. It acts as a stabilizer. For each individual token's vector in our (B, T, C) tensor, it performs the following steps independently:
- Calculates the mean (μ) and variance (σ²) across the `C` (embedding) dimension of that single vector
- Normalizes the vector: $$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
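A minimal sketch of those two steps, checked against PyTorch's nn.LayerNorm at initialization (where its learnable scale and shift do not yet change anything):

```python
import torch
import torch.nn as nn

C = 4
x = torch.tensor([[[0.2, 0.1, 0.3, 0.4]]])          # one token's vector, shape (B=1, T=1, C=4)

mu = x.mean(dim=-1, keepdim=True)                    # mean over the C dimension
var = x.var(dim=-1, keepdim=True, unbiased=False)    # variance over the C dimension
x_hat = (x - mu) / torch.sqrt(var + 1e-5)            # normalize

ln = nn.LayerNorm(C)                                 # freshly initialized LayerNorm
print(x_hat)
print(ln(x))                                         # same mean/variance normalization
print(torch.allclose(x_hat, ln(x), atol=1e-5))       # True
```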