Understanding Transformers with GPT-2 Code
import math
from dataclasses import dataclass
import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class GPTConfig:
    vocab_size: int
    block_size: int
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.1

class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head, self.n_embd = config.n_head, config.n_embd
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.resid_drop = nn.Dropout(config.dropout)
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size)).view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        head_dim = C // self.n_head
        q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(head_dim))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.resid_drop(self.c_proj(y))

class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.proj = nn.Linear(4 * config.n_embd, config.n_embd)
        self.drop = nn.Dropout(config.dropout)

    def forward(self, x):
        return self.drop(self.proj(F.gelu(self.fc(x))))

class Block(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

class GPT2(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)
        self.wpe = nn.Embedding(config.block_size, config.n_embd)
        self.drop = nn.Dropout(config.dropout)
        self.h = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight

    def forward(self, idx, targets=None):
        B, T = idx.size()
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device).unsqueeze(0)
        x = self.wte(idx) + self.wpe(pos)
        x = self.drop(x)
        for block in self.h:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1)) if targets is not None else None
        return logits, loss

Let's break down the code segment by segment and understand transformers
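Before we dissect each piece, here is a minimal sanity-check sketch of how the model above can be instantiated and called. The tiny sizes are illustrative assumptions, not GPT-2's real configuration:

```python
# Toy-sized instantiation of the GPT2 class defined above (sizes are assumptions for illustration)
config = GPTConfig(vocab_size=100, block_size=16, n_layer=2, n_head=2, n_embd=32)
model = GPT2(config)

idx = torch.randint(0, config.vocab_size, (1, 8))       # (B=1, T=8) random token IDs
targets = torch.randint(0, config.vocab_size, (1, 8))   # next-token targets, same shape

logits, loss = model(idx, targets)
print(logits.shape)  # torch.Size([1, 8, 100]) -> one score per vocabulary entry, per position
print(loss)          # a scalar cross-entropy loss
```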
1. Data Config
These are the knobs we turn to make the model bigger or smaller
@dataclass
class GPTConfig:
    vocab_size: int
    block_size: int       # max sequence length (context window)
    n_layer: int = 12     # Number of Transformer Blocks to stack
    n_head: int = 12      # Number of attention "heads"
    n_embd: int = 768     # The dimensionality of our vectors
    dropout: float = 0.1

| Field | Role | Meaning | GPT-2 value |
|---|---|---|---|
| `vocab_size` | Vocabulary size | How many unique words the model knows | 50,257 |
| `block_size` | Context window | How far back the model can "see" at once | 1,024 |
| `n_layer` | Model depth | How many blocks are stacked; more layers → more powerful | 12 |
| `n_head` | Model width | How many parallel "conversations" attention can have | 12 |
| `n_embd` | Embedding dimension | The "size" of the vectors representing each token | 768 |
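As a quick illustration, the GPT-2 small values from the table can be plugged straight into the dataclass (the two required fields have no defaults, so they must be passed explicitly):

```python
# GPT-2 small, expressed with the dataclass above
config = GPTConfig(vocab_size=50257, block_size=1024)
print(config)
# GPTConfig(vocab_size=50257, block_size=1024, n_layer=12, n_head=12, n_embd=768, dropout=0.1)
```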
2. Word-Vector Dictionary (Token Embeddings)
We convert the raw input (an array of token IDs) into vectors that can be processed by neural networks
class GPT2(nn.Module):
    def __init__(self, config):
        # We are building THIS line now.
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)  # Word Token Embedding
        self.wpe = nn.Embedding(...)  # (Next chapter)
        # The rest of the model
        self.h = nn.ModuleList(...)
        self.ln_f = nn.LayerNorm(...)
        self.lm_head = nn.Linear(...)
- The input to our model is a tensor of token IDs, like `torch.tensor([[5, 21]])`. These are categorical numbers: the ID 21 doesn't have 4.2 times the "value" of ID 5. The numerical distance between them is arbitrary and meaningless. A neural network, which relies on matrix multiplication and gradient descent, cannot learn from these raw IDs; they are just pointers
- Thus we need to convert each ID into a vector and map it into a vector space (in GPT-2's case, a 768-dimensional space), so the model has something it can actually compute with and learn from
- This conversion is just a lookup into a single weight matrix with one row per token. PyTorch exposes it as the built-in `nn.Embedding(vocab_size, n_embd)` layer
- The embedding matrix is an ordinary learnable parameter (`requires_grad=True`), so its vectors are updated during training
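A minimal sketch of that lookup, with toy sizes assumed (a 30-token vocabulary and 4-dimensional vectors rather than GPT-2's 50,257 and 768):

```python
import torch
import torch.nn as nn

wte = nn.Embedding(30, 4)            # 30 rows (one per token ID), each a learnable 4-d vector
print(wte.weight.shape)              # torch.Size([30, 4])
print(wte.weight.requires_grad)      # True -> the vectors are updated during training

idx = torch.tensor([[5, 21]])        # (B=1, T=2) token IDs
vectors = wte(idx)                   # simply looks up rows 5 and 21 of the weight matrix
print(vectors.shape)                 # torch.Size([1, 2, 4])
```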
One question still remains: how can a token's vector representation be adjusted based on its context? The meaning of the word "bank" is different in different sentences. Before the model can resolve that, it first needs a sense of order, and that is what positional embeddings provide
3. Positional Embeddings
Positional embeddings give the vectors a sense of order, so the model can tell sentences like "Dog bites man" and "Man bites dog" apart
class GPT2(nn.Module):
    def __init__(self, config):
        self.wte = nn.Embedding(...)  # (Done)
        # We are building THIS line now.
        self.wpe = nn.Embedding(config.block_size, config.n_embd)  # Positional Embedding
        self.h = nn.ModuleList(...)
        self.ln_f = nn.LayerNorm(...)
        self.lm_head = nn.Linear(...)

- Our current output is a tensor of shape `(B, T, C)`, where `C` is `n_embd`. For a sequence like `["Man", "bites", "dog"]`, the model receives a set of three vectors: `{vector("Man"), vector("bites"), vector("dog")}`. If we shuffled the input, the model would receive the exact same set of vectors, just in a different order along the `T` dimension. The core processing layers (the Transformer blocks) are designed to be order-invariant, so without modification, they would produce the same result. We need to explicitly "stamp" each token's vector with its position
- The solution used in GPT is wonderfully simple. Just as there is a unique vector for each word, we will also learn a unique vector for each position. To do this we just add one more `nn.Embedding` layer
  - We'll have a vector that means "I am at the 1st position"
  - We'll have another vector that means "I am at the 2nd position"
  - ...and so on, up to the maximum sequence length (`block_size`)
- From the combined vector, the model can still recover which word it is looking at (from the token embedding part) and can figure out the order with the help of the positional embeddings
- It works because the token and positional embeddings exist in the same high-dimensional space, and the model can learn to interpret the combined vector. During training, it learns to create positional vectors such that adding `vector(pos=N)` to `vector(word=W)` produces a unique representation that distinguishes it from the same word at a different position. The network learns to "understand" this composition
Q What if the input sequence is shorter than block_size?
A This is the normal case! For example, if `block_size` is 8 but the input sequence length `T` is only 5, the code `torch.arange(0, T, ...)` handles this perfectly. We only generate and look up the positional embeddings for the sequence length we are currently processing. We never use the full `block_size` unless our input is that long
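A small sketch of how the two embedding tables combine inside the forward pass (toy sizes assumed):

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 30, 8, 4   # toy values
wte = nn.Embedding(vocab_size, n_embd)      # token embeddings
wpe = nn.Embedding(block_size, n_embd)      # positional embeddings

idx = torch.tensor([[5, 21, 7, 3, 9]])      # (B=1, T=5), shorter than block_size
T = idx.size(1)
pos = torch.arange(0, T).unsqueeze(0)       # tensor([[0, 1, 2, 3, 4]])

x = wte(idx) + wpe(pos)                     # the same token at a different position now gets a different vector
print(x.shape)                              # torch.Size([1, 5, 4])
```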
4. Self Attention
We established that our model gives the same starting vector to a word regardless of its context. This is a problem for ambiguous words. Consider the word "crane": its meaning is different in "Crane ate a fish" and "Crane lifted steel", yet the starting vector is identical for both sentences. We need to update this vector based on its neighbors

The most important component in the Transformer is causal self-attention, and it is all governed by this one formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

This entire process happens in three steps: Scoring, Normalizing and Aggregating

- The variables `Q`, `K` and `V` are three distinct vectors for each single word, created by projecting the input vector `x`:
  - Query (Q): The word's "search query". What it's looking for
  - Key (K): The word's "label" or "keyword". What it is
  - Value (V): The word's "payload". The information it offers. We use this instead of `x` itself because the raw information is not always the best information to share; the `V` vector is a transformed version of `x`, specifically packaged for other tokens to consume
Let’s run the self attention on the above “Crane” example. For now let’s consider 2D space
- Dimension 1: Represents “Is it an Animal?”
- Dimension 2: Represents “Is it a Machine?”
The ambiguous word “crane” will have vectors balanced between these possibilities
| Token | Q - “I’m looking for…” | K - “I am…” | V - “I offer this info…“ |
|---|---|---|---|
| ate | … | [0.9, 0.1] (High Animal) | [0.9, 0.1] |
| fish | … | [0.9, 0.1] (High Animal) | [0.8, 0.2] |
| lifted | … | [0.1, 0.9] (High Machine) | [0.1, 0.9] |
| steel | … | [0.1, 0.9] (High Machine) | [0.2, 0.8] |
| crane | [0.7, 0.7] | [0.7, 0.7] | [0.5, 0.5] (Ambiguous) |
Sentence 1: "Crane ate fish"
- Scoring (`QK^T`): The `crane` token uses its query `[0.7, 0.7]` to probe all keys in the sentence.
  - Score(crane → crane): `[0.7, 0.7] ⋅ [0.7, 0.7]` = 0.49 + 0.49 = 0.98
  - Score(crane → ate): `[0.7, 0.7] ⋅ [0.9, 0.1]` = 0.63 + 0.07 = 0.70
  - Score(crane → fish): `[0.7, 0.7] ⋅ [0.9, 0.1]` = 0.63 + 0.07 = 0.70
- Normalizing (`softmax`): The raw scores `[0.98, 0.70, 0.70]` are converted to percentages.
  - Attention weights: `[0.4, 0.3, 0.3]`. This means `crane` will construct its new self by listening 40% to its original self, 30% to `ate`, and 30% to `fish`.
- Aggregating (`...V`): The new vector for `crane` is a weighted sum of the Values.
  - New_Vector(crane) = 0.4*V(crane) + 0.3*V(ate) + 0.3*V(fish)
  - New_Vector(crane) = 0.4*[0.5, 0.5] + 0.3*[0.9, 0.1] + 0.3*[0.8, 0.2]
  - New_Vector(crane) = [0.20, 0.20] + [0.27, 0.03] + [0.24, 0.06] = [0.71, 0.29]
The result is a new “crane” vector that is heavily skewed towards Dimension 1 (Animal). The context from ate and fish has resolved the ambiguity. It’s a bird
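If you want to check the Sentence 1 arithmetic yourself, here is a small sketch using the hand-picked toy vectors from the table (assumed values, not learned ones):

```python
import torch
import torch.nn.functional as F

q_crane = torch.tensor([0.7, 0.7])       # crane's query
K = torch.tensor([[0.7, 0.7],            # key of "crane"
                  [0.9, 0.1],            # key of "ate"
                  [0.9, 0.1]])           # key of "fish"
V = torch.tensor([[0.5, 0.5],            # value of "crane"
                  [0.9, 0.1],            # value of "ate"
                  [0.8, 0.2]])           # value of "fish"

scores = K @ q_crane                     # tensor([0.9800, 0.7000, 0.7000])
weights = F.softmax(scores, dim=0)       # ≈ tensor([0.40, 0.30, 0.30])
new_crane = weights @ V                  # ≈ tensor([0.71, 0.29]) -> skewed towards "Animal"
print(scores, weights, new_crane)
```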
Sentence 2: “Crane lifted steel”
- Scoring (`QK^T`): `crane` uses the exact same query `[0.7, 0.7]` on its new neighbors.
  - Score(crane → crane): `[0.7, 0.7] ⋅ [0.7, 0.7]` = 0.98
  - Score(crane → lifted): `[0.7, 0.7] ⋅ [0.1, 0.9]` = 0.07 + 0.63 = 0.70
  - Score(crane → steel): `[0.7, 0.7] ⋅ [0.1, 0.9]` = 0.07 + 0.63 = 0.70
- Normalizing (`softmax`): The raw scores `[0.98, 0.70, 0.70]` are identical to before.
  - Attention weights: `[0.4, 0.3, 0.3]`. The percentages are the same, but they now apply to a different set of tokens!
- Aggregating (`...V`):
  - New_Vector(crane) = 0.4*V(crane) + 0.3*V(lifted) + 0.3*V(steel)
  - New_Vector(crane) = 0.4*[0.5, 0.5] + 0.3*[0.1, 0.9] + 0.3*[0.2, 0.8]
  - New_Vector(crane) = [0.20, 0.20] + [0.03, 0.27] + [0.06, 0.24] = [0.29, 0.71]
The result is a vector now heavily skewed towards Dimension 2 (Machine). The exact same initial "crane" vector has been transformed into a completely different, context-aware vector because it listened to different dominant neighbors. Now that the intuition is solid, we can finally implement it with matrices
5. Scaled Dot-Product Attention
We have the vectors and the intuition for how context gives them meaning; now we need to translate this into the language of linear algebra for efficient matrix operations in PyTorch, by implementing the core attention formula and encapsulating it into a reusable nn.Module. We do it in two parts
1. Building raw tensors to see every number
Take the sentence "A crane ate fish". We now have 4 tokens (T=4) and our toy embedding dimension is 2 (C=2). We'll process one sentence at a time (B=1)
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass
B, T, C = 1, 4, 2 # Batch, Time (sequence length), Channels (embedding dim)
x = torch.tensor([
    [[0.1, 0.1],   # A
     [1.0, 0.2],   # crane (mostly object, slightly action)
     [0.1, 0.9],   # ate (mostly action)
     [0.8, 0.0]]   # fish (purely object)
]).float()

The input holds raw, context-free embeddings; this is the tensor that comes out of our embedding layers. Dim 1 = "Object-like", Dim 2 = "Action-like"
- Projecting `x` into Q, K, and V
  - To get our Query, Key, and Value vectors, we use learnable linear transformations. These `nn.Linear` layers are the "brains" of the operation; their weights are updated during training
# The learnable components
q_proj = nn.Linear(C, C, bias=False)
k_proj = nn.Linear(C, C, bias=False)
v_proj = nn.Linear(C, C, bias=False)
# Manually set weights for this tutorial
torch.manual_seed(42)
q_proj.weight.data = torch.randn(C, C)
k_proj.weight.data = torch.randn(C, C)
v_proj.weight.data = torch.randn(C, C)
# --- Perform the projections ---
q = q_proj(x)
k = k_proj(x)
v = v_proj(x)

Tracking tensor shapes and their meaning:
| Variable | Shape (B, T, C) | Meaning |
|---|---|---|
| `x` | (1, 4, 2) | The batch of raw input vectors. |
| `q` | (1, 4, 2) | The "Query" vector for each of the 4 tokens. |
| `k` | (1, 4, 2) | The "Key" vector for each of the 4 tokens. |
| `v` | (1, 4, 2) | The "Value" vector for each of the 4 tokens. |
- Calculate Attention Scores (`q @ k.transpose`)
  - This is the core of the communication. We need to compute the dot product of every token's query with every other token's key. We can do this with a single, efficient matrix multiplication
  - `q` has shape `(1, 4, 2)`; `k` has shape `(1, 4, 2)`
  - To multiply them, we need to make their inner dimensions match. We use `.transpose(-2, -1)` to swap the last two dimensions of `k`
  - `k.transpose(-2, -1)` results in a shape of `(1, 2, 4)`
  - The multiplication is `(1, 4, 2) @ (1, 2, 4)`, which results in a `(1, 4, 4)` matrix
# --- Score Calculation ---
scores = q @ k.transpose(-2, -1)
print("--- Raw Scores (Attention Matrix) ---")
print(scores.shape)
print(scores)

Output:
--- Raw Scores (Attention Matrix) ---
torch.Size([1, 4, 4])
tensor([[[ 0.0531, 0.4137, 0.1802, 0.2721], # "A" scores for (A, crane, ate, fish)
[ 0.1782, 1.3888, 0.6053, 0.9101], # "crane" scores for (A, crane, ate, fish)
[ 0.0618, 0.4815, 0.2098, 0.3151], # "ate" scores for (A, crane, ate, fish)
[ 0.1260, 0.9822, 0.4280, 0.6433]]]) # "fish" scores for (A, crane, ate, fish)

This (4, 4) matrix holds the raw compatibility scores. For example, the query for "crane" (row 1) has the highest compatibility with the key for "crane" (column 1), which is 1.3888
- Scale and SoftMax
  - We scale the scores for stability, then use `softmax` to turn them into attention weights that sum to 1 for each row
d_k = k.size(-1)
scaled_scores = scores / math.sqrt(d_k)
attention_weights = F.softmax(scaled_scores, dim=-1)  # Softmax along the rows

- Aggregate the Values (`attention_weights @ v`)
  - Now we use our weights to create a weighted average of the `Value` vectors
  - `attention_weights` has shape `(1, 4, 4)`; `v` has shape `(1, 4, 2)`
  - The multiplication `(1, 4, 4) @ (1, 4, 2)` produces a final tensor of shape `(1, 4, 2)`
# --- Value Aggregation ---
output = attention_weights @ v
print("\n--- Final Output (Context-Aware Vectors) ---")
print(output.shape)
print(output)

Output:
--- Final Output (Context-Aware Vectors) ---
torch.Size([1, 4, 2])
tensor([[[ 0.0652, -0.1691],
[ 0.1147, -0.2974],
[ 0.0768, -0.1991],
[ 0.1005, -0.2607]]])

Gist of the tensor transformations done above:
| Step | Operation | Input Shapes | Output Shape (B, T, ...) | Meaning |
|---|---|---|---|---|
| 1 | q_proj(x) etc. | (1, 4, 2) | (1, 4, 2) | Create Q, K, V for each token |
| 2 | q @ k.T | (1, 4, 2) & (1, 2, 4) | (1, 4, 4) | Raw compatibility scores |
| 3 | / sqrt(d_k) | (1, 4, 4) | (1, 4, 4) | Stabilized scores |
| 4 | softmax | (1, 4, 4) | (1, 4, 4) | Attention probabilities |
| 5 | att @ v | (1, 4, 4) & (1, 4, 2) | (1, 4, 2) | Context-aware output vectors |
Now we have taken our raw input x and produced a new tensor output of the exact same shape, where each token's vector has been updated with information from its neighbors
2. Encapsulating the Logic in an nn.Module
Here is the complete, encapsulated code for a single attention head
class SingleHeadSelfAttention(nn.Module):
    def __init__(self, config):
        """
        Initializes the layers needed for self-attention.
        """
        super().__init__()
        # The single, fused linear layer for Q, K, V
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)

    def forward(self, x):
        """
        Defines the data flow through the module.
        Input x shape: (B, T, C)
        """
        B, T, C = x.size()
        # 1. Get Q, K, V from a single projection and split them
        qkv = self.c_attn(x)
        q, k, v = qkv.split(C, dim=2)
        # 2. Calculate attention weights
        # (B, T, C) @ (B, C, T) -> (B, T, T)
        scaled_scores = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        attention_weights = F.softmax(scaled_scores, dim=-1)
        # 3. Aggregate values
        # (B, T, T) @ (B, T, C) -> (B, T, C)
        output = attention_weights @ v
        return output

The __init__ method sets up the building blocks. Here, we only need one:
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)

In the manual walkthrough we used three separate nn.Linear layers. This single fused line is a common and highly efficient optimization that achieves the same goal
| Our Manual Walkthrough (Conceptually Clear) | Fused Layer (Computationally Efficient) |
|---|---|
| `q_proj = nn.Linear(C, C)` | |
| `k_proj = nn.Linear(C, C)` | `c_attn = nn.Linear(C, 3*C)` |
| `v_proj = nn.Linear(C, C)` | |
Instead of three smaller matrix multiplications, the GPU can perform one larger, faster matrix multiplication. The bias=False argument is a common simplification used in minimal implementations like NanoGPT. Note that the original GPT-2 implementation does include biases in its linear projections
The Forward Method
- Projection and Splitting
qkv = self.c_attn(x)
q, k, v = qkv.split(C, dim=2)

  - `self.c_attn(x)`: We pass our input `x` (shape `B, T, C`) through the fused layer, resulting in a `qkv` tensor of shape `(B, T, 3*C)`
  - `qkv.split(C, dim=2)`: This is the clever part. The `.split()` function carves up the tensor. We tell it: "Along dimension 2 (the last dimension), create chunks of size C." Since the total dimension is `3*C`, this gives us exactly three tensors, each with the desired shape of `(B, T, C)`, which we assign to `q`, `k`, and `v`
- Calculating Attention Weights
scaled_scores = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
attention_weights = F.softmax(scaled_scores, dim=-1)

This is a direct, one-to-one implementation of the mathematical formula.
  - `k.transpose(-2, -1)` swaps the `T` and `C` dimensions of the Key tensor to prepare for matrix multiplication
  - `q @ ...` performs the dot product, resulting in the raw score matrix of shape `(B, T, T)`
  - `/ math.sqrt(k.size(-1))` performs the scaling for stability
  - `F.softmax(...)` converts the raw scores into a probability distribution along each row
- Aggregating Values
output = attention_weights @ v

Finally, we perform the last matrix multiplication. The attention weights `(B, T, T)` are multiplied with the Value vectors `(B, T, C)`, resulting in our final output tensor of shape `(B, T, C)`
Proof of Equivalence:
To prove this class is identical to our manual work, we can instantiate it and manually load the weights from our q_proj, k_proj, and v_proj layers into the single c_attn layer
@dataclass
class GPTConfig: n_embd: int
model = SingleHeadSelfAttention(GPTConfig(n_embd=C))
# The c_attn layer's weight matrix is shape (3*C, C). Our separate weights
# are each (C, C). We concatenate them along dim=0 to get (3*C, C).
model.c_attn.weight.data = torch.cat(
    [q_proj.weight.data, k_proj.weight.data, v_proj.weight.data], dim=0
)
# Run the model
model_output = model(x)
# 'output' is the tensor from our manual walkthrough in Part 1
print("Are the outputs the same?", torch.allclose(output, model_output))

Output:
Are the outputs the same? True
However, our model has a flaw for language generation: tokens can see into the future. Our current attention matrix allows this. We can fix this by adding a causal mask
6. Causal Masking
At this point we have an attention mechanism that allows tokens to communicate, but there is a flaw: a token can also communicate with tokens that will be generated in the future. We are building an autoregressive model, which generates text one token at a time, so the next output token should depend only on the tokens that have already been generated, never on the tokens that come after the current one

For example, in the previous output
tensor([[[0.37, 0.32, 0.31, ...], # "A" attends to all 4 tokens
[0.31, 0.37, 0.32, ...], # "crane" attends to all 4 tokens
[0.36, 0.31, 0.33, ...], # "ate" attends to all 4 tokens
... # "fish" attends to all 4 tokens
]]])

we can see that the token "A" is gathering information from "crane", "ate" and "fish", which should not happen. To solve this problem we use a causal mask: we modify the attention score matrix before applying the softmax, "masking out" all the future positions by setting their scores to negative infinity (-inf)
We use -inf because the softmax function involves an exponential, e^x / sum(e^x), and the exponential of negative infinity, e^-inf, is effectively zero. This forces the attention weights for all future tokens to become 0, preventing any information flow
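A tiny sketch (toy scores assumed) of why setting masked positions to -inf works:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([0.98, 0.70, float("-inf"), float("-inf")])  # last two positions are "in the future"
weights = F.softmax(scores, dim=-1)
print(weights)  # ≈ tensor([0.57, 0.43, 0.00, 0.00]) -> masked positions get exactly zero weight
```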
1. Applying Operations on Raw Tensors
Previously we got vectors like this
# This is the scaled_scores tensor from the end of the last chapter
# Shape (B, T, T) -> (1, 4, 4)
scaled_scores = torch.tensor([[
    [ 0.0375, 0.2925, 0.1274, 0.1924],
    [ 0.1260, 0.9822, 0.4280, 0.6433],
    [ 0.0437, 0.3405, 0.1484, 0.2228],
    [ 0.0891, 0.6945, 0.3023, 0.4549]
]])

- Creating the Mask
  - We need a mask that allows a token to see itself and the past, but not the future. A lower-triangular matrix is perfect for this. We can create one easily with `torch.tril`
# T=4 for our sentence "A crane ate fish"
T = 4
mask = torch.tril(torch.ones(T, T))
print("--- The Mask ---")
print(mask)

Output:
--- The Mask ---
tensor([[1., 0., 0., 0.],
[1., 1., 0., 0.],
[1., 1., 1., 0.],
[1., 1., 1., 1.]])

- Row 0 ("A") can only see column 0 ("A")
- Row 1 (“crane”) can see column 0 (“A”) and 1 (“crane”)
- And so on. The zeros in the upper-right triangle represent the “future” connections that we must block
- Applying the Mask
  - We use the PyTorch function `masked_fill` to apply our mask. It replaces all values in `scaled_scores` with `-inf` wherever the corresponding position in our `mask` is `0`
masked_scores = scaled_scores.masked_fill(mask == 0, float('-inf'))
print("\n--- Scores After Masking ---")
print(masked_scores)

Output:
--- Scores After Masking ---
tensor([[[ 0.0375, -inf, -inf, -inf],
[ 0.1260, 0.9822, -inf, -inf],
[ 0.0437, 0.3405, 0.1484, -inf],
[ 0.0891, 0.6945, 0.3023, 0.4549]]])

- Running the softmax again
attention_weights = F.softmax(masked_scores, dim=-1)
print("\n--- Final Causal Attention Weights ---")
print(attention_weights.data.round(decimals=2))

Output:
--- Final Causal Attention Weights ---
tensor([[[1.0000, 0.0000, 0.0000, 0.0000],
[0.2995, 0.7005, 0.0000, 0.0000],
[0.3129, 0.3807, 0.3064, 0.0000],
[0.2186, 0.3999, 0.2445, 0.1370]]])

From this output we can derive that:
- “A” can only attend to itself (100%)
- “crane” attends to “A” (30%) and “crane” (70%)
- “ate” attends to “A”, “crane”, and “ate”. Information can now only flow from the past to the present
| Attention Type | ”crane” attends to “fish”? | “ate” attends to “fish”? |
|---|---|---|
| Unmasked (Ch 5) | Yes | Yes |
| Causal (Ch 6) | No (0%) | No (0%) |
2. Encapsulating in the nn.Module
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        # ... (c_attn layer from before)
        # We register the mask as a "buffer"
        self.register_buffer(
            "bias",  # name of the buffer
            torch.tril(torch.ones(config.block_size, config.block_size))
            .view(1, 1, config.block_size, config.block_size)
        )

In the __init__ method we need to store our mask as part of the module, and we use register_buffer for this
Q Why register_buffer?
A A buffer is a tensor that is part of the model’s state (like weights), so it gets moved to the GPU with .to(device). However, it is not a parameter that gets updated by the optimizer during training
The .view(1, 1, ...) part is to add extra dimensions for broadcasting, which will be essential for Multi-Head Attention
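A short sketch (toy sizes assumed) of that broadcasting: the sliced (1, 1, T, T) mask lines up against a (B, n_head, T, T) score tensor, so one stored mask serves every batch element and every head:

```python
import torch

block_size, B, n_head, T = 8, 2, 3, 5   # toy sizes
bias = torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)

att = torch.randn(B, n_head, T, T)      # pretend these are raw attention scores for all heads
masked = att.masked_fill(bias[:, :, :T, :T] == 0, float("-inf"))
print(masked.shape)                     # torch.Size([2, 3, 5, 5])
print(masked[0, 0])                     # upper triangle is -inf, for every batch element and head
```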
We add this masking step to the forward() function
def forward(self, x):
    B, T, C = x.size()
    # ... (get q, k, v as before)
    scaled_scores = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
    # --- THE NEW LINE ---
    # We slice the stored mask to match the sequence length T of our input
    scaled_scores = scaled_scores.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
    attention_weights = F.softmax(scaled_scores, dim=-1)
    output = attention_weights @ v
    return output

So far so good: we have context-aware vectors whose meaning depends only on the tokens generated so far. One drawback remains, though: only one "conversation" can be held at a time, because a single attention mechanism has to do all of the essential work. To solve this we use Multi-Head Attention
7. Multi-Head Attention
What if, instead of one overworked attention mechanism, we could have several working in parallel? This is the core idea of Multi-Head Attention. We will split our embedding dimension C into smaller chunks, called “heads”. Each head will be its own independent attention mechanism, complete with its own Q, K, and V projections
- Head 1 might learn to focus on verb-object relationships
- Head 2 might learn to focus on which pronouns refer to which nouns
- Head 3 might learn to track long-range dependencies in the text
- …and so on
Each head conducts its own “conversation” and produces its own context-aware output vector. At the end, we simply concatenate the results from all the heads and pass them through a final linear layer to combine the insights
1. Applying Operations on Raw Tensor
Let’s start with our q, k, and v tensors from Step 5
- Shape: `(B, T, C)` → `(1, 4, 768)` (let's use a more realistic `C` for this example)
- `n_head`: Let's say we want `12` attention heads
- `head_dim`: The dimension of each head will be `C / n_head`, which is `768 / 12 = 64`
- Splitting `C` into `n_head` and `head_dim`: Our current `q` tensor has shape `(1, 4, 768)`. We need to reshape it so that the 12 heads are explicit. The target shape is `(B, n_head, T, head_dim)`, or `(1, 12, 4, 64)`. This is done with a sequence of `view()` and `transpose()` operations
# --- Configuration ---
B, T, C = 1, 4, 768
n_head = 12
head_dim = C // n_head # 768 // 12 = 64
# --- Dummy Q, K, V tensors with realistic shapes ---
q = torch.randn(B, T, C)
k = torch.randn(B, T, C)
v = torch.randn(B, T, C)
# --- Reshaping Q ---
# 1. Start with q: (B, T, C) -> (1, 4, 768)
# 2. Reshape to add the n_head dimension
q_reshaped = q.view(B, T, n_head, head_dim) # (1, 4, 12, 64)
# 3. Transpose to bring n_head to the front
q_final = q_reshaped.transpose(1, 2) # (1, 12, 4, 64)
print("Original Q shape:", q.shape)
print("Final reshaped Q shape:", q_final.shape)Output:
Original Q shape: torch.Size([1, 4, 768])
Final reshaped Q shape: torch.Size([1, 12, 4, 64])
We do the exact same reshaping for k and v. Now, PyTorch’s broadcasting capabilities will treat the n_head dimension as a new “batch” dimension. All our subsequent attention calculations will be performed independently for all 12 heads at once
- Run Attention in Parallel: Our attention formula remains the same, but now it operates on tensors with an extra `n_head` dimension
# Reshape k and v as well
k_final = k.view(B, T, n_head, head_dim).transpose(1, 2) # (1, 12, 4, 64)
v_final = v.view(B, T, n_head, head_dim).transpose(1, 2) # (1, 12, 4, 64)
# --- Attention Calculation ---
# (B, nh, T, hd) @ (B, nh, hd, T) -> (B, nh, T, T)
scaled_scores = (q_final @ k_final.transpose(-2, -1)) / math.sqrt(head_dim)
# (We would apply the causal mask here)
attention_weights = F.softmax(scaled_scores, dim=-1)
# (B, nh, T, T) @ (B, nh, T, hd) -> (B, nh, T, hd)
output_per_head = attention_weights @ v_final
print("Shape of output from each head:", output_per_head.shape)Output:
Shape of output from each head: torch.Size([1, 12, 4, 64])
We now have a (64-dimensional) output vector for each of our 4 tokens, from each of our 12 heads
- Merging the Heads: The last step is to combine the insights from all 12 heads. We do this by reversing the reshape operation: we concatenate the heads back together into a single `C`-dimensional vector and then pass it through a final linear projection layer (`c_proj`)
# 1. Transpose and reshape to merge the heads back together
# (B, nh, T, hd) -> (B, T, nh, hd)
merged_output = output_per_head.transpose(1, 2).contiguous()
# The .contiguous() is needed because transpose can mess with memory layout.
# It creates a new tensor with the elements in the correct memory order.
# (B, T, nh, hd) -> (B, T, C)
merged_output = merged_output.view(B, T, C)
print("Shape of merged output:", merged_output.shape)
# 2. Pass through the final projection layer
c_proj = nn.Linear(C, C)
final_output = c_proj(merged_output)
print("Shape of final output:", final_output.shape)Output:
Shape of merged output: torch.Size([1, 4, 768])
Shape of final output: torch.Size([1, 4, 768])
We have successfully returned to our original (B, T, C) shape. Each token’s vector now contains the combined, context-aware information from all 12 attention heads
| Component | Shape Transformation | Purpose |
|---|---|---|
| Split Heads | (B, T, C) -> (B, nh, T, hd) | Prepare for parallel computation |
| Attention | (B, nh, T, hd) -> (B, nh, T, hd) | Each head computes context independently |
| Merge Heads | (B, nh, T, hd) -> (B, T, C) | Combine the insights from all heads |
| Final Projection | (B, T, C) -> (B, T, C) | Mix the combined information |
2. Encapsulating in the nn.Module
Looking back at the CausalSelfAttention class from the full GPT-2 code, we can see the pieces this requires.

In the __init__ method, we add the c_proj layer and an assertion to ensure the dimensions are compatible
class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # ... (c_attn and bias buffer from before)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=True)

The forward method contains the full implementation:
def forward(self, x):
    B, T, C = x.size()
    # 1. Get QKV and split into heads
    qkv = self.c_attn(x)
    q, k, v = qkv.split(self.n_embd, dim=2)
    head_dim = C // self.n_head
    q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
    k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
    v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
    # 2. Run causal self-attention on each head
    att = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)
    att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
    att = F.softmax(att, dim=-1)
    y = att @ v
    # 3. Merge heads and project
    y = y.transpose(1, 2).contiguous().view(B, T, C)
    y = self.c_proj(y)
    return y

Almost everything is in place now. We just need to add the "thinking" layer (the MLP) and then stack these blocks together
8. MLP(Multi-Layer Perceptron) or FFN(Position-wise Feed-Forward Network) - Thinking Layer
While the attention layer is the "communication" layer of the Transformer, letting tokens gather and aggregate information from their context, the MLP is the "thinking" layer. After each token has collected the context it needs, it needs time to "think" about it and process this new, context-rich information. Here is the MLP code
class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.proj = nn.Linear(4 * config.n_embd, config.n_embd)
        self.drop = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.fc(x)
        x = F.gelu(x)  # GPT-2 uses GELU
        x = self.drop(self.proj(x))
        return x

While the attention layer allows tokens to interact with each other, the MLP processes the information for each token independently
The MLP in a Transformer has a standard two-layer architecture:
- Expansion Layer (`fc`): The first linear layer takes the input vector of size `n_embd` and projects it up to a much larger, intermediate dimension, typically `4 * n_embd`
- Non-Linearity (`gelu`): An activation function is applied. GPT-2 uses GELU (Gaussian Error Linear Unit), which is a smooth alternative to the more common ReLU. This is what allows the network to learn complex, non-linear functions
- Contraction Layer (`proj`): The second linear layer projects the large intermediate vector back down to the original `n_embd` dimension
- Dropout (`drop`): A dropout layer is applied for regularization to prevent overfitting
1. Understanding nn.Linear
It’s just a matrix multiplication followed by the addition of a bias vector
The Math: output = input @ W^T + b. For each output element, the layer calculates a weighted sum of all input elements and adds a bias. For example,
import torch
import torch.nn as nn
C_in = 2
C_out = 4
linear_layer = nn.Linear(C_in, C_out)

What are the learnable parameters? This layer has two sets of learnable parameters that are updated during training:
- Weights (`.weight`): A matrix of shape `(C_out, C_in)`. For us, this is `(4, 2)`. Total weights: `4 * 2 = 8`
- Biases (`.bias`): A vector of shape `(C_out)`. For us, this is `(4)`. Total biases: `4`
Let’s take a hard-coded example to understand
# Manually set the weights
linear_layer.weight.data = torch.tensor([
[1., 0.], # Weights for output element 0
[-1., 0.], # Weights for output element 1
[0., 2.], # Weights for output element 2
[0., -2.] # Weights for output element 3
])
# Manually set the biases
linear_layer.bias.data = torch.tensor([1., 1., -1., -1.])

Now let's pass a single vector through it
# Our input vector
input_vector = torch.tensor([0.5, -0.5])
# The forward pass
output_vector = linear_layer(input_vector)

- output[0] = (input[0] * weight[0,0]) + (input[1] * weight[0,1]) + bias[0]
- output[0] = (0.5 * 1.0) + (-0.5 * 0.0) + 1.0
- output[0] = 0.5 + 0.0 + 1.0 = 1.5
The result would be
print("Input vector:", input_vector)
print("Output vector:", output_vector)Output:
Input vector: tensor([ 0.5000, -0.5000])
Output vector: tensor([ 1.5000, 0.5000, -2.0000, 0.0000], grad_fn=<AddBackward0>)

The output matches our manual calculation for the first element. The nn.Linear layer simply performs this weighted sum for each of the 4 output elements
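The same result can be reproduced by writing the formula output = input @ W^T + b directly (a quick check, reusing the layer and vector defined above):

```python
# Manual matrix form of the same computation
manual_output = input_vector @ linear_layer.weight.T + linear_layer.bias
print(manual_output)                                 # tensor([ 1.5000,  0.5000, -2.0000,  0.0000], ...)
print(torch.allclose(manual_output, output_vector))  # True
```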
2. Full MLP Walkthrough
Let’s trace a single token’s vector through the entire MLP forward pass. The MLP acts on each token independently, so we only need to look at one vector to understand the whole process
- We'll use a tiny embedding dimension `C=2`
- The MLP will expand this to an intermediate dimension of `4*C = 8`
- Our input `x` will be the vector for a single token (T=1), in a batch of one (B=1)
# Our input vector for one token. Shape (B, T, C) -> (1, 1, 2)
x = torch.tensor([[[0.5, -0.5]]])

- The Expansion Layer (`fc`): This is an `nn.Linear` layer that projects from `C=2` to `4*C=8`
# Create the layer
fc = nn.Linear(2, 8)
# Manually set its weights and biases for a clear example
fc.weight.data = torch.randn(8, 2) * 2 # Scale up for more interesting GELU results
fc.bias.data = torch.ones(8) # Set all biases to 1
# --- Pass the input through the layer ---
x_expanded = fc(x)
print("--- After Expansion Layer ---")
print("Shape:", x_expanded.shape)
print("Values:\n", x_expanded.data.round(decimals=2))Output:
--- After Expansion Layer ---
Shape: torch.Size([1, 1, 8])
Values:
tensor([[[ 2.4000, -0.5000, 1.8800, -1.9100, 2.0800, 1.1600, 0.4100, -2.1200]]])

The 2-dimensional vector has been successfully expanded to an 8-dimensional one
- The GELU Activation: Next, we apply the non-linear GELU activation function. Intuitively, GELU is a smoother version of ReLU. It squashes negative values towards zero but allows a small amount of negative signal to pass through. Positive values are largely left unchanged
| Input | GELU(Input) |
|---|---|
| 2.4 | ~2.39 |
| 1.0 | ~0.84 |
| 0.0 | 0.0 |
| -0.5 | ~ -0.15 |
| -2.0 | ~ -0.00 |
Applying the same to expanded vector
import torch.nn.functional as F
# --- Apply GELU ---
x_activated = F.gelu(x_expanded)
print("\n--- After GELU Activation ---")
print("Shape:", x_activated.shape)
print("Values:\n", x_activated.data.round(decimals=2))Output:
--- After GELU Activation ---
Shape: torch.Size([1, 1, 8])
Values:
tensor([[[ 2.3900, -0.1500, 1.8700, -0.0100, 2.0600, 1.0300, 0.3100, -0.0000]]])

As expected, the large positive values (2.40, 1.88) are almost untouched, while the large negative values (-1.91, -2.12) are squashed to nearly zero. This non-linear step is essential for the model to learn complex patterns
3. The Contraction Layer (proj): Now, we project the 8-dimensional activated vector back down to our original C=2 dimension.
# Create the layer
proj = nn.Linear(8, 2)
# Manually set its weights and biases
proj.weight.data = torch.randn(2, 8)
proj.bias.data = torch.zeros(2) # No bias for simplicity
# --- Pass the activated vector through the layer ---
x_projected = proj(x_activated)
print("\n--- After Contraction Layer ---")
print("Shape:", x_projected.shape)
print("Values:\n", x_projected.data.round(decimals=2))Output:
--- After Contraction Layer ---
Shape: torch.Size([1, 1, 2])
Values:
tensor([[[ 1.0900, -1.3800]]])

We are back to our original shape of (1, 1, 2)
4. Dropout: The final step in the MLP is dropout
drop = nn.Dropout(0.1)
final_output = drop(x_projected)

During training, this layer would randomly set 10% of the elements in x_projected to zero. This is a regularization technique that helps prevent the model from becoming too reliant on any single feature. During inference/evaluation (when we call model.eval()), the dropout layer does nothing and simply passes the data through unchanged. For our numerical example, we can assume it does nothing
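A small sketch of that train/eval difference (illustrative only; which elements get zeroed is random):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(0.5)   # exaggerated rate so the effect is visible
x = torch.ones(1, 8)

drop.train()             # training mode: elements are randomly zeroed, survivors scaled by 1/(1-p)
print(drop(x))           # e.g. tensor([[2., 0., 2., 2., 0., ...]]) -- pattern depends on the seed

drop.eval()              # evaluation mode: dropout is a no-op
print(drop(x))           # tensor([[1., 1., 1., 1., 1., 1., 1., 1.]])
```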
The Final Result Our initial input vector [[[0.5, -0.5]]] has been transformed by the MLP into [[[ 1.09, -1.38]]]. This new vector, which has undergone a non-linear “thinking” process, is now ready for the next stage
The key takeaway is that the MLP transforms the input vector while preserving its shape (B, T, C). This is critical, as it allows us to add this output back to the original input (a “residual connection”) and to stack multiple Transformer Blocks on top of each other
We have now built both major components of our Transformer block: CausalSelfAttention (communication) and MLP (thinking). The final step is to assemble them into a complete Block
9. Residual Connections
We now have the "communication" layer, where tokens exchange information, and the "thinking" layer, where each token processes the information it has gathered. Next we need to connect the two into a repeatable Block, which introduces the architectural glue that makes deep learning possible: the residual connection. Here is the code
class Block(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        # We will discuss these LayerNorm layers in the next chapter
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        """
        The forward pass of a single Transformer Block.
        """
        # --- This is our focus: the addition operation ---
        # The output of the attention layer is ADDED to the original input 'x'.
        x = x + self.attn(self.ln_1(x))
        # --- And this one too ---
        # The output of the MLP is ADDED to the result of the first step.
        x = x + self.mlp(self.ln_2(x))
        return x

Q Why can't we just stack the layers sequentially? Why do we need to add their outputs back into the input?
A A natural first instinct when building a deep model is to just stack layers sequentially: x -> layer1 -> layer2 -> layer3 -> .... However, when networks get very deep (e.g., more than a dozen layers), this simple approach often fails
The reason is a phenomenon called the vanishing gradient problem. During training, the learning signal (the gradient) must travel backward from the final output all the way to the first layer’s weights. With each step backward through a layer, this signal is multiplied by the layer’s weights. In many cases, this causes the signal to shrink exponentially. By the time it reaches the early layers, it’s so vanishingly small that those layers barely learn at all
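A back-of-the-envelope illustration of that shrinkage (assumed numbers, not a real gradient computation): if each of 12 layers scales the backward signal by a factor of about 0.5, the early layers receive almost nothing:

```python
# Toy illustration of exponential shrinkage across layers
signal = 1.0
per_layer_factor = 0.5            # assumed average scaling applied at each layer during backprop
for layer in range(12):
    signal *= per_layer_factor
print(signal)                     # 0.000244... -> the learning signal is ~4000x smaller after 12 layers
```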
The Solution: The Residual “Express Lane”
The residual connection provides an elegant solution by creating a “shortcut” or an “express lane” for the data and, more importantly, for the gradient
graph TD
    subgraph Attention Sub-Layer
        B(LayerNorm) --> C(CausalSelfAttention)
    end
    A[Input x] --> B
    C --> D["(+)"]
    A --"Residual Connection (Express Lane)"--> D
    D --> E[Output]
By adding the original input x directly to the output of the sub-layer (self.attn(...)), we create an uninterrupted highway. During backpropagation, the gradient can flow directly through this addition operator, completely bypassing the complex transformations inside the attn layer
This changes the learning objective. The network no longer needs to learn the entire, complex transformation from scratch. Instead, the attn layer only needs to learn the residual—the difference, or “delta,” that should be applied to the input
Intuition: Imagine you’re teaching a painter
- Without Residuals (Hard): “Here is a blank canvas. Paint a masterpiece.”
- With Residuals (Easy): "Here is the current painting (`x`). Just make these small, incremental adjustments (`attn(self.ln_1(x))`)"
The final result is x + attn(self.ln_1(x)). It is much easier for a network to learn how to make small, iterative adjustments than it is to learn the entire transformation at every single layer
Walkthrough with Numbers
The operation is a simple element-wise addition. Let’s focus on a single token for clarity (B=1, T=1) with an embedding dimension of C=4
import torch
# Our input vector for a single token, 'x' at the start of the forward pass
x_initial = torch.tensor([[[0.2, 0.1, 0.3, 0.4]]])
print("Original input x:\n", x_initial)
# Let's pretend this is the output of `self.attn(self.ln_1(x))`.
# It represents the "change" or "adjustment" to be made.
attention_output = torch.tensor([[[0.1, -0.1, 0.2, -0.3]]])
print("\nOutput from the Attention sub-layer (the 'adjustment'):\n", attention_output)
# The residual connection is the first line of the forward pass: x = x + ...
x_after_attn = x_initial + attention_output
print("\nValue of x after the first residual connection:\n", x_after_attn)Output:
Original input x:
tensor([[[0.2000, 0.1000, 0.3000, 0.4000]]])
Output from the Attention sub-layer (the 'adjustment'):
tensor([[[ 0.1000, -0.1000, 0.2000, -0.3000]]])
Value of x after the first residual connection:
tensor([[[0.3000, 0.0000, 0.5000, 0.1000]]])

The output of the attention sub-layer is just an update to the original vector. The shape of the tensor remains unchanged, which is a critical property
| Step in forward | Operation | Input Shape | Output Shape | Meaning |
|---|---|---|---|---|
| 1 | self.attn(self.ln_1(x)) | (B, T, C) | (B, T, C) | Calculate the update/residual |
| 2 | x + ... | (B, T, C) | (B, T, C) | Apply the update to the original input |
We have now added the first piece of “glue” to our block. This express lane allows us to build much deeper and more powerful models. The next piece of glue we need is a stabilizer to keep the data flowing smoothly on this highway: Layer Normalization
10. Layer Normalization
We need a “stabilizer” to ensure the data flowing through our network remains well-behaved thus we use Layer Normalization. Here is the code
class Block(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        # --- We define the LayerNorm layers here ---
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        """
        The forward pass of a single Transformer Block.
        """
        # --- LayerNorm is applied BEFORE the sub-layer ---
        x = x + self.attn(self.ln_1(x))
        # --- And here again ---
        x = x + self.mlp(self.ln_2(x))
        return x

As data flows through a deep network, the distribution of the activations at each layer is constantly changing during training. The mean and variance of the inputs to a given layer can shift wildly from one training batch to the next. This phenomenon is called Internal covariate shift
This makes training very difficult. It’s like trying to hit a moving target. Each layer has to constantly adapt to a new distribution of inputs from the layer before it, which can make the training process unstable and slow
The solution is Layer Normalization, a technique that forces the inputs to each sub-layer to have a consistent distribution. It acts as a stabilizer. For each individual token's vector in our (B, T, C) tensor, it performs the following steps independently:
- Calculates the mean (μ) and variance (σ²) across the `C` (embedding) dimension of that single vector
- Normalizes the vector: $$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
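A minimal sketch of those two steps, checked against PyTorch's nn.LayerNorm at initialization (where its learnable scale and shift do not yet change anything):

```python
import torch
import torch.nn as nn

C = 4
x = torch.tensor([[[0.2, 0.1, 0.3, 0.4]]])          # one token's vector, shape (B=1, T=1, C=4)

mu = x.mean(dim=-1, keepdim=True)                    # mean over the C dimension
var = x.var(dim=-1, keepdim=True, unbiased=False)    # variance over the C dimension
x_hat = (x - mu) / torch.sqrt(var + 1e-5)            # normalize

ln = nn.LayerNorm(C)                                 # freshly initialized LayerNorm
print(x_hat)
print(ln(x))                                         # same mean/variance normalization
print(torch.allclose(x_hat, ln(x), atol=1e-5))       # True
```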