Instead of training a 1-billion-parameter weight matrix W, you freeze it. You then train two tiny matrices, A and B, that represent the change to W. At its core, LoRA is a short piece of code:

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
 
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int, alpha: float = 16.0):
        super().__init__()
        self.r = r
        self.alpha = alpha
        self.scaling = self.alpha / self.r
 
        # Freeze the original linear layer
        self.base = base
        self.base.weight.requires_grad_(False)
 
        # Create the trainable low-rank matrices
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.empty(base.out_features, r))
 
        # Initialize the weights
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B) # Start with no change
 
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original path (frozen) + LoRA path (trainable)
        return self.base(x) + (F.linear(F.linear(x, self.lora_A), self.lora_B) * self.scaling)

nn.Linear Layer

The nn.Linear layer is the simple workhorse that performs the vast majority of computation in a Transformer. Its only job is to compute one equation: output = input @ W.T + b

For example, we'll create a tiny linear layer that takes a vector of size 3 and outputs a vector of size 2. To make the arithmetic perfectly clear, we will set the weights and bias manually

  1. Setup the layer and input:
    import torch
    import torch.nn as nn
     
    # A layer that maps from 3 features to 2 features
    layer = nn.Linear(in_features=3, out_features=2, bias=True)
     
    # A single input vector (with a batch dimension of 1)
    input_tensor = torch.tensor([[1., 2., 3.]])
     
    # Manually set the weights and bias for a clear example
    with torch.no_grad():
        layer.weight = nn.Parameter(torch.tensor([[0.1, 0.2, 0.3],
                                                  [0.4, 0.5, 0.6]]))
        layer.bias = nn.Parameter(torch.tensor([0.7, 0.8]))
  2. Inspecting the Exact Components:
    • Now we have known values for everything
      • Input x: [1., 2., 3.]
      • Weight W: [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
      • Bias b: [0.7, 0.8]
  3. The Forward Pass and Its Output: When we call layer(input_tensor), PyTorch computes the result
    # The forward pass
    output_tensor = layer(input_tensor)
     
    print("--- PyTorch Calculation ---")
    print("Input (x):", input_tensor)
    print("Weight (W):\n", layer.weight)
    print("Bias (b):", layer.bias)
    print("\nOutput (y):", output_tensor)
    • This will print:
    --- PyTorch Calculation ---
    Input (x): tensor([[1., 2., 3.]])
    Weight (W):
     Parameter containing:
    tensor([[0.1000, 0.2000, 0.3000],
            [0.4000, 0.5000, 0.6000]], requires_grad=True)
    Bias (b): Parameter containing:
    tensor([0.7000, 0.8000], requires_grad=True)
     
    Output (y): tensor([[2.1000, 4.0000]], grad_fn=<AddmmBackward0>)
    • The final output is the tensor [[2.1, 4.0]]
  4. Manual Verification: Step-by-Step
    • The calculation is x @ W.T + b
      • First, the matrix multiplication x @ W.T:
      • [1, 2, 3] @ [[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]]
      • output[0] = (1*0.1) + (2*0.2) + (3*0.3) = 0.1 + 0.4 + 0.9 = 1.4
      • output[1] = (1*0.4) + (2*0.5) + (3*0.6) = 0.4 + 1.0 + 1.8 = 3.2
      • Result: [1.4, 3.2]
    • Second, add the bias b:
      • [1.4, 3.2] + [0.7, 0.8]
      • Result: [2.1, 4.0]

The manual calculation matches the PyTorch output exactly. This is all a linear layer does
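If you want to check the equation directly, the same result falls out of one line of tensor arithmetic (continuing from the layer and input_tensor defined above):

# Reproduce the layer's output by hand: x @ W.T + b
manual = input_tensor @ layer.weight.T + layer.bias
print(manual)                                       # tensor([[2.1000, 4.0000]], ...)
print(torch.allclose(manual, layer(input_tensor)))  # True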

The Scaling Problem

  • Our Toy Layer (3x2):
    • Weight parameters: 3 * 2 = 6
    • Bias parameters: 2
    • Total: 8 trainable parameters
  • A Single LLM Layer (e.g., 4096x4096):
    • Weight parameters: 4096 * 4096 = 16,777,216
    • Bias parameters: 4096
    • Total: 16,781,312 trainable parameters

A single layer in an LLM can have over 16 million parameters. A full model has dozens of these layers. Trying to update all of them during fine-tuning is what melts GPUs. This is the bottleneck LoRA is designed to break
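These counts are easy to verify in PyTorch; here is a minimal check (4096 is used purely as a typical LLM hidden size):

import torch.nn as nn

toy = nn.Linear(in_features=3, out_features=2)        # the toy layer from above
big = nn.Linear(in_features=4096, out_features=4096)  # one LLM-sized projection

print(sum(p.numel() for p in toy.parameters()))  # 8
print(sum(p.numel() for p in big.parameters()))  # 16781312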

LoRA

This is the core idea. Instead of changing the massive weight matrix W_frozen, we freeze it and learn a tiny “adjustment” matrix, ΔW.
The new, effective weight matrix, W_effective, is a simple sum: W_effective = W_frozen + ΔW. Training the full ΔW would be too expensive. The breakthrough of LoRA is to force this change to be low-rank, meaning we can construct it from two much smaller matrices, B and A, so that ΔW = B @ A. We also add a scaling factor, alpha / r, where r is the rank and alpha is a hyperparameter

The full LoRA update is defined by this formula:

    W_effective = W_frozen + (alpha / r) * (B @ A)

Building a tiny LoRA update from scratch. Given:

  • A frozen weight matrix W_frozen of shape [out=4, in=3]
  • A LoRA rank r = 2
  • A scaling factor alpha / r = 2

Now, we define our trainable LoRA matrices, A and B:
  • A must have shape [r, in], so [2, 3]
  • B must have shape [out, r], so [4, 2]

Let's assume training has produced concrete values for A and B (a numeric sketch with made-up values follows after Step 3).

Step 1: Calculate the core update, B @ A. This is a standard matrix multiplication, and the result has the same shape as W_frozen: [4, 3]

Step 2: Apply the scaling factor, alpha / r. Our scaling factor is 2, so we multiply the B @ A result by this scalar. The scaled matrix ΔW = (alpha / r) * (B @ A) is the total change that our LoRA parameters will apply to the frozen weights.

Step 3: The “Merge” for Inference. After training is done, we can create the final, effective weight matrix by adding the frozen weights and the LoRA update: W_effective = W_frozen + ΔW. This final matrix is what you would use for deployment. Crucially, this merge calculation happens only once after training. For inference, it’s just a standard linear layer, adding zero extra latency
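Since the arithmetic is easiest to follow in code, here is a minimal sketch of Steps 1-3. The values of W_frozen, A, and B are made up purely for illustration; only the shapes matter:

import torch

# Made-up "trained" values, chosen only to have the right shapes
W_frozen = torch.randn(4, 3)   # frozen weight, shape [out=4, in=3]
A = torch.randn(2, 3)          # LoRA A, shape [r=2, in=3]
B = torch.randn(4, 2)          # LoRA B, shape [out=4, r=2]
scaling = 2.0                  # alpha / r

# Step 1: the core update B @ A has the same shape as W_frozen
core_update = B @ A
print(core_update.shape)       # torch.Size([4, 3])

# Step 2: apply the scaling factor to get the total change
delta_W = scaling * core_update

# Step 3: merge once after training to get the effective weight
W_effective = W_frozen + delta_W
print(W_effective.shape)       # torch.Size([4, 3])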

The Forward Pass (How it works during training):
During training, we never compute the full ΔW. That would be inefficient. Instead, we use the decomposed form, which is much faster. The forward pass is: y = W_frozen @ x + (alpha / r) * (B @ (A @ x)). Let's walk through it with an input x (a runnable check follows the list below):

  1. LoRA Path (right side):
    • First compute A @ x, a small vector of length r = 2
    • Then compute B @ (A @ x), a vector of length out = 4
    • Scale it: multiply by alpha / r = 2
  2. Frozen Path (left side):
    • Compute W_frozen @ x, exactly as the original frozen layer would
  3. Final Output:
    • y = W_frozen @ x + 2 * (B @ (A @ x)), the sum of the two paths
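A quick sketch (again with made-up values) confirms that the decomposed forward pass matches the merged weight matrix exactly:

import torch

W_frozen, A, B = torch.randn(4, 3), torch.randn(2, 3), torch.randn(4, 2)
scaling = 2.0                  # alpha / r
x = torch.randn(3)

# Decomposed path used during training: never forms B @ A
y_decomposed = W_frozen @ x + scaling * (B @ (A @ x))

# Merged path used for inference: W_effective = W_frozen + scaling * (B @ A)
y_merged = (W_frozen + scaling * (B @ A)) @ x

print(torch.allclose(y_decomposed, y_merged))  # True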
Method           | Trainable Parameters | Calculation             | Parameter Reduction
Full Fine-Tuning | 16,777,216           | 4096 * 4096             | 0%
LoRA (r=8)       | 65,536               | (8 * 4096) + (4096 * 8) | 99.61%
By performing the efficient forward pass during training, we only need to store and update the parameters for the tiny A and B matrices, achieving a >99% parameter reduction while still being able to modify the behavior of the massive base layer
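The table's numbers can be reproduced in a couple of lines (4096 again stands in for a typical hidden size, with r = 8):

hidden, r = 4096, 8

full_ft = hidden * hidden              # 16,777,216
lora = (r * hidden) + (hidden * r)     # 65,536

print(full_ft, lora)
print(f"{100 * (1 - lora / full_ft):.2f}% reduction")  # 99.61% reduction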

LoRALinear Module

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
 
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int, alpha: float = 16.0):
        super().__init__()
        # --- Store hyperparameters ---
        self.r = r
        self.alpha = alpha
        self.scaling = self.alpha / self.r
 
        # --- Store and freeze the original linear layer ---
        self.base = base
        self.base.weight.requires_grad_(False)
        # Also freeze the bias if it exists
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
 
        # --- Create the trainable LoRA matrices A and B ---
        # A has shape [r, in_features]
        # B has shape [out_features, r]
        self.lora_A = nn.Parameter(torch.empty(r, self.base.in_features))
        self.lora_B = nn.Parameter(torch.empty(self.base.out_features, r))
 
        # --- Initialize the weights ---
        # A is initialized with a standard method
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        # B is initialized with zeros
        nn.init.zeros_(self.lora_B)
 
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1. The original, frozen path
        base_output = self.base(x)
 
        # 2. The efficient LoRA path: B(A(x))
        # F.linear(x, self.lora_A) computes x @ A.T
        # F.linear(..., self.lora_B) computes (x @ A.T) @ B.T
        lora_update = F.linear(F.linear(x, self.lora_A), self.lora_B) * self.scaling
 
        # 3. Return the combined output
        return base_output + lora_update
  1. __init__(self, base, r, alpha):
    • It accepts the original nn.Linear layer (base) that we want to adapt
    • self.base.weight.requires_grad_(False): Here we tell PyTorch’s autograd engine not to compute gradients for the original weights, so they will never be updated by the optimizer
    • nn.Parameter(...): We register lora_A and lora_B as official trainable parameters of the module. Their shapes are derived directly from the base layer and the rank r
    • nn.init.zeros_(self.lora_B): By starting B as a zero matrix, the entire LoRA update (B @ A) is zero at the beginning of training. This means our LoRALinear layer initially behaves exactly like the original frozen layer, and the model learns the “change” from a stable starting point
  2. forward(self, x):
    • This is a direct translation of the formula: output = base(x) + (alpha / r) * B(A(x))
    • We compute the output of the frozen path and the LoRA path separately
    • The nested F.linear calls are a highly efficient PyTorch way to compute (x @ A.T) @ B.T without ever forming the full ΔW = B @ A matrix
    • Finally, we add them together (a quick check of a freshly wrapped layer follows this list)
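Here is a minimal check of the zero-initialization behavior, using a throwaway nn.Linear purely for illustration (this assumes the LoRALinear class defined above):

import torch
import torch.nn as nn

base = nn.Linear(16, 8)
wrapped = LoRALinear(base, r=4, alpha=16.0)

x = torch.randn(2, 16)
# Because lora_B starts at zero, the wrapped layer initially matches the base layer
print(torch.allclose(wrapped(x), base(x)))  # True

# Only the LoRA matrices require gradients
print([n for n, p in wrapped.named_parameters() if p.requires_grad])  # ['lora_A', 'lora_B']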

Applying LoRA to a Model

Now we need a helper function to swap out the nn.Linear layers in any given model with our new LoRALinear layer

def apply_lora(model: nn.Module, r: int, alpha: float = 16.0):
    """
    Replaces all nn.Linear layers in a model with LoRALinear layers.
    """
    for name, module in list(model.named_modules()):
        if isinstance(module, nn.Linear):
            # Find the parent module that owns this layer
            if '.' in name:
                parent_name, child_name = name.rsplit('.', 1)
                parent_module = model.get_submodule(parent_name)
            else:
                # The layer is a direct child of the model (e.g., inside nn.Sequential)
                parent_module, child_name = model, name

            # Replace the original linear layer with the LoRA-wrapped version
            setattr(parent_module, child_name, LoRALinear(module, r=r, alpha=alpha))

Full Implementation

  1. Create a toy model:
    model = nn.Sequential(
        nn.Linear(128, 256),
        nn.ReLU(),
        nn.Linear(256, 10) # e.g., for classification
    )
  2. Inject LoRA layers:
    apply_lora(model, r=8, alpha=16.0)
    print(model)

The output will show that our nn.Linear layers have been replaced by LoRALinear
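With PyTorch's default module repr, the printed structure should look roughly like this:

Sequential(
  (0): LoRALinear(
    (base): Linear(in_features=128, out_features=256, bias=True)
  )
  (1): ReLU()
  (2): LoRALinear(
    (base): Linear(in_features=256, out_features=10, bias=True)
  )
)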

  3. Isolate the Trainable Parameters:
    • We create an optimizer that only sees the LoRA weights
    # Filter for parameters that require gradients (only lora_A and lora_B)
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    trainable_param_names = [name for name, p in model.named_parameters() if p.requires_grad]
     
    print("\nTrainable Parameters:")
    for name in trainable_param_names:
        print(name)
     
    # Create an optimizer that only updates the LoRA weights
    optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
    • Output:
    Trainable Parameters:
    0.lora_A
    0.lora_B
    2.lora_A
    2.lora_B

The optimizer is completely unaware of the massive, frozen weights (0.base.weight, 2.base.weight, etc.) and will only update our tiny, efficient LoRA matrices
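You can also compare trainable and total parameter counts for the toy model built above. Because this model is tiny, the trainable fraction here is larger than the >99% savings you would see on LLM-sized layers, but the mechanism is identical:

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters:     {total:,}")
print(f"Trainable parameters: {trainable:,} ({100 * trainable / total:.2f}%)")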

Q Where does LoRA actually go in an LLM?
A The nn.Linear layers we’ve been working with are the primary components of a Transformer. When you apply LoRA to a model like Llama or Mistral, you are targeting these specific linear layers:

  • Self-Attention Layers: The most common targets are the projection matrices for the query (q_proj) and value (v_proj). Adapting these allows the model to change what it pays attention to in the input text, which is incredibly powerful for task-specific fine-tuning
  • Feed-Forward Layers (MLP): Transformers also have blocks of linear layers that process information after the attention step. Applying LoRA here helps modify the model’s learned representations and knowledge

So, when you see a LoRA implementation for a real LLM, the apply_lora function is simply more selective, replacing only the linear layers named q_proj, v_proj, etc., with the LoRALinear module you just built
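As a sketch, such a selective version might look like this. It reuses the LoRALinear class and the nn import from above; the names q_proj and v_proj follow common Llama/Mistral conventions, and target_names is a hypothetical parameter, not any particular library's API:

def apply_lora_selective(model: nn.Module, r: int, alpha: float = 16.0,
                         target_names=("q_proj", "v_proj")):
    """Wrap only the nn.Linear layers whose leaf names match the chosen targets."""
    for name, module in list(model.named_modules()):
        if isinstance(module, nn.Linear) and name.split('.')[-1] in target_names:
            if '.' in name:
                parent_name, child_name = name.rsplit('.', 1)
                parent_module = model.get_submodule(parent_name)
            else:
                parent_module, child_name = model, name
            setattr(parent_module, child_name, LoRALinear(module, r=r, alpha=alpha))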

Q Why does this work so well?
A The knowledge needed to adapt a pre-trained model to a new task is much simpler than the model's entire knowledge base. You don't need to re-learn the entire English language to make a model a better chatbot; you only need to steer its existing knowledge. This "steering" information lives in a low-dimensional space, which a low-rank update ΔW = B @ A captures well