Instead of training a 1-billion-parameter weight matrix W, you freeze it. You then train two tiny matrices, A and B, that represent the change to W. At its core, LoRA is a short piece of code:

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
 
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int, alpha: float = 16.0):
        super().__init__()
        self.r = r
        self.alpha = alpha
        self.scaling = self.alpha / self.r
 
        # Freeze the original linear layer
        self.base = base
        self.base.weight.requires_grad_(False)
 
        # Create the trainable low-rank matrices
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.empty(base.out_features, r))
 
        # Initialize the weights
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B) # Start with no change
 
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original path (frozen) + LoRA path (trainable)
        return self.base(x) + (F.linear(F.linear(x, self.lora_A), self.lora_B) * self.scaling)

nn.Linear Layer

The nn.Linear layer is the simple workhorse that performs the vast majority of computation in a Transformer. Its only job is to compute one equation: output = input @ W.T + b

For example, we'll create a tiny linear layer that takes a vector of size 3 and outputs a vector of size 2. To make the arithmetic perfectly clear, we will set the weights and bias manually

  1. Setup the layer and input:
    import torch
    import torch.nn as nn
     
    # A layer that maps from 3 features to 2 features
    layer = nn.Linear(in_features=3, out_features=2, bias=True)
     
    # A single input vector (with a batch dimension of 1)
    input_tensor = torch.tensor([[1., 2., 3.]])
     
    # Manually set the weights and bias for a clear example
    with torch.no_grad():
        layer.weight = nn.Parameter(torch.tensor([[0.1, 0.2, 0.3],
                                                  [0.4, 0.5, 0.6]]))
        layer.bias = nn.Parameter(torch.tensor([0.7, 0.8]))
  2. Inspecting the Exact Components:
    • Now we have known values for everything
      • Input x: [1., 2., 3.]
      • Weight W: [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
      • Bias b: [0.7, 0.8]
  3. The Forward Pass and Its Output: When we call layer(input_tensor), PyTorch computes the result
    # The forward pass
    output_tensor = layer(input_tensor)
     
    print("--- PyTorch Calculation ---")
    print("Input (x):", input_tensor)
    print("Weight (W):\n", layer.weight)
    print("Bias (b):", layer.bias)
    print("\nOutput (y):", output_tensor)
    • This will print:
    --- PyTorch Calculation ---
    Input (x): tensor([[1., 2., 3.]])
    Weight (W):
     Parameter containing:
    tensor([[0.1000, 0.2000, 0.3000],
            [0.4000, 0.5000, 0.6000]], requires_grad=True)
    Bias (b): Parameter containing:
    tensor([0.7000, 0.8000], requires_grad=True)
     
    Output (y): tensor([[2.1000, 4.0000]], grad_fn=<AddmmBackward0>)
    • The final output is the tensor [[2.1, 4.0]]
  4. Manual Verification: Step-by-Step
    • The calculation is x @ W.T + b
      • First, the matrix multiplication x @ W.T:
      • [1, 2, 3] @ [[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]]
      • output[0] = (1*0.1) + (2*0.2) + (3*0.3) = 0.1 + 0.4 + 0.9 = 1.4
      • output[1] = (1*0.4) + (2*0.5) + (3*0.6) = 0.4 + 1.0 + 1.8 = 3.2
      • Result: [1.4, 3.2]
    • Second, add the bias b:
      • [1.4, 3.2] + [0.7, 0.8]
      • Result: [2.1, 4.0]

The manual calculation matches the PyTorch output exactly. This is all a linear layer does
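If you want to check the equation directly, the same result falls out of one line of tensor arithmetic (continuing from the layer and input_tensor defined above):

# Reproduce the layer's output by hand: x @ W.T + b
manual = input_tensor @ layer.weight.T + layer.bias
print(manual)                                       # tensor([[2.1000, 4.0000]], ...)
print(torch.allclose(manual, layer(input_tensor)))  # True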

The Scaling Problem

  • Our Toy Layer (3x2):
    • Weight parameters: 3 * 2 = 6
    • Bias parameters: 2
    • Total: 8 trainable parameters
  • A Single LLM Layer (e.g., 4096x4096):
    • Weight parameters: 4096 * 4096 = 16,777,216
    • Bias parameters: 4096
    • Total: 16,781,312 trainable parameters

A single layer in an LLM can have over 16 million parameters. A full model has dozens of these layers. Trying to update all of them during fine-tuning is what melts GPUs. This is the bottleneck LoRA is designed to break
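These counts are easy to verify in PyTorch; here is a minimal check (4096 is used purely as a typical LLM hidden size):

import torch.nn as nn

toy = nn.Linear(in_features=3, out_features=2)        # the toy layer from above
big = nn.Linear(in_features=4096, out_features=4096)  # one LLM-sized projection

print(sum(p.numel() for p in toy.parameters()))  # 8
print(sum(p.numel() for p in big.parameters()))  # 16781312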

LoRA

This is the core idea. Instead of changing the massive weight matrix W_frozen, we freeze it and learn a tiny “adjustment” matrix, ΔW.
The new, effective weight matrix, W_effective, is a simple sum: W_effective = W_frozen + ΔW. Training the full ΔW would be too expensive. The breakthrough of LoRA is to force this change to be low-rank, meaning we can construct it from two much smaller matrices, B and A, so that ΔW = B @ A. We also add a scaling factor, alpha / r, where r is the rank and alpha is a hyperparameter

The full LoRA update is defined by this formula:

    W_effective = W_frozen + (alpha / r) * (B @ A)

Building a tiny LoRA update from scratch. Given:

  • A frozen weight matrix W_frozen of shape [out=4, in=3]
  • A LoRA rank r = 2
  • A scaling factor alpha / r = 2

Now, we define our trainable LoRA matrices, A and B:
  • A must have shape [r, in], so [2, 3]
  • B must have shape [out, r], so [4, 2]

Let's assume training has produced concrete values for A and B (a numeric sketch with made-up values follows after Step 3).

Step 1: Calculate the core update, B @ A. This is a standard matrix multiplication, and the result has the same shape as W_frozen: [4, 3]

Step 2: Apply the scaling factor, alpha / r. Our scaling factor is 2, so we multiply the B @ A result by this scalar. The scaled matrix ΔW = (alpha / r) * (B @ A) is the total change that our LoRA parameters will apply to the frozen weights.

Step 3: The “Merge” for Inference. After training is done, we can create the final, effective weight matrix by adding the frozen weights and the LoRA update: W_effective = W_frozen + ΔW. This final matrix is what you would use for deployment. Crucially, this merge calculation happens only once after training. For inference, it’s just a standard linear layer, adding zero extra latency
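Since the arithmetic is easiest to follow in code, here is a minimal sketch of Steps 1-3. The values of W_frozen, A, and B are made up purely for illustration; only the shapes matter:

import torch

# Made-up "trained" values, chosen only to have the right shapes
W_frozen = torch.randn(4, 3)   # frozen weight, shape [out=4, in=3]
A = torch.randn(2, 3)          # LoRA A, shape [r=2, in=3]
B = torch.randn(4, 2)          # LoRA B, shape [out=4, r=2]
scaling = 2.0                  # alpha / r

# Step 1: the core update B @ A has the same shape as W_frozen
core_update = B @ A
print(core_update.shape)       # torch.Size([4, 3])

# Step 2: apply the scaling factor to get the total change
delta_W = scaling * core_update

# Step 3: merge once after training to get the effective weight
W_effective = W_frozen + delta_W
print(W_effective.shape)       # torch.Size([4, 3])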

The Forward Pass (How it works during training):
During training, we never compute the full ΔW. That would be inefficient. Instead, we use the decomposed form, which is much faster. The forward pass is: y = W_frozen @ x + (alpha / r) * (B @ (A @ x)). Let's walk through it with an input x (a runnable check follows the list below):

  1. LoRA Path (right side):
    • First compute A @ x, a small vector of length r = 2
    • Then compute B @ (A @ x), a vector of length out = 4
    • Scale it: multiply by alpha / r = 2
  2. Frozen Path (left side):
    • Compute W_frozen @ x, exactly as the original frozen layer would
  3. Final Output:
    • y = W_frozen @ x + 2 * (B @ (A @ x)), the sum of the two paths
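A quick sketch (again with made-up values) confirms that the decomposed forward pass matches the merged weight matrix exactly:

import torch

W_frozen, A, B = torch.randn(4, 3), torch.randn(2, 3), torch.randn(4, 2)
scaling = 2.0                  # alpha / r
x = torch.randn(3)

# Decomposed path used during training: never forms B @ A
y_decomposed = W_frozen @ x + scaling * (B @ (A @ x))

# Merged path used for inference: W_effective = W_frozen + scaling * (B @ A)
y_merged = (W_frozen + scaling * (B @ A)) @ x

print(torch.allclose(y_decomposed, y_merged))  # True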
Method           | Trainable Parameters | Calculation             | Parameter Reduction
Full Fine-Tuning | 16,777,216           | 4096 * 4096             | 0%
LoRA (r=8)       | 65,536               | (8 * 4096) + (4096 * 8) | 99.61%
By performing the efficient forward pass during training, we only need to store and update the parameters for the tiny A and B matrices, achieving a >99% parameter reduction while still being able to modify the behavior of the massive base layer
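The table's numbers can be reproduced in a couple of lines (4096 again stands in for a typical hidden size, with r = 8):

hidden, r = 4096, 8

full_ft = hidden * hidden              # 16,777,216
lora = (r * hidden) + (hidden * r)     # 65,536

print(full_ft, lora)
print(f"{100 * (1 - lora / full_ft):.2f}% reduction")  # 99.61% reduction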

LoRALinear Module

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
 
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int, alpha: float = 16.0):
        super().__init__()
        # --- Store hyperparameters ---
        self.r = r
        self.alpha = alpha
        self.scaling = self.alpha / self.r
 
        # --- Store and freeze the original linear layer ---
        self.base = base
        self.base.weight.requires_grad_(False)
        # Also freeze the bias if it exists
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
 
        # --- Create the trainable LoRA matrices A and B ---
        # A has shape [r, in_features]
        # B has shape [out_features, r]
        self.lora_A = nn.Parameter(torch.empty(r, self.base.in_features))
        self.lora_B = nn.Parameter(torch.empty(self.base.out_features, r))
 
        # --- Initialize the weights ---
        # A is initialized with a standard method
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        # B is initialized with zeros
        nn.init.zeros_(self.lora_B)
 
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1. The original, frozen path
        base_output = self.base(x)
 
        # 2. The efficient LoRA path: B(A(x))
        # F.linear(x, self.lora_A) computes x @ A.T
        # F.linear(..., self.lora_B) computes (x @ A.T) @ B.T
        lora_update = F.linear(F.linear(x, self.lora_A), self.lora_B) * self.scaling
 
        # 3. Return the combined output
        return base_output + lora_update
  1. __init__(self, base, r, alpha):
    • It accepts the original nn.Linear layer (base) that we want to adapt
    • self.base.weight.requires_grad_(False): Here we tell PyTorch’s autograd engine not to compute gradients for the original weights, so they will never be updated by the optimizer
    • nn.Parameter(...): We register lora_A and lora_B as official trainable parameters of the module. Their shapes are derived directly from the base layer and the rank r
    • nn.init.zeros_(self.lora_B): By starting B as a zero matrix, the entire LoRA update (B @ A) is zero at the beginning of training. This means our LoRALinear layer initially behaves exactly like the original frozen layer, and the model learns the “change” from a stable starting point
  2. forward(self, x):
    • This is a direct translation of the formula: output = base(x) + (alpha / r) * B(A(x))
    • We compute the output of the frozen path and the LoRA path separately
    • The nested F.linear calls are a highly efficient PyTorch way to compute (x @ A.T) @ B.T without ever forming the full ΔW = B @ A matrix
    • Finally, we add them together (a quick check of a freshly wrapped layer follows this list)
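Here is a minimal check of the zero-initialization behavior, using a throwaway nn.Linear purely for illustration (this assumes the LoRALinear class defined above):

import torch
import torch.nn as nn

base = nn.Linear(16, 8)
wrapped = LoRALinear(base, r=4, alpha=16.0)

x = torch.randn(2, 16)
# Because lora_B starts at zero, the wrapped layer initially matches the base layer
print(torch.allclose(wrapped(x), base(x)))  # True

# Only the LoRA matrices require gradients
print([n for n, p in wrapped.named_parameters() if p.requires_grad])  # ['lora_A', 'lora_B']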

Applying LoRA to a Model

Now we need a helper function to swap out the nn.Linear layers in any given model with our new LoRALinear layer

def apply_lora(model: nn.Module, r: int, alpha: float = 16.0):
    """
    Replaces all nn.Linear layers in a model with LoRALinear layers.
    """
    for name, module in list(model.named_modules()):
        if isinstance(module, nn.Linear):
            # Find the parent module that owns this layer
            if '.' in name:
                parent_name, child_name = name.rsplit('.', 1)
                parent_module = model.get_submodule(parent_name)
            else:
                # The layer is a direct child of the model (e.g., inside nn.Sequential)
                parent_module, child_name = model, name

            # Replace the original linear layer with the LoRA-wrapped version
            setattr(parent_module, child_name, LoRALinear(module, r=r, alpha=alpha))

Full Implementation

  1. Create a toy model:
    model = nn.Sequential(
        nn.Linear(128, 256),
        nn.ReLU(),
        nn.Linear(256, 10) # e.g., for classification
    )
  2. Inject LoRA layers:
    apply_lora(model, r=8, alpha=16.0)
    print(model)

The output will show that our nn.Linear layers have been replaced by LoRALinear
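With PyTorch's default module repr, the printed structure should look roughly like this:

Sequential(
  (0): LoRALinear(
    (base): Linear(in_features=128, out_features=256, bias=True)
  )
  (1): ReLU()
  (2): LoRALinear(
    (base): Linear(in_features=256, out_features=10, bias=True)
  )
)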

  3. Isolate the Trainable Parameters:
    • We create an optimizer that only sees the LoRA weights
    # Filter for parameters that require gradients (only lora_A and lora_B)
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    trainable_param_names = [name for name, p in model.named_parameters() if p.requires_grad]
     
    print("\nTrainable Parameters:")
    for name in trainable_param_names:
        print(name)
     
    # Create an optimizer that only updates the LoRA weights
    optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
    • Output:
    Trainable Parameters:
    0.lora_A
    0.lora_B
    2.lora_A
    2.lora_B

The optimizer is completely unaware of the massive, frozen weights (0.base.weight, 2.base.weight, etc.) and will only update our tiny, efficient LoRA matrices
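You can also compare trainable and total parameter counts for the toy model built above. Because this model is tiny, the trainable fraction here is larger than the >99% savings you would see on LLM-sized layers, but the mechanism is identical:

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters:     {total:,}")
print(f"Trainable parameters: {trainable:,} ({100 * trainable / total:.2f}%)")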

Q Where does LoRA actually go in an LLM?
A The nn.Linear layers we’ve been working with are the primary components of a Transformer. When you apply LoRA to a model like Llama or Mistral, you are targeting these specific linear layers:

  • Self-Attention Layers: The most common targets are the projection matrices for the query (q_proj) and value (v_proj). Adapting these allows the model to change what it pays attention to in the input text, which is incredibly powerful for task-specific fine-tuning
  • Feed-Forward Layers (MLP): Transformers also have blocks of linear layers that process information after the attention step. Applying LoRA here helps modify the model’s learned representations and knowledge

So, when you see a LoRA implementation for a real LLM, the apply_lora function is simply more selective, replacing only the linear layers named q_proj, v_proj, etc., with the LoRALinear module you just built
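As a sketch, such a selective version might look like this. It reuses the LoRALinear class and the nn import from above; the names q_proj and v_proj follow common Llama/Mistral conventions, and target_names is a hypothetical parameter, not any particular library's API:

def apply_lora_selective(model: nn.Module, r: int, alpha: float = 16.0,
                         target_names=("q_proj", "v_proj")):
    """Wrap only the nn.Linear layers whose leaf names match the chosen targets."""
    for name, module in list(model.named_modules()):
        if isinstance(module, nn.Linear) and name.split('.')[-1] in target_names:
            if '.' in name:
                parent_name, child_name = name.rsplit('.', 1)
                parent_module = model.get_submodule(parent_name)
            else:
                parent_module, child_name = model, name
            setattr(parent_module, child_name, LoRALinear(module, r=r, alpha=alpha))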

Q Why does this work so well?
A The knowledge needed to adapt a pre-trained model to a new task is much simpler than the model's entire knowledge base. You don't need to re-learn the entire English language to make a model a better chatbot; you only need to steer its existing knowledge. This "steering" information lives in a low-dimensional space, which a low-rank update ΔW = B @ A captures well