Reference Book: Practical Machine Learning for Computer Vision by Valliappa Lakshmanan, Martin Gorner & Ryan Gillard
Associated GitHub Repo: Practical-ML-Vision-Book

Difference between Computer Vision and Machine Vision

  • Computer Vision:
    • Enabling machines to understand images and videos
    • Used in facial recognition, self-driving cars, and medical image analysis
    • For example, to recognize whether a given image contains an apple, we use an AI model to determine whether an apple is present in it
  • Machine Vision:
    • Practical, industrial use of vision-based systems, often used in manufacturing and quality-control systems
    • Typically integrates cameras, lighting, sensors, and specialized software to inspect products, guide robots, and monitor processes in real time
    • For example, to recognize whether a given image contains an apple, we use hardware and hand-written rules to analyze the pixels, such as checking whether a rounded cluster of pixels is present, or applying a fixed matrix as a filter to the input image matrix

Machine vision is very hard to pull off, and even when we succeed, some edge case will eventually be violated. ML-based computer vision, in contrast, can be trained to identify the media; edge cases still exist, but they are few, and the model can learn to handle them rather than a human writing code for each edge case.
The breakthrough in ML-based computer vision came with the AlexNet paper, released in 2012, which won the ImageNet competition with a top-5 error rate of just 15.3%. After this point, computer vision became based more on deep learning than on classical machine learning. Later this expanded to NLP, robotics, and beyond.
Note:

  • Difference between Deep Learning and Machine Learning
    • In machine learning, feature extraction and classification are performed as separate steps, which together give the output
    • In deep learning, feature extraction and classification are done together by the deep learning model we build; if the output is good we say the loss is low, and if the output is bad we say the loss is high
  • Difference between Shallow neural network and Deep neural network
    • Shallow neural network - at most 1-2 hidden layers
    • Deep neural network - many hidden layers

Image Filter and Convolution

Before deep learning techniques were used in computer vision, to check whether a particular thing is present in an image we performed a convolution on the image: a small matrix containing the pattern of pixels we are looking for is slid over the input matrix, and at each position the two are multiplied element-wise and summed. The small matrix is called a “Kernel”

  • Kernel Size - The dimensions of the filter matrix (e.g., 3×3) that is slid over the input matrix
  • Stride - The number of pixels a convolutional filter (or kernel) moves as it slides across the input image or feature map during the convolution operation
  • Padding - The addition of extra pixels, typically zeros, around the borders of an input image before a convolution operation is performed
  • Dilation - A technique that expands the kernel’s receptive field by introducing gaps or “holes” between its elements, allowing it to process a larger input area without increasing the number of parameters or decreasing output resolution
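These four parameters determine the size of the output feature map. A quick sketch of the standard formula (the helper name `conv_output_size` is my own, not from the book):

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output size along one dimension for a convolution:
    floor((n + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1"""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

print(conv_output_size(5, kernel=3, padding=1))            # 5: "same" padding
print(conv_output_size(5, kernel=3))                       # 3: no padding shrinks it
print(conv_output_size(7, kernel=3, stride=2, padding=1))  # 4: stride halves it
```

With kernel 3, stride 1, padding 1, dilation 1 the output stays the same size as the input, which is exactly the setup in the diagram below.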
graph TB
    subgraph "Step 1: Input Matrix (5x5)"
        A["| 75 | 80 | 80 | 80 | 75 |
         | 0  | 75 | 80 | 80 | 0  |
         | 0  | 75 | 80 | 80 | 0  |
         | 0  | 75 | 80 | 80 | 0  |
         | 0  | 0  | 0  | 0  | 0  |"]
    end
    
    subgraph "Step 2: Add Padding (7x7)"
        B["| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
         | 0 |75 |80 |80 |80 |75 | 0 |
         | 0 | 0 |75 |80 |80 | 0 | 0 |
         | 0 | 0 |75 |80 |80 | 0 | 0 |
         | 0 | 0 |75 |80 |80 | 0 | 0 |
         | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
         | 0 | 0 | 0 | 0 | 0 | 0 | 0 |"]
    end
    
    subgraph "Step 3: Kernel (3x3)"
        C["| -1 |  0 |  1 |
         | -1 |  0 |  1 |
         | -1 |  0 |  1 |"]
    end
    
    subgraph "Step 4: Convolution Positions"
        D1["Position (1,1):
          | 0  0  0 |   | -1  0  1 |
          | 0 75 80 | * | -1  0  1 |
          | 0  0 75 |   | -1  0  1 |
          = 155"]
          
        D2["Position (1,2):
          | 0  0  0 |   | -1  0  1 |
          |75 80 80 | * | -1  0  1 |
          | 0 75 80 |   | -1  0  1 |
          = 5"]
          
        D3["Position (1,3):
          | 0  0  0 |   | -1  0  1 |
          |80 80 80 | * | -1  0  1 |
          |75 80 80 |   | -1  0  1 |
          = 5"]
          
        D4["Position (1,4):
          | 0  0  0 |   | -1  0  1 |
          |80 80 75 | * | -1  0  1 |
          |80 80  0 |   | -1  0  1 |
          = -5"]
          
        D5["Position (1,5):
          | 0  0  0 |   | -1  0  1 |
          |80 75  0 | * | -1  0  1 |
          |80  0  0 |   | -1  0  1 |
          = -155"]
    end
    
    subgraph "Step 5: Output Matrix (5x5)"
        E["| 155 |  5  |  5  | -5  |-155|
         |  75 |  0  |  0  |  0  | -75 |
         |  75 |  0  |  0  |  0  | -75 |
         |  75 |  0  |  0  |  0  | -75 |
         |   0 | -75 |-80 |-80 |   0 |"]
    end
    
    subgraph "Parameters"
        F["📋 Convolution Settings:
         • Kernel Size: 3×3
         • Stride: 1
         • Padding: 1
         • Dilation: 1
         • Output Size: 5×5"]
    end
    
    A -->|Add Zero Padding| B
    B -->|Apply Kernel| C
    C -->|Slide & Compute| D1
    C -->|Slide & Compute| D2
    C -->|Slide & Compute| D3
    C -->|Slide & Compute| D4
    C -->|Slide & Compute| D5
    D1 --> E
    D2 --> E
    D3 --> E
    D4 --> E
    D5 --> E
    
    style A fill:#e3f2fd
    style B fill:#e8f5e8
    style C fill:#f3e5f5
    style E fill:#fff3e0
    style F fill:#f9f9f9
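The walkthrough above can be reproduced in a few lines of NumPy. This is a minimal sketch (the `conv2d` helper is my own, not from the book's repo); note it computes cross-correlation, which is what CNN libraries call "convolution":

```python
import numpy as np

# The 5x5 input and 3x3 vertical-edge kernel from the diagram above.
image = np.array([
    [75, 80, 80, 80, 75],
    [ 0, 75, 80, 80,  0],
    [ 0, 75, 80, 80,  0],
    [ 0, 75, 80, 80,  0],
    [ 0,  0,  0,  0,  0],
])
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
])

def conv2d(img, k, stride=1, padding=1):
    """Slide the kernel over the zero-padded image; at each position,
    multiply element-wise and sum (cross-correlation, as in CNNs)."""
    padded = np.pad(img, padding)          # zero padding on all sides
    kh, kw = k.shape
    oh = (padded.shape[0] - kh) // stride + 1
    ow = (padded.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow), dtype=img.dtype)
    for i in range(oh):
        for j in range(ow):
            window = padded[i*stride : i*stride+kh, j*stride : j*stride+kw]
            out[i, j] = np.sum(window * k)
    return out

result = conv2d(image, kernel)
print(result)  # top-left value is 155, matching Position (1,1) in Step 4
```

The kernel's -1/+1 columns respond strongly at the left and right borders of the bright region and give small values in its flat interior.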

We use filters for the following purposes

  • Feature Extraction: Edges, corners, textures
  • Noise Reduction: Gaussian blur smooths variations
  • Enhancement: Sharpen or boost contrast
  • Pattern Recognition: Pre-neural-net CV relied on hand-crafted filters

Edges correspond to abrupt intensity changes, i.e., derivatives

  • 1D finite difference: I′(x) ≈ I(x+1) − I(x−1), i.e., a correlation with the kernel [-1, 0, 1]
  • Extend to 2D for images by computing ∂I/∂x and ∂I/∂y over pixel grids
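As a quick sketch, the 1D finite difference is just a correlation with the kernel [-1, 0, 1] (up to a factor of 1/2), which fires only where the signal jumps:

```python
import numpy as np

# A step signal: flat, then a jump, then flat again.
signal = np.array([0, 0, 0, 10, 10, 10], dtype=float)

# Correlate with [-1, 0, 1]: each output is I(x+1) - I(x-1).
diff = np.correlate(signal, np.array([-1.0, 0.0, 1.0]), mode='valid')
print(diff)  # nonzero only around the jump: [ 0. 10. 10.  0.]
```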

Sobel Filter

  • Approximates the first derivative combined with smoothing
  • Horizontal gradient (Gx): the kernel shown below
  • Vertical gradient (Gy, approximating ∂I/∂y): the transpose of Gx
  • The 1-2-1 smoothing weights reduce noise by weighting the center row more heavily
graph LR
    subgraph Sobel_X
      A["-1  0  1"]  
      B["-2  0  2"]  
      C["-1  0  1"]  
    end
    A --> B --> C
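A small sketch of both Sobel kernels on a synthetic vertical step edge (variable names are my own):

```python
import numpy as np

# Sobel x-kernel from the diagram; the y-kernel is its transpose.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

# Vertical step edge: dark left half, bright right half.
edge = np.zeros((5, 5))
edge[:, 3:] = 255

def response_at(img, k, i, j):
    # Kernel response centered at pixel (i, j), no padding.
    return int(np.sum(img[i-1:i+2, j-1:j+2] * k))

print(response_at(edge, sobel_x, 2, 2))  # strong: intensity changes along x
print(response_at(edge, sobel_y, 2, 2))  # 0: no change along y on this edge
```

Gx responds to vertical edges (horizontal intensity change) and Gy to horizontal edges, which is why both are usually combined into a gradient magnitude.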

Prewitt Filter

  • Simpler weights than Sobel (no doubled center weight): every row of Gx is | -1  0  1 |, and Gy is its transpose

Second Derivative - Laplacian Filter

  • Highlights rapid intensity changes (sharper edges).
  • Kernel example:
  • Sum of 4 neighbors minus 4× center approximates ∂²I/∂x² + ∂²I/∂y².
graph LR
    subgraph Laplacian
      X["0  1  0"]  
      Y["1 -4  1"]  
      Z["0  1  0"]  
    end
    X --> Y --> Z
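A quick sketch of the Laplacian kernel's behavior: it gives zero on flat regions and a strong response where intensity changes abruptly (names are my own):

```python
import numpy as np

# Laplacian kernel from the diagram: 4 neighbors minus 4x the center.
lap = np.array([[0,  1, 0],
                [1, -4, 1],
                [0,  1, 0]])

def response_at(img, k, i, j):
    # Kernel response centered at pixel (i, j), no padding.
    return int(np.sum(img[i-1:i+2, j-1:j+2] * k))

flat = np.full((5, 5), 80)      # uniform region: second derivative is zero
spot = flat.copy()
spot[2, 2] = 200                # a single bright pixel

print(response_at(flat, lap, 2, 2))  # 0: no intensity change
print(response_at(spot, lap, 2, 2))  # large negative: sharp peak detected
```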

Custom Filters

  1. Identity: 1 at the center, 0 elsewhere (output equals input)
  2. Edge Detectors:
    • Top edges: negative row at top, positive at center
    • Bottom, left, right: analogous patterns
  3. Outline Detection: Combine horizontal + vertical responses
  4. Blurring: Uniform average filter (all weights = 1/9)
  5. Denoising: Gaussian kernels of larger size (e.g., 5×5)
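A minimal sketch of the identity and blurring filters above (the `correlate` helper is my own and uses valid mode, so the output is smaller than the input):

```python
import numpy as np

def correlate(img, k):
    # Valid-mode cross-correlation: slide k over img, multiply and sum.
    kh, kw = k.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
    return out

identity = np.zeros((3, 3))
identity[1, 1] = 1              # passes each pixel through unchanged
box_blur = np.full((3, 3), 1/9)  # uniform average of the 3x3 neighborhood

img = np.arange(25, dtype=float).reshape(5, 5)  # a linear intensity ramp
print(correlate(img, identity))  # the inner 3x3 of img, untouched
print(correlate(img, box_blur))  # same values: averaging preserves a ramp
```

The second print illustrates why blurring removes noise but not smooth gradients: averaging a locally linear region returns its center value.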

Filters in CNNs

  • CNNs learn filter values via backpropagation instead of hand-crafting them
  • Early layers often resemble edge detectors and Gabor-like patterns
  • Learned filters detect low-level features (edges, textures), deeper layers capture complex patterns
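As a toy sketch of this idea (a hypothetical setup I made up, not from the book): if we treat a fixed edge kernel as the "ground truth", plain gradient descent on input/output pairs recovers it, which is essentially what backpropagation does for each filter in a CNN.

```python
import numpy as np

rng = np.random.default_rng(0)

# The kernel we pretend is unknown and want the "network" to learn.
true_kernel = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])

def im2col(img, k=3):
    # Every kxk window of img, flattened into a row (valid positions only).
    h, w = img.shape
    return np.array([img[i:i+k, j:j+k].ravel()
                     for i in range(h - k + 1)
                     for j in range(w - k + 1)])

x = rng.normal(size=(16, 16))       # random training image
A = im2col(x)                       # window matrix, one row per position
y = A @ true_kernel.ravel()         # target filter responses

w = np.zeros(9)                     # learned kernel, starts at zero
for _ in range(2000):
    grad = A.T @ (A @ w - y) / len(y)   # gradient of mean squared error
    w -= 0.1 * grad                     # gradient descent step

print(np.round(w.reshape(3, 3), 2))  # converges toward true_kernel
```

Real CNNs learn hundreds of such filters jointly, through many layers, but the mechanism per filter is the same: adjust weights to reduce a loss.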