Object detection-related tasks come in three types:

  1. Image Classification - We give a name to the whole image. e.g.: "A Car"
  2. Object Localization - We locate one instance of an object category in the image and show it using a bounding box. e.g.: drawing a box around the car in the image
  3. Object Detection - We locate all instances of all the classes inside the image; the classes come from a predefined set of categories in the dataset. e.g.: a bunch of cars and the mountain behind them should all get bounding boxes

Our task is to design a model that draws bounding boxes (a square/rectangle box indicating a feature/object) for a given image and thus performs object detection. The bounding box can be represented in either of these two ways, and the arrows show the co-ordinates ![[Pasted image 20251114203758.png]]

Approach 1: CNN

  • Get the input image, draw all the possible squares/rectangles in it, and pass each crop through convolution and pooling layers; this extracts the features and flattens them into a tensor. From here we have two branches to calculate the loss:
    1. The tensor is passed to a fully connected classifier (4096 to 1000) at the end that gives the scores for each possible object class; these are compared with the correct label, which gives us the Cross Entropy Loss
    2. The same tensor is also passed to a fully connected regressor (4096 to 4) at the end that gives the bounding box co-ordinates; these are compared with the correct box labels, which gives us the L2 Loss
  • Cross Entropy Loss + L2 Loss = Total Loss (a minimal sketch of these two heads follows after this list)
  • But this approach can only detect one object: if we give an image with multiple cars, it can only detect one car. This setting, where only a single object can be located, is called Object Localization
  • It also takes a huge number of operations to get the result, since every possible box position and size has to be evaluated, where:
    • W, H - width and height of the input image
    • w, h - width and height of the bounding box
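A minimal, hedged PyTorch sketch of the two branches described above (a classification head and a box-regression head sharing one backbone) and the combined loss; the layer sizes, module names, and dummy tensors are illustrative, not the exact architecture from these notes.

```python
import torch
import torch.nn as nn

class SingleObjectDetector(nn.Module):
    """Sketch only: shared backbone, one class head, one box head."""
    def __init__(self, feat_dim=4096, num_classes=1000):
        super().__init__()
        self.backbone = nn.Sequential(                 # stand-in for conv + pooling layers
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(16 * 8 * 8, feat_dim), nn.ReLU(),
        )
        self.cls_head = nn.Linear(feat_dim, num_classes)   # class scores
        self.box_head = nn.Linear(feat_dim, 4)             # (x, y, w, h)

    def forward(self, images):
        feats = self.backbone(images)
        return self.cls_head(feats), self.box_head(feats)

model = SingleObjectDetector()
images = torch.randn(2, 3, 224, 224)
labels = torch.tensor([3, 7])              # dummy class labels
gt_boxes = torch.rand(2, 4)                # dummy box targets
scores, boxes = model(images)
total_loss = nn.CrossEntropyLoss()(scores, labels) + nn.MSELoss()(boxes, gt_boxes)
total_loss.backward()
```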

Approach 2: R-CNN

R-CNN consists of 6 steps

  1. Extract Region Proposals using Selective Search
    • Rather than exploring every possible bounding box in the input image, we use an external algorithm called Selective Search to narrow the candidates down to region proposals (~2000 per image). Proposals that overlap heavily with each other (IOU >= 0.5, i.e. > 50% overlap) are filtered, and each remaining proposal with co-ordinates (px, py, pw, ph) is warped into a fixed square (227x227, for AlexNet compatibility). The box is also dilated (slightly enlarged before warping) so that it carries p pixels (p = 16) of surrounding context; if the proposal already touches the image border, padding is added so that the p = 16 context is still obtained. This completes our dataset
  2. Train AlexNet (or VGGNet, which gives higher accuracy but takes more time) on the ImageNet classification dataset
    • To perform feature extraction we need a classifier, so we train a large convolutional network such as AlexNet
    • After this we remove the last classification layer and add a new classification layer that caters to the classes present in our detection dataset
    • Now if we have two classes to detect, e.g.: Person and Car (say n classes), then our classifier layer will have three outputs (n + 1): two for the classes and one for background
  3. Fine tune CNN with resized proposals on classes of detection dataset and background class
    • Here we divide the data into two batches
      1. Images which have ground truth (human-labelled, correct) bounding boxes
      2. Images with region proposals (~2000 bounding boxes each)
    • Using these two batches we fine tune our network on the categories present in our detection dataset plus the background class, training the whole network with cross-entropy loss. After fine tuning we remove the classification layer and use the fully connected layer output as the feature representation of the proposals

Q What if a proposal overlaps with two ground truth boxes, e.g.: a dog in the lap of a human? Do we label the proposal as human or as dog?
A We pick the ground truth box that has the higher overlap with the proposal; since the human is bigger here, we label it as human

  4. Train a Binary Classifier (SVM) for each class on the fully connected layer representation of the proposals
    • An SVM (Support Vector Machine) is a linear classifier that learns a decision boundary in the 4096-dimensional feature space; the Person-class SVM, for example, learns to use these features to classify a proposal as Person or not Person
    • We train one linear SVM per class, which takes this feature representation and learns to classify whether it is a positive or a negative instance of the class (a small training sketch follows this list)
    • Here we use the ground truth labels as the reference to label an input as positive or negative
      • Positive labelled proposals for class K: the ground truth boxes of class K
      • Negative labelled proposals for class K: proposal boxes with < 0.3 IOU with all ground truth instances of that class
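A small sketch of this step, assuming the 4096-d FC features have already been extracted; the arrays, the 0.3-IOU labelling, and the use of scikit-learn's LinearSVC are illustrative stand-ins for the per-class SVMs.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Dummy features standing in for extracted FC7 vectors (4096-d)
rng = np.random.default_rng(0)
gt_features = rng.normal(size=(50, 4096))    # positives: ground-truth boxes of class K
neg_features = rng.normal(size=(500, 4096))  # negatives: proposals with IoU < 0.3 vs all GT of class K

X = np.vstack([gt_features, neg_features])
y = np.concatenate([np.ones(len(gt_features)), np.zeros(len(neg_features))])

# One linear SVM per class; at test time its decision score ranks the proposals
svm_class_k = LinearSVC(C=1.0)
svm_class_k.fit(X, y)
scores = svm_class_k.decision_function(neg_features[:5])   # higher = more likely class K
```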

Q Since we already have a fine-tuned model from step 3, why do we need the SVM? Or, since the SVM classifies each class on the FC-layer representation of the proposals in step 4, why do we need the fine-tuned CNN from step 3?
A They act as two layers of filtering

  • Fine-tuned CNN - gives multiple candidate bounding boxes for the target and avoids overfitting
  • SVM - acts as a Boolean filter on the boxes coming from the fine-tuned model
  5. Train a class-specific Bounding Box Regressor on top of the proposal features
    • We are currently relying on the region proposals coming from the Selective Search algorithm, but there are problems with them:

      • Proposal slightly misses left edge of subject → box prediction will be inaccurate
      • Proposal too large around person → includes extra background
      • Proposal with wrong aspect ratio → box needs adjustment
    • Input: Proposal box P with coordinates (px, py, pw, ph), where (px, py) = center and (pw, ph) = width/height. Target: Ground truth box G with coordinates (gx, gy, gw, gh). Learn: a transformation T such that T(P) ≈ G

      T is defined by parameters (tx, ty, tw, th), and the transformation is (a small sketch applying it follows below):

      • gx ≈ px + pw · tx
      • gy ≈ py + ph · ty
      • gw ≈ pw · exp(tw)
      • gh ≈ ph · exp(th)
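A tiny sketch that computes the regression targets (tx, ty, tw, th) implied by the transformation above and applies T(P) back to a proposal; the box values are made up.

```python
import math

# Boxes are (cx, cy, w, h); the numbers are made up for illustration
proposal = (100.0, 120.0, 80.0, 60.0)       # P = (px, py, pw, ph)
ground_truth = (110.0, 118.0, 96.0, 66.0)   # G = (gx, gy, gw, gh)

def regression_targets(p, g):
    """Targets (tx, ty, tw, th) the regressor should learn for proposal p."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))

def apply_transform(p, t):
    """T(P): map the proposal with predicted offsets back to a box."""
    px, py, pw, ph = p
    tx, ty, tw, th = t
    return (px + pw * tx, py + ph * ty, pw * math.exp(tw), ph * math.exp(th))

t = regression_targets(proposal, ground_truth)
print(apply_transform(proposal, t))  # recovers the ground-truth box
```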
  6. Filter predictions using NMS (Non-Maximum Suppression)
    • After all previous steps, for a single object, we might have:
      • Multiple proposals (from Selective Search) hitting same object
      • Each proposal refined by bounding box regressor
      • Each passed SVM classification
    • This leads to multiple overlapping boxes for the same object instance, for example:
      • Person in image gets predicted as:
        • Box 1: (150, 200, 100, 150) - confidence 0.95
        • Box 2: (153, 202, 102, 148) - confidence 0.92
        • Box 3: (155, 205, 98, 145) - confidence 0.88
        • Box 4: (200, 300, 100, 150) - person wearing hat (different person, low confidence)
        • Box 5: (230, 290, 95, 140) - hat area (very low confidence)
    • To solve this problem, we follow the NMS Algorithm
  • NMS is a post-processing technique that eliminates duplicate and overlapping bounding boxes, selecting only the most confident and relevant boxes corresponding to detected objects
  • The algorithm operates by iterating through predicted bounding boxes and uses two key metrics: confidence scores and an Intersection over Union (IOU) threshold
  • If two bounding boxes have an IOU overlap exceeding the defined threshold (typically 0.5), they are considered duplicates pointing to the same object
  • NMS keeps the box with the highest confidence score and suppresses (removes) all other overlapping boxes that have IOU greater than the threshold
  • If overlapping bounding boxes point to different objects, they are both retained since they represent distinct detections
  • The IOU threshold is a user-defined hyperparameter: a lower threshold results in fewer detections by suppressing more boxes, while a higher threshold may allow multiple detections for the same object
  • Mean Average Precision (mAP) is an evaluation metric used to assess overall model performance after detection and post-processing are complete: an Average Precision (AP) value is computed per class, and mAP is the average of these AP values across all classes. The IOU metric is used within the mAP calculation to determine whether a predicted box counts as a true positive or a false positive
  • At last we have a model that has been trained to predict the objects of the given classes in an input image

Approach 3: Fast R-CNN

It's simply making R-CNN faster; there are three places where we can optimize it

1. Object detection is Slow
  • There is a problem of slow inference: at test time, features are extracted from each object proposal in each test image, which means detection with VGG16 takes 47 sec/image (on an Nvidia K40 GPU, an older GPU)
  • Let's optimize it. We currently do a forward pass through the CNN for every proposal of a given image, because we need a feature representation of each proposal for the later prediction layers. Instead of running the CNN separately on every proposal, we can run the CNN once on the entire image and then crop each proposal's region directly from the final feature map. However, these cropped regions have different spatial sizes, so flattening them produces feature vectors of varying dimensions, and we cannot feed these into fully connected layers because they expect a fixed-size input
  • ROI Pooling solves this by taking each proposal's feature-map region and converting it into a fixed-size output (e.g., 7×7). This ensures every proposal produces a uniform-length vector after flattening, allowing consistent input to the FC layers (see the short usage sketch below)
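A short usage sketch of this idea with torchvision's `roi_pool` (assuming torchvision is available); the feature map, proposals, and `spatial_scale` value are illustrative.

```python
import torch
from torchvision.ops import roi_pool

# One backbone pass for the whole image, then pool each proposal's region to 7x7
feature_map = torch.randn(1, 512, 50, 50)            # backbone output for one image
# Proposals in image coordinates, format (batch_index, x1, y1, x2, y2)
proposals = torch.tensor([[0, 10.0, 20.0, 200.0, 180.0],
                          [0, 50.0, 60.0, 400.0, 300.0]])
# spatial_scale maps image coords -> feature-map coords (here stride 16 -> 1/16)
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7]) -> fixed-size input for the FC layers
```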
2. Training is Multi Stage Pipeline

  • Rather than one training task, to train R-CNN we need to manage an entire pipeline of three sequential training stages (fine-tuning the CNN, the SVMs, and the bounding box regressors)
  • Let's optimize it. Rather than having 3 stages of training as in R-CNN, we take the flattened ROI pooling output, feed it to a few common fully connected layers, and then have two branches of fully connected layers:
    1. Classification FC Layer - Responsible for predicting the class probabilities for the proposal for the K + 1 Classes including Background
    2. Bounding Box FC Layer - It predicts 4K values, which are the four regression values for the K classes
3. Training is Expensive in Space and Time
  • For SVM and bounding box regressor training, features are extracted from each object proposal in each image(~2k region proposals) and written to disk
Implementation Details

Quantization and Coordinate Alignment Issues:
When mapping an ROI from image space to feature map space, we encounter an issue with non-integer boundaries
Setup:

  • Image size: 400 × 400
  • CNN stride: 40 → feature map becomes 10 × 10
  • ROI (image space):

Mapping to feature map: divide the ROI coordinates in image space by the stride (40); this generally gives non-integer boundaries on the 10 × 10 feature map

Feature maps require integer indexing, so we quantize (round/floor) these boundaries:

Misalignment introduced:

  • Some true ROI pixels get excluded
  • Some outside pixels get incorrectly included

This is an acceptable approximation in Fast R-CNN

Bin Division Using Floor/Ceil: For an ROI of size h_roi × w_roi (on the feature map),

divided into a grid of h_pool × w_pool bins, the bin sizes are bin_size_h = h_roi / h_pool and bin_size_w = w_roi / w_pool

bin boundaries: h_start = floor(j · bin_size_h), h_end = ceil((j+1) · bin_size_h)

Similarly: w_start = floor(i · bin_size_w), w_end = ceil((i+1) · bin_size_w)

Example (8×7 ROI → 2×2 bins):

  • Bin height: 8 / 2 = 4
  • Bin width: 7 / 2 = 3.5. For bin (1,1): rows floor(1·4) = 4 to ceil(2·4) = 8, columns floor(1·3.5) = 3 to ceil(2·3.5) = 7
```python
import numpy as np
from math import floor, ceil

def roi_max_pool(roi_features, h_pool, w_pool):
    """Max-pool an ROI's feature-map region (C x H x W) into a fixed C x h_pool x w_pool grid."""
    C, H, W = roi_features.shape
    bin_size_h = H / h_pool
    bin_size_w = W / w_pool
    output = np.zeros((C, h_pool, w_pool), dtype=roi_features.dtype)
    for c in range(C):
        for j in range(h_pool):
            for i in range(w_pool):
                # floor/ceil so every feature-map cell falls into at least one bin
                h_start = floor(j * bin_size_h)
                h_end   = ceil((j + 1) * bin_size_h)
                w_start = floor(i * bin_size_w)
                w_end   = ceil((i + 1) * bin_size_w)
                region = roi_features[c, h_start:h_end, w_start:w_end]
                output[c, j, i] = region.max()
    return output
```
Transition to FC Layers & Output Heads

Flattening:
ROI pooled feature map (VGG-16): 512 × 7 × 7 per proposal → flattened to a 25,088-dimensional vector

FC6 / FC7 Architecture: FC6: 25,088 → 4096, FC7: 4096 → 4096

Pretrained on ImageNet → reused here

Classification Head: FC layer 4096 → K + 1 class scores

Output SoftMax: over the K + 1 classes (K detection classes + background)

Bounding Box Regression Head: FC layer 4096 → 4K outputs (one (tx, ty, tw, th) set per class)

Transforms: same parameterization as the R-CNN bounding box regressor (tx, ty as scale-invariant center shifts, tw, th as log-scale width/height factors)

Multitask Loss (Joint Classification + Localization)

Full Loss:

$$
\mathcal{L}(p, u, t^u, v) = \mathcal{L}_{\text{cls}}(p, u) + \lambda \cdot [u \ge 1] \cdot \mathcal{L}_{\text{loc}}(t^u, v)
$$

$$
\mathcal{L}_{\text{cls}}(p, u) = -\log p_u
$$

$$
\mathcal{L}_{\text{loc}} = \sum_{i \in \{x, y, w, h\}} \text{SmoothL1}\left(t_i^{(u)} - v_i\right)
$$

$$
\text{SmoothL1}(\Delta) =
\begin{cases}
0.5\Delta^2 & |\Delta| < 1 \\
|\Delta| - 0.5 & \text{otherwise}
\end{cases}
$$

Here $[u \ge 1]$ is an indicator that is 1 only when the proposal's label $u$ is a foreground class, so the localization loss is skipped for background proposals.
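A hedged sketch of this multitask loss in PyTorch; the class count, λ value, and all tensors are dummies, and the foreground masking is one possible way to implement the $[u \ge 1]$ indicator.

```python
import torch
import torch.nn.functional as F

num_classes = 21                                # K = 20 classes + 1 background (index 0)
N = 8                                           # proposals in the mini-batch
cls_scores = torch.randn(N, num_classes)        # p: predicted class scores
box_deltas = torch.randn(N, num_classes, 4)     # t^u: one (tx, ty, tw, th) per class
labels = torch.randint(0, num_classes, (N,))    # u: ground-truth class per proposal
box_targets = torch.randn(N, 4)                 # v: regression targets

cls_loss = F.cross_entropy(cls_scores, labels)  # -log p_u

# Localization loss only for foreground proposals (u >= 1), using the deltas
# predicted for each proposal's ground-truth class
fg = labels >= 1
fg_deltas = box_deltas[fg, labels[fg]]
loc_loss = F.smooth_l1_loss(fg_deltas, box_targets[fg], reduction='sum') / fg.sum().clamp(min=1)

lam = 1.0
total = cls_loss + lam * loc_loss
```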

##### Pretrained Initialization

#Q Why ImageNet Pretraining?
#A Generalizable low-level features → faster training, higher accuracy

Changes to be made:
- Replace final max-pool → ROI Pooling
- Keep FC6, FC7
- Remove ImageNet classifier
- Add detection heads (classification + bounding box)

Weight Initialization:
- Conv + FC6, FC7 → pretrained
- Heads → random small Gaussian

##### Sampling & Mini-batches

Batch Structure Example:
- Images per batch: `N = 2`
- Proposals per batch: `R = 64`

IOU-Based Proposal Assignment
- Foreground: `IOU > 0.5`
- Background (hard negatives): `0.1 <= IOU <= 0.5`
- Ignored: `IOU < 0.1`

25% Foreground Sampling Rule

$$
\frac{FG}{\text{Total}} = 0.25
$$

For 64 proposals: 16 FG and 48 BG (hard negatives) ##### Scale Invariance A CNN trained on 224×224 ImageNet images sees objects at a canonical size. In the real world, the same car appears at vastly different pixel sizes: - Close to camera: 400×300 pixels - 10 meters away: 150×100 pixels - 50 meters away: 40×30 pixels If the network only learned one scale, it would fail on distant objects. Scale invariance means the detector performs equally well regardless of object size _Approach 1: Brute-Force Single-Scale_: How it works: - Resize every training and test image to a fixed size, e.g., 600×600 - Train the network on this canonical size - During inference, all images are resized to 600×600 before detection - The network implicitly learns scale invariance from the data distribution (large objects in the dataset get squashed, small ones get enlarged) Limitations: - Distorts aspect ratios (a 1920×1080 image becomes 1:1 square) - Small objects are down sampled further; large objects clipped - Some information loss due to nonuniform scaling _Approach 2: Multi-Scale Image Pyramid_: Core idea: Process each proposal at the scale where it appears closest to its natural 224×224 size (ImageNet standard) 1. Building the Pyramid Create an image pyramid at multiple scales:

\text{Pyramid Scales} = {1.0,; 0.75,; 0.5,; 0.25} \times \text{original image size}

224^2 = 50{,}176

2. Assigning Proposals to Scales - each proposal is assigned to the pyramid scale at which its area is closest to the canonical 224² = 50,176. Example proposal: Width = 320, Height = 240 → Area = 76,800

| Scale | New Size | Area   | Distance to (224^2) |
| ----- | -------- | ------ | ------------------- |
| 1.0   | 320×240  | 76,800 | 26,624              |
| 0.75  | 240×180  | 43,200 | 6,976               |
| 0.5   | 160×120  | 19,200 | 30,976              |
| 0.25  | 80×60    | 4,800  | 45,376              |

3. Feature Extraction per Scale
- Compute backbone features for each pyramid scale
- For each proposal assigned to scale (S):
  - Extract ROI pooled features from that scale's feature map
  - Pass through shared FC layers + heads

Advantages:
- Small objects get upscaled → details preserved
- Large objects get downscaled → fit without clipping
- All proposals are processed near the canonical ImageNet resolution

4. Inference Procedure (Multi-Scale)

```python
for each image:
    create pyramid (scales: 1.0, 0.75, 0.5, 0.25)
    for each scale in pyramid:
        compute backbone features
        run Selective Search on this scaled image
    for each proposal:
        compute its area
        find scale closest to 224^2
        extract RoI pooled features from that scale
        classify + regress
    combine predictions from all scales
    apply NMS per class
```

_Summary_:

| Aspect | Single-Scale | Multi-Scale |
| -------------- | -------------------------------------------- | ------------------------------------------------------------------------- |
| What it does | Resizes all images to 600×600 | Creates image pyramid; assigns each proposal to scale where it's ~224×224 |
| Speed | Baseline (1 backbone pass) | +14% slower (4 backbone passes) |
| Accuracy | 66.9% mAP | 67.4% mAP (+0.5% gain) |
| Why chosen | Speed/accuracy trade-off favors single-scale | Complexity + slowdown not justified |
| Real-world use | All production systems | Research/academia only |
| Future | Replaced by FPN-based methods | Replaced by FPN-based methods |

#### Approach 4: _Faster R-CNN_

Same as before, we optimize the layers to make R-CNN faster. Here the Selective Search step is slow, so we replace it with a CNN module called [[13. Object Detection#rpnregion-proposal-network|RPN (Region Proposal Network)]]. This RPN returns the regions which possibly contain an object (the same functionality as Selective Search), and these RPN-generated regions are passed to ROI pooling ![[Pasted image 20251119164718.png]]

#### Approach 5: _YOLO V1_

There are two kinds of object detection models
1. One Stage, where we detect the object class scores and the object boxes in one pass
2. Two Stage, where the above two happen in two steps

Here we pass the input image to a YOLO CNN model and it outputs multiple bounding boxes and probabilities for those boxes. Here is the step by step implementation of the YOLO CNN model
1. Divide the image into `S x S` grid cells
2. Each target object is assigned to the cell that contains the object's center
3. Each grid cell predicts B bounding boxes
4. YOLO is trained to have the box predictions of each cell as close as possible to the target assigned to that cell, and it follows the steps below to predict a box
   - With respect to the x offset and y offset we calculate the center of the box: if the offset is 0 the center sits at one edge of the cell, if it is 1 it sits at the opposite edge, and with respect to this center we lay out the box
   - We also calculate how confident the model is that the object is in the predicted box, i.e. "how accurate or good a fit the predicted box is for the object it contains"
   - For each bounding box we have 5 prediction values (x, y, w, h, conf), and along with this YOLO also predicts the class conditional probabilities.
Thus at last we predict _(5 x B) + C_ values, where B is the number of bounding boxes and C is the number of classes (the same number as the ground truth classes)
- Here, for each grid cell, YOLO predicts only one set of class probabilities. This means it can't predict multiple objects of different classes from a single grid cell; if we wanted to predict multiple objects there, we would need to run multiple networks, one for each object
- YOLO will predict multiple boxes (B > 1) per grid cell, but only one predictor box is responsible for the target: the one with the highest IOU with the target box

_YOLO Architecture_:
- Given an image, the CNN used by the authors generates an output of size _S x S x ((5 x B) + C)_ for the entire image, i.e. an _S x S_ grid with _(5 x B) + C_ channels, one set of values for each grid cell of the input image
- For example, if we take 3 x 3 grid cells, 2 boxes per grid cell and 20 classes, the CNN returns prediction values that can be reshaped into a 3 x 3 grid output with 30 channels, where each output cell holds the prediction values for that grid cell ![[Pasted image 20251203210518.png]]
- For the model, the authors used the GoogLeNet network, but replaced the inception modules with `1 x 1` and `3 x 3` convolution layers. They first trained this network on the ImageNet classification task by stacking FC layers and training on image sizes of 224 x 224; after that they got rid of the FC layers and added additional convolution layers prior to detection training (specifically 4 convolution layers). After these convolution layers we have two FC layers to predict our `S x S x ((5 x B) + C)`-dimensional tensor of prediction values, which can be reshaped to create our S x S grid outputs. Here is the final architecture ![[Pasted image 20251203211101.png]]

_YOLO Loss_: ![[Pasted image 20251203221812.png]]
- In YOLO any target object is only assigned to one grid cell, the one which contains the center of the object, and out of all the predictor boxes only one is responsible for the target: the one which has the highest IOU with the ground truth
- There are three types of loss in this
  1. Localization Loss - The model learns to push the XY offset and width/height prediction values of the responsible predictor box of a cell closer to the target assigned to that cell

$$
\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]
$$

$$
\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right]
$$

  2. Confidence Loss - The responsible predictor box of a cell that contains an object is pushed towards the target confidence, while boxes of cells without objects are pushed towards zero (their term is down-weighted by λ_noobj)

$$
\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} (C_i - \hat{C}_i)^2
$$

$$
\lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{noobj}_{ij} (C_i - \hat{C}_i)^2
$$

  3. Classification Loss

$$
\sum_{i=0}^{S^2} \mathbb{1}^{obj}_{i} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2
$$

- Penalizes error in predicted class probabilities. Applied only to cells that contain objects - pi(c): ground-truth one-hot class label - pi^(c): predicted probability - And at last we will train the detection network on resized images of 448x448 - Code Implementation: [GitHub](https://github.com/explainingai-code/Yolov1-PyTorch) #### Approach 6: _SSD (Single Shot Multibox Detector_ This is also a one stage object detector, like YOLO V1. It means here we have single model which will be detecting all the bounding boxes and there class probabilities for a class present in an Image Here the bounding boxes implementation is nearly same as in [[13. Object Detection#approach-4-_faster-r-cnn_|Faster R-CNN]] where it uses [[13. Object Detection#rpnregion-proposal-network|RPN]], in which we have a backbone which is VGG16 implementation and using the last convolution layers feature map, we made predictions for the set of preference boxes of specific scales and aspect ratios, centered at the center of each of the feature map grid cells and using 3x3 convolution kernel we predicted transformation parameters to transform reference boxes to tightly fit the underlying object SSD almost works the same but with following differences 1. Rather than just, predicting the probability that the reference or default box contains an object or not, we predict the probability score for all categories present in our dataset, including background 2. We don't just make predictions for default boxes on the last feature map but in fact use feature maps of multiple layers at different resolutions and make predictions for default boxes of different scales and aspect ratios for each of these feature Maps - During training given an image and ground truth SSD decides which of these default boxes should be treated as a background box and learns to classify those as background class and which should be matched to a ground truth object - The raw feature maps are directly used this aspect of using feature maps from different stages of Network to make predictions for default boxes of different scale earlier layers used for small scale default boxes and later layer outputs used for large scale default boxes this was the key contribution of SSD and it allows us to get higher accuracy while using a relatively lower resolution input than YOLO V1 and because of using lower resolution input we also see an increase in detection speed a aside from this the authors also use hard negative mining to select difficult negative examples during training and extensive data augmentation to make the detector robust _Default Boxes_: SSD relies on predictions from multiple feature maps, each operating at a different spatial resolution. The original SSD model uses modified VGG-16, but the backbone is not crucial; what matters is that it produces six feature maps with sizes: 38×38, 19×19, 10×10, 5×5, 3×3, 1×1. These multi-resolution maps allow SSD to detect objects of varying scales: high-resolution feature maps detect small objects, while low-resolution feature maps detect large objects - Each feature map is assigned a single scale value. SSD computes these scales using a simple linear interpolation between a minimum scale(s_min = 0.2) and a maximum scale(s_max = 0.9), across the total number of feature maps (m) The scale for feature map ( k ) is:

$$
s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(k - 1)
$$

- Each scale is combined with a set of aspect ratios

$$
a_r \in \left\{1,\; 2,\; \tfrac{1}{2},\; 3,\; \tfrac{1}{3}\right\}
$$

- For scale $s_k$ and aspect ratio $a_r$, the default box width and height are

$$
w = s_k \sqrt{a_r}, \qquad h = \frac{s_k}{\sqrt{a_r}}
$$

so the box area stays tied to the scale: $w \cdot h = s_k^2$

- For aspect ratio 1, one extra default box is added with scale

$$
s'_k = \sqrt{s_k \cdot s_{k+1}}
$$

- The center of the default boxes of grid cell $(i, j)$ of an $F \times F$ feature map is

$$
c_x = \frac{j + 0.5}{F}, \qquad c_y = \frac{i + 0.5}{F}
$$

- With $K$ default boxes per cell, an $H \times W$ feature map contributes

$$
\text{anchors} = H \times W \times K
$$

- so its classification prediction convolution outputs $K \times C$ channels and its regression convolution outputs $K \times 4$ channels:

$$
\text{channels}_{\text{cls}} = K \times C, \qquad \text{channels}_{\text{reg}} = K \times 4
$$
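A small sketch that generates center-format default boxes for one feature map using the scale and aspect-ratio formulas above; the function names and the choice of a 3×3 map are illustrative.

```python
import math

def ssd_scales(m=6, s_min=0.2, s_max=0.9):
    """Scale s_k for each of the m feature maps (linear interpolation)."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def default_boxes_for_map(F, s_k, s_k_next, aspect_ratios=(1, 2, 0.5, 3, 1/3)):
    """Default boxes (cx, cy, w, h) in normalized [0, 1] coordinates for an F x F map."""
    boxes = []
    for i in range(F):
        for j in range(F):
            cx, cy = (j + 0.5) / F, (i + 0.5) / F
            for a_r in aspect_ratios:
                boxes.append((cx, cy, s_k * math.sqrt(a_r), s_k / math.sqrt(a_r)))
            # extra box for aspect ratio 1 with scale sqrt(s_k * s_{k+1})
            s_extra = math.sqrt(s_k * s_k_next)
            boxes.append((cx, cy, s_extra, s_extra))
    return boxes

scales = ssd_scales()
print([round(s, 2) for s in scales])            # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
boxes = default_boxes_for_map(F=3, s_k=scales[4], s_k_next=scales[5])
print(len(boxes))                               # 3*3*6 = 54 default boxes on this map
```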

_Matching default boxes to ground truth_: 1. Compute the IoU between every default box ( d_i ) (N total) and every ground truth box ( g_j ) (M total):

$$
IoU_{i,j} = \frac{|d_i \cap g_j|}{|d_i \cup g_j|}
$$

- Store as an `N x M` IoU matrix 2. Best-Match Guarantee (One-to-One Assignments) - For each ground truth box ( g_j ): 1. Find the default box ( d_i ) with highest overlap: $$ i^* = \arg\max_i IoU_{i,j} $$ 2. Mark ( d_{i ^ * } ) as foreground 3. Assign its class label = class of ( g_j ) 4. Assign its regression target = transform ( d_{i ^ * } \to g_j ) This ensures every ground truth has at least one matched default box 3. Threshold-Based Matching (One-to-Many Matches) - For every default box ( d_i ): 1. If $$ \max_j IoU_{i,j} \geq T $$ then mark ( d_i ) as foreground 2. Match it to $$ j^* = \arg\max_j IoU_{i,j} $$ 3. Assign class = class of g_{j ^ * } 4. Assign regression target = transform d_{i} → g_{j ^ * } This step catches _all sufficiently overlapping anchors_ 4. Remaining Boxes → Background - All default boxes not marked as foreground become background:

$$
D_{\text{bg}} = D \setminus D_{\text{fg}}
$$

_Hard Negative Mining_: the background boxes vastly outnumber the foreground boxes, so SSD keeps only the hardest negatives
1. Training on all background boxes would swamp the loss with easy negatives
2. Sort the background boxes by their confidence (classification) loss, highest first:

$$
L_{\text{conf}}(d_i)
$$

3. With a fixed negative-to-positive ratio ( R ) (typically 3), the number of negatives to keep is

$$
K = R \times |D_{\text{fg}}|
$$

4. Select only the top-K background boxes: $$ D_{\text{neg}} = \text{TopK}(D_{\text{bg}}, K) $$ 5. Ignore all other background boxes during training This fixes the imbalance and stabilizes learning 6. Final Training Set Construction - The network is trained on: - Foreground boxes: - classification loss - bounding box regression loss - Selected hard negatives: - classification loss only - All other boxes contribute zero loss _SSD Loss_: Input: - Predicted class scores for all default boxes - Predicted box regression offsets - Ground truth box assignments from SSD matching - Set of foreground boxes (D_{fg}) - Set of hard-negative boxes (D_{neg}) Loss Caluclations: 1. Compute Localization (Regression) Targets - For each foreground default box (d_i) matched to ground truth (g_j): - Compute transformation targets exactly like Faster R-CNN:

$$
t_x = (g_x - d_x) / d_w, \quad t_y = (g_y - d_y) / d_h, \quad t_w = \log(g_w / d_w), \quad t_h = \log(g_h / d_h)
$$

2. Localization loss (Smooth L1, over the foreground boxes only):

$$
L_{loc} = \sum_{i \in D_{fg}} \sum_{m \in \{x, y, w, h\}} \text{SmoothL1}(p_i^m - t_i^m)
$$

$$
\text{SmoothL1}(z) =
\begin{cases}
0.5z^2 & |z| < 1 \\
|z| - 0.5 & \text{otherwise}
\end{cases}
$$

3. Confidence loss (cross-entropy over the foreground boxes and the selected hard negatives):

$$
L_{conf} = \sum_{i \in D_{fg}} \text{CE}(p_i, c_i) + \sum_{i \in D_{neg}} \text{CE}(p_i, \text{background})
$$

4. Total loss, normalized by the number of matched (foreground) boxes

$$
N = |D_{fg}|
$$

$$
L = \frac{1}{N} \left( L_{loc} + L_{conf} \right)
$$

which expands to

$$
L = \frac{1}{N} \left( \sum_{i \in D_{fg}} \sum_{m \in \{x,y,w,h\}} \text{SmoothL1}(p_i^m - t_i^m) + \sum_{i \in D_{fg} \cup D_{neg}} \text{CE}(p_i, c_i) \right)
$$
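A simplified sketch of this loss with hard negative mining in PyTorch; the tensor shapes, the label convention (0 = background), and the ratio R = 3 are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ssd_loss(cls_logits, loc_preds, labels, loc_targets, neg_pos_ratio=3):
    """Simplified SSD loss: labels[i] == 0 means background, > 0 means a class."""
    fg = labels > 0
    num_fg = int(fg.sum().clamp(min=1))

    # Localization: Smooth L1 on foreground boxes only
    loc_loss = F.smooth_l1_loss(loc_preds[fg], loc_targets[fg], reduction='sum')

    # Per-box confidence loss, used both for the loss and to rank negatives
    ce = F.cross_entropy(cls_logits, labels, reduction='none')

    # Hard negative mining: keep the top-K hardest background boxes (K = R * |D_fg|)
    neg_ce = ce.clone()
    neg_ce[fg] = 0.0
    k = min(neg_pos_ratio * num_fg, int((~fg).sum()))
    hard_neg_loss = neg_ce.topk(k).values.sum() if k > 0 else ce.new_tensor(0.0)

    conf_loss = ce[fg].sum() + hard_neg_loss
    return (loc_loss + conf_loss) / num_fg

# Dummy shapes: 100 default boxes, 21 classes (20 + background)
cls_logits = torch.randn(100, 21)
loc_preds = torch.randn(100, 4)
labels = torch.randint(0, 21, (100,))
loc_targets = torch.randn(100, 4)
print(ssd_loss(cls_logits, loc_preds, labels, loc_targets))
```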

_SSD Architecture_: ![[Pasted image 20251208194347.png]] 1. Base Network (VGG16 Backbone, truncated at conv5_3) - SSD uses VGG16 without fully-connected layers (fc → conv) - Reason: - FC layers make the model too large for dense prediction - Convolutional replacement preserves spatial layout → needed for detection - Output feature maps: 38×38 and 19×19, both with good spatial resolution - Purpose: detect small & medium objects using early & mid-depth feature maps 2. Multi-Scale Feature Maps (CNN pyramid) - SSD introduces extra convolutional layers after VGG: 10×10, 5×5, 3×3, 1×1 - Why add extra layers? - Each deeper layer has larger receptive fields, capturing larger objects - Architecture builds a feature pyramid by design, so SSD does not require FPN 3. Prediction Convolutions on Each Feature Map - Each selected feature map produces: - Classification scores - Regression offsets for default boxes - Using 3×3 conv filters sliding over the map - Why 3×3 conv for predictions? - Small, local window → efficient dense prediction - Performs the exact role of Region Proposal Networks (RPN) but in a single stage 4. Multi-Scale Default Boxes (Anchor Boxes) - Each feature map has: | Feature map size | Scale (s) | Aspect ratios | | ---------------- | --------- | ------------- | | 38×38 | 0.1 | 1, 2, 0.5 | | 19×19 | 0.2 | 1, 2, 0.5 | | 10×10 | 0.375 | 1, 2, 0.5 | | 5×5 | 0.55 | 1, 2, 0.5 | | 3×3 | 0.725 | 1, 2, 0.5 | | 1×1 | 0.9 | 1, 2, 0.5 | - Why these scales? - They tile the image with boxes covering small → large objects evenly - Spacing between 0.1 → 0.9 is linear, ensuring consistent receptive field progression - Why aspect ratios {1, 2, 0.5}? - Covers tall, wide, and square objects with minimal redundancy - Additional AR=1 box with scale √(s_k * s_{k+1}) improves coverage in-between scales 5. Why 38×38 Feature Map Exists? (Small Object Detection) - High-resolution feature map = large number of default boxes for small objects - Small objects do not appear clearly in deep layers, hence need early layers 6. Why 1×1 Feature Map Exists? (Very Large Objects) - 1×1 map corresponds to entire image receptive field - Designed to detect large, full-frame objects (cars close to camera, etc.) Total Default Boxes = 8732 SSD generates: - 38×38 → many small boxes - 19×19 → mid boxes - 10×10 → - 5×5 → - 3×3 → - 1×1 → largest boxes Total ≈ 8732 default boxes - Why so many? - Dense coverage without proposals - Entire detection is single-shot (no RPN, no second stage) - Code Implementation: [GitHub](https://github.com/explainingai-code/SSD-PyTorch) #### Approach 7: _YOLO V2_ _Betterments wrt YOLO V1_: 1. Batch Normalization in all convolutional layers - This brings more than 2% [[13. Object Detection#mean-average-precision-map|mAP]] improvement by improving the convergence 2. 
Fine Tuning the Classifier network on high resolution images - In YOLO V1 the backbone is usually trained on classification task where usually the image inputs are 224 x 224 - Later using the detection fine tuning we subject the pre trained layers of this model to inputs of 448 x 448 because a higher resolution input will allow us to extract minor details and be better at detecting small objects, but now during detection model need to adapt to this different resolution input, while trying to learn how to detect objects, to ensure that the network gets adapted into this higher resolution input prior to detection fine tuning, we take the network, train it on 224 x 224 inputs of imageNet and fine tune on imageNet itself but with 448 x 448 inputs for a few epochs, then once the network has adapted to this new resolution, we do out detection fine tuning as usual on 448 x 448 inputs, with this it gives us `mAP` close to 70% 3. Convolutional and anchor boxes similar to RPN of Faster RCNN - Our Final layer in YOLO V1 is a FC Layer, we now make the whole YOLO Model Convolution and now we also adopt the Anchor Box approach of the Faster R-CNN for making box predictions - Here to Predict anchor boxes we do following things - Remove FC Layers and replace them with convolution layers and final prediction convolution layers will be having `5B + C` output channels - With 448 x 448 image input, our final output will be of `7 x 7` and now we remove the pooling layer to have the final output to be `14 x 14` - We also change the `448 x 448` image to `416 x 416` image feature map to have odd number of grid cells along width and height - Predict localization offsets, class and objectness scores for every anchor box _Clustering for Prior Boxes in YOLO V2_: In faster R-CNN for anchor boxes in [[13. Object Detection#rpnregion-proposal-network|RPN]] and we didn't have any reason or explanation why 9 anchor boxes are chosen In YOLO V2, we use _K Means Clustering_ to pick prior boxes, here are the steps to do that - We plot all the ground truth boxes in VOC dataset so this graph has width around X-axis and height along Y-axis, normalizing them to have width between 0 and 1, meaning the ground truth box have the same width or height as the input image - For distance between cluster of K-means, we use `distance(box, centroid) = 1 - IOU(box, centroid)` - We perform this for a bunch of K-Values and for each of this compute average IOU between cluster centers after K means we get the ground truth boxes. 
And to compute the Average IOU, we take a ground truth box and compute IOU between the ground truth box and the closest centroid and after computing it for all the values we get the average IOU for that particular K - Also we need to find the best K value of the set of K we get, because - If it's higher K, we have higher average IOU, but also end up with the more complex model because we have more predictions per cell, this means more work and we have more predictions per cell so more time for the network to compute offset from these K prior NMS also becomes more computationally expensive - And if we consider, small values of the K, which do not even effectively represent the diversity of the boxes in the dataset - So we take a middle range K value which suits us the best wrt use case - So during training we pick the box with highest IOU with the ground truth that becomes the responsible box and YOLO V2 is trained to generate transformation parameters that transform this responsible prior box with target ground truth - In YOLO V2 the width and height transformations will be same wrt YOLO V1 but we have _Sigmoid Activation_ to constraint the output from 0 to 1 ![image.png](https://pimg.mohammadsadiq4950.workers.dev/gist/6da8b85eb589c05f3bf5a9d6a52c7a53/obsidian-upload-1765706512430.png) _Passthrough Layer in YOLO V2_: Here we use fine grained features for better localization on smaller objects for this we add - `13 x 13` Prediction Layer(Say C1) for prior boxes for large scale - And another `26 x 26` Prediction Layer(Say C2) for prior boxes of smaller scale - Here we use rearrange to move from `26 x 26` of C2 to `13 x 13` of C1, now as they're of same size we concatenate them to `(4C2 + C1) x 13 x 13` and this concatenate layer has lower as well as higher rest features to make better predictions for small scale objects ![image.png](https://pimg.mohammadsadiq4950.workers.dev/gist/b80654ad8098b66c87b68c1754ec6a91/obsidian-upload-1765707990735.png) _Multi Scale Training_: In YOLO V1 we resize the images to 448 x 448 at last, but in YOLO V2 we intake images of different resolutions, like a resolution in a batch and train them, and these different sizes of images gives different results - lower resolution images gives less accurate detection but are fast - high resolution images gives high accurate detection but are slow _YOLO V2 Architecture_: Wrt YOLO V1, it's just switching from Darknet-19 as the base network ![The-network-framework-of-our-model-Y-PD.png](https://pimg.mohammadsadiq4950.workers.dev/gist/468ae5bfd76293705c4cc8b4bd1b0598/obsidian-upload-1765708939280.png) _YOLO 9000_: Most of the detection datasets, cater to a small number of object categories, The main reason for this is that detection annotation is costly but the classification annotation on the other hand is relatively cheaper (that's why we have few images but wide variety of classes) To Solve this. Authors of YOLO V2 used joint training and detection using YOLO V2 we're able to detect more than 9000 categories, the idea here is to use large available classification of data to strengthen our object detector #Q How to train on both simultaneously when only 44 ImageNet classes appear in COCO? 
#A Inspired by WordNet structure, YOLO9000 builds a hierarchical tree of concepts ```text Entity | |———————————————|———————————————| Dog Cat Vehicle | | | Terrier Siamese Car | | | Norfolk Airedale Bengal Persian Ferrari ``` Key insight: Convert flat 1000 ImageNet classes into a WordTree with 1369 nodes (1000 leaf nodes + 369 parent nodes) Softmax at Hierarchical Levels: Instead of one global softmax, predict conditional probabilities at each node. At the "terrier" node, predict: - Pr(Norfolk terrier | terrier) - Pr(Airedale terrier | terrier) - Pr(Lakeland terrier | terrier) To get absolute probability: `Pr(Norfolk terrier) = Pr(terrier ∣ dog) × Pr(dog ∣ entity)` Joint Training Strategy: Using only 3 anchor boxes (instead of 5) to limit output size. For detection images (COCO): - Backpropagate full YOLOv2 loss: bounding box loss + objectness loss + classification loss For classification images (ImageNet): - Backpropagate only classification loss at or above the corresponding hierarchy level - Bounding box predictions are not updated This clever scheme allows the model to learn detection features from limited labeled data and generalize to 9000+ unseen classes using classification data _Loss Function_: YOLO9000 uses a composite loss function:

$$
L = \lambda_{coord} \, L_{bbox} + \lambda_{obj} \, L_{objectness} + \lambda_{class} \, L_{classification}
$$

Bounding Box Loss (Coordinates):

$$
L_{bbox} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \Big[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \Big]
$$

Note: The square root of width and height is used to penalize large and small bounding boxes equally

Objectness Loss:

$$
L_{obj} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} (C_i - \hat{C}_i)^2
$$

Classification Loss:

$$
L_{class} = \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \big(p_i(c) - \hat{p}_i(c)\big)^2
$$

where

$$
\mathbb{1}_{ij}^{obj} =
\begin{cases}
1 & \text{if an object exists in cell } i \text{ for box } j \\
0 & \text{otherwise}
\end{cases}
$$

#### Approach 8: _YOLO V3_

This is just a small set of incremental improvements over YOLO V2

- The most important change is the move from Darknet-19 to Darknet-53 as the backbone
- Rather than doing predictions on the final feature map only, we make predictions at three different stages. In this backbone:
  - We have more layers
  - We have residual connections in the network
  - And instead of max pooling we have strided convolutions, to allow the network to downsample in a manner that is best suited for the task ![image.png](https://pimg.mohammadsadiq4950.workers.dev/gist/f20412db0d9f6817c5e9ef80cb134900/obsidian-upload-1765713027611.png)
- In YOLO V3 we ignore anchors if they overlap (> threshold) with a ground truth but are not the best anchor for that ground truth
- The objectness target during training is either 0 or 1
- For classification, the authors model the problem as multi-label classification: rather than a softmax, they predict a classification score for each class independently using a sigmoid, trained with BCE loss

#### Approach 9: _YOLO V4_

By the time of YOLO V4, most object detectors used an architecture that could be divided into three parts:

1. _Backbone_, which is often pre-trained on a large classification dataset to act as a strong feature extractor for the detection task
2. _Neck_, which refines the features extracted from the backbone and commonly also aggregates features extracted at various stages of the backbone; the main purpose of this aggregation is to enhance semantic and localization information across all scales
3. _Head_, the last part, which takes the features from the neck and finally predicts the bounding boxes and class information

The authors of YOLO V4 also experimented with a bunch of features and techniques to reach the combination that ends up with the best detector, which they've grouped into:

1. _Bag of Freebies_: Methods that do not increase inference cost but may increase training cost and end up producing a more accurate detector. For example, data augmentation, better losses, ...
2. _Bag of Specials_: Methods that increase the inference cost by a small margin but improve the accuracy of the detector significantly. For example, adding attention, or feature aggregation in the neck (which YOLO V3 does using FPN), ...

Here are the strategies derived from the above groups that are used during the image pre-training stage to boost classification accuracy and end up with a stronger feature extractor

- Bag of Freebies Strategies
  1.
_CutMix_: It's a `data augmentation` method - During classification training, our data points will have ground truth class label as "one" and rest all class labels will be "zero" - But in Cutmix, we mix patches of two images, by overlaying a patch of the image onto another - For, Implementing cut mix, we decide what would be our overlay regions width and height and the starting point of our overlaid region then for pixels belonging to that region we simply overlay the pixels of a different image onto this one the ground truth label in Cut mix is now mixed proportionally according to the area of patches - Because this model is trained on a mixture of images the classifier is forced to attend on non-discriminative parts of objects like legs of dog rather than face and hence ends up being more generalizable furthermore it gives the model the ability to identify objects from their partial views which will definitely benefit in cases where the object itself is occluded - This augmentation strategy ends up with a better classifier where we get close to 2% Improvement on classification accuracy and when this classified network is used as a pre-trained model for detection fine-tuning stage we end up with performance gains in detection as well - ![image.png](https://pimg.mohammadsadiq4950.workers.dev/gist/9df5aba975135a2f86e1f37d5842392c/obsidian-upload-1765891007101.png) 2. _Mosaic_: Again a `data augmentation` strategy which is very similar to cut mix - It combines four different images into a single training image - For classification: - A random point splits the input image into four regions - Each region is filled with a patch from a different image - The final label is a weighted sum of the original labels, proportional to each patch’s area - For detection: - Four images are placed on a canvas twice the input width and height - A random center point determines placement, and the final image is cropped to the required size - Bounding boxes are adjusted accordingly - This helps to Improve robustness to occlusion and Helps the model learn objects in varied contexts, for both classification and detection tasks - ![image.png](https://pimg.mohammadsadiq4950.workers.dev/gist/8f21f82c785b8f5e194e9a23edac1768/obsidian-upload-1765891554396.png) 3. _DropBlock_: It's a regularization technique designed specifically for convolutional neural networks, addressing limitations of standard Dropout on image feature maps - In CNNs, spatially adjacent activations are highly correlated - Standard Dropout removes random individual activations, but semantic information can still be inferred from neighbors - Spatial Dropout improves this by dropping entire feature maps, but it is often too aggressive - DropBlock removes contiguous regions of a feature map, forcing the network to learn robust, distributed representations - Instead of dropping isolated activations, DropBlock zeros out continuous square regions of size `block_size × block_size`, effectively removing meaningful semantic parts (e.g., an object’s head) - `Algorithm`: 1. Initial Mask Sampling - Sample a smaller binary mask using Bernoulli distribution with probability `γ` - Mask contains 1s for regions to drop, 0s otherwise 2. Why Smaller Mask? - Ensures each sampled 1 can expand fully inside the feature map - Prevents edge bias during block expansion 3. Zero Padding - Pad the mask to match feature map dimensions 4. 
Block Expansion - Apply Max Pooling with: - Kernel size = `block_size` - Padding = `block_size // 2` - This expands each sampled 1 into a full block 5. Mask Inversion - Flip the mask so: `0 → dropped`, `1 → kept` 6. Apply to Feature Map - Multiply the flipped mask with the feature map - Scale remaining activations (as in standard Dropout) 7. Channel-wise Masks - Each channel uses an independent mask - Shared masks across channels perform worse

$$
\gamma = \frac{1 - \text{keep\_prob}}{\text{block\_size}^2} \cdot \frac{\text{feat\_size}^2}{(\text{feat\_size} - \text{block\_size} + 1)^2}
$$
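A minimal DropBlock sketch following the algorithm above (seed mask → max-pool expansion → inversion → rescaling); the function name, the per-channel masking, and the padding details are my assumptions.

```python
import torch
import torch.nn.functional as F

def drop_block(x, block_size=3, keep_prob=0.9, training=True):
    """Minimal DropBlock sketch for a feature map x of shape (N, C, H, W)."""
    if not training or keep_prob == 1.0:
        return x
    N, C, H, W = x.shape
    # Drop probability per seed position (gamma), as in the formula above
    gamma = ((1 - keep_prob) / block_size ** 2) * (H * W) / ((H - block_size + 1) * (W - block_size + 1))
    # 1. Sample the seed mask only where a full block fits, one mask per channel
    valid_h, valid_w = H - block_size + 1, W - block_size + 1
    seeds = (torch.rand(N, C, valid_h, valid_w, device=x.device) < gamma).float()
    # 2. Zero-pad the seed mask back to H x W
    pad = block_size // 2
    seeds = F.pad(seeds, (pad, block_size - 1 - pad, pad, block_size - 1 - pad))
    # 3. Expand each seed into a full block with max pooling
    block_mask = F.max_pool2d(seeds, kernel_size=block_size, stride=1, padding=block_size // 2)
    block_mask = block_mask[:, :, :H, :W]
    # 4. Invert (1 = keep, 0 = dropped) and rescale like standard dropout
    keep_mask = 1.0 - block_mask
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1)

out = drop_block(torch.randn(2, 8, 16, 16), block_size=3, keep_prob=0.9)
```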

L = -\sum_i q_i \log p_i

p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}

y_{\text{smooth}} = y_{\text{true}} (1 - \text{LS}) + \frac{\text{LS}}{K}
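The label-smoothing formula can be checked with a tiny example (the class count K and smoothing factor LS below are arbitrary):

```python
import numpy as np

# Label smoothing for a 5-class one-hot target with LS = 0.1
K, LS = 5, 0.1
y_true = np.array([0, 0, 1, 0, 0], dtype=float)
y_smooth = y_true * (1 - LS) + LS / K
print(y_smooth)   # [0.02 0.02 0.92 0.02 0.02]
```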

\text{Mish}(x) = x \cdot \tanh(\text{softplus}(x))

\text{softplus}(x) = \ln(1 + e^x)
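Mish is small enough to sketch directly from the two formulas above:

```python
import torch
import torch.nn.functional as F

def mish(x):
    # Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
    return x * torch.tanh(F.softplus(x))

print(mish(torch.linspace(-3, 3, 7)))
```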

X \in \mathbb{R}^{H \times W \times C}

X_1, X_2 = \text{Split}(X), \quad C_1 = C_2 = \frac{C}{2}

Y = \text{ResBlock}(X_1)

Y' = \text{Conv}_{1\times1}(Y)

Z = \text{Concat}(Y', X_2)

\text{Output} = \text{Conv}_{1\times1}(Z)

$$
Y = \sum_{i=1}^{N} w_i \cdot X_i
$$

- With unbounded learnable weights $w_i$ this simple weighted fusion can be unstable, so the weights can instead be normalized with a softmax:

$$
\alpha_i = \frac{e^{w_i}}{\sum_{j=1}^{N} e^{w_j}}, \qquad Y = \sum_{i=1}^{N} \alpha_i X_i
$$

- A cheaper alternative (fast normalized fusion) clamps the weights with ReLU and normalizes them directly:

$$
\hat{w}_i = \text{ReLU}(w_i), \qquad Y = \frac{\sum_{i=1}^{N} \hat{w}_i X_i}{\sum_{i=1}^{N} \hat{w}_i + \epsilon}
$$
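A tiny module sketch of the fast normalized fusion variant above; the class name, epsilon value, and tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """Fuse N feature maps with learnable non-negative weights (fast normalized fusion)."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.weights)          # keep weights non-negative
        w = w / (w.sum() + self.eps)          # normalize without a softmax
        return sum(wi * x for wi, x in zip(w, inputs))

fuse = FastNormalizedFusion(num_inputs=2)
out = fuse([torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)])
print(out.shape)
```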

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i

\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

y_i = \gamma \hat{x}_i + \beta

\alpha_t = \frac{1}{t+1}

\mu_t^{agg} = \alpha_t \mu_t + (1 - \alpha_t)\mu_{t-1}^{agg}

\mu_0^{agg} = \mu_0

\sigma_t^{2,agg} = \alpha_t \sigma_t^2 + (1 - \alpha_t)\sigma_{t-1}^{2,agg}

\hat{x} = \frac{x - \mu_t^{agg}}{\sqrt{\sigma_t^{2,agg} + \epsilon}}

\text{IoU} = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|}

\mathcal{L}_{IoU} = 1 - \text{IoU}

\text{GIoU} = \text{IoU} - \frac{|C - (B \cup B^{gt})|}{|C|}

\mathcal{L}_{GIoU} = 1 - \text{GIoU}

C = \text{smallest enclosing box of } B \text{ and } B^{gt}

\text{DIoU} = \text{IoU} - \frac{\rho^2(b, b^{gt})}{c^2}

\mathcal{L}_{DIoU} = 1 - \text{DIoU}

\rho(b, b^{gt}) = \text{Euclidean distance between box centers}

c = \text{diagonal length of enclosing box } C

\text{CIoU} = \text{IoU} - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v

\mathcal{L}_{CIoU} = 1 - \text{CIoU}

v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)

\alpha = \frac{v}{(1 - \text{IoU}) + v}
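A hedged sketch of the CIoU loss assembled from the formulas above, for boxes in (x1, y1, x2, y2) format; the epsilon handling is my choice.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # Intersection and union -> IoU
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance (rho^2) and squared diagonal of the enclosing box (c^2)
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    ct = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((cp - ct) ** 2).sum(dim=1)
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + eps

    # Aspect-ratio consistency term v and its weight alpha
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - (iou - rho2 / c2 - alpha * v)

pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
gt = torch.tensor([[12.0, 8.0, 48.0, 62.0]])
print(ciou_loss(pred, gt))
```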

$$
\mathcal{L}_{adv} = \mathcal{L}(f_\theta(x), y_{\text{fake}})
$$

$$
x' = x + \epsilon \cdot \text{sign}\left( \nabla_x \mathcal{L}_{adv} \right)
$$

$$
\mathcal{L}_{sat} = \mathcal{L}(f_\theta(x'), y_{\text{true}})
$$

$$
\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}_{sat}
$$

x = C_x + \sigma(t_x), y = C_y + \sigma(t_y)

- `C_x, C_y` → top-left coordinates of the grid cell - `t_x, t_y` → raw network outputs - `σ(.)` → sigmoid function - Output range of `σ(.)` is (0, 1) - This ensures the box center lies inside the grid cell #Q Why Grid Sensitivity Occurs ? #A Boundary Case Problem - If the ground-truth box center lies on a grid boundary, e.g.: - exactly at `C_x` - or exactly at `C_x + 1` - Then the model needs: $$ \sigma(t_x) = 0 \quad \text{or} \quad \sigma(t_x) = 1 $$ - Sigmoid Limitation: Sigmoid never outputs exact 0 or 1 - `σ(-4) ≈ 0.018` - `σ(-10) ≈ 0.000045` - `σ(+4) ≈ 0.982` - `σ(+10) ≈ 0.99995` - To approach 0 or 1, the network must predict very large |tₓ| values. - This causes: - Hard optimization near grid boundaries - Unstable gradients - Poor localization for boundary objects - This issue is called grid sensitivity - YOLOv4 Solution: Scaled Offset Prediction - YOLOv4 modifies the center prediction formula: $$ x = C_x + \alpha \cdot \sigma(t_x) - \frac{\alpha - 1}{2}, y = C_y + \alpha \cdot \sigma(t_y) - \frac{\alpha - 1}{2} $$ Where: `α > 1` (scaling factor) - Summary: - Problem: Sigmoid-bounded offsets in YOLOv3 make boundary box prediction difficult - Cause: Sigmoid cannot output exact 0 or 1 - Solution (YOLOv4): Scale sigmoid output with `α > 1` - Benefit: - Reduced grid sensitivity - Easier optimization - Better localization near grid boundaries 5. _Genetic Algorithm_: YOLOv4 uses a genetic algorithm (GA) to automatically find optimal hyperparameters, instead of relying on manual tuning. Examples of tuned hyperparameters: Learning rate, Momentum, Weight decay, Data augmentation probabilities (mosaic, hue, saturation, flip, etc.), Loss-related scaling factors 1. Initialization - Start with an initial population of hyperparameter sets - These can be: - Default values - Slight random variations around known good values 2. Training & Fitness Evaluation - For each hyperparameter set: 1. Train the YOLO model using those hyperparameters 2. Evaluate performance on validation data 3. Compute a fitness score 3. Fitness Score Definition - Fitness reflects how good a hyperparameter set is - It is computed as a weighted sum of multiple metrics, such as:

\text{Fitness} = w_1 \cdot \text{Precision} + w_2 \cdot \text{Recall} + w_3 \cdot \text{mAP}

F_{\text{SPP}} = \text{Concat}\left( F,\; \text{MaxPool}_{5\times5}(F),\; \text{MaxPool}_{9\times9}(F),\; \text{MaxPool}_{13\times13}(F) \right)

- Spatial size remains the same - Channel depth increases - Each pooled feature captures context at a different scale - Intuition - Small kernels → local details - Large kernels → global context - Concatenation allows the network to choose what context matters 2. _Spatial Attention Module_: - SAM is inspired by CBAM (Convolutional Block Attention Module) - CBAM uses two attention mechanisms: Channel Attention, Spatial Attention - YOLOv4 uses a simplified version focusing on spatial (point-wise) attention only - `Channel Attention` (What to focus on) #Q Which feature channels are most important? #A Process: 1. Input feature map $$ F \in \mathbb{R}^{C \times H \times W} ) $$ 2. Apply: - Global Average Pooling → ( C x1 x 1 ) - Global Max Pooling → ( C x 1 x 1 ) 3. Pass both through a shared MLP 4. Add outputs and apply sigmoid 5. Multiply with input feature map

M_c(F) = \sigma(\text{MLP}(\text{AvgPool}(F)) + \text{MLP}(\text{MaxPool}(F)))

- Output: C x H x W - `Spatial Attention` (Where to focus) #Q Where is the informative region in the feature map? #A Process: 1. Apply: - Channel-wise Average Pooling → ( 1 x H x W ) - Channel-wise Max Pooling → ( 1 x H x W ) 2. Concatenate → ( 2 x H x W ) 3. Apply 7×7 convolution 4. Apply sigmoid 5. Multiply with input feature map

M_s(F) = \sigma(\text{Conv}_{7\times7}([\text{AvgPool}_c(F), \text{MaxPool}_c(F)]))

- Output: ( C x H x W ) - `SAM in YOLOv4` (Simplified Attention) - YOLOv4 does NOT use full CBAM - Instead, it introduces a lightweight Spatial Attention Module (SAM) - SAM Operation: - Given input feature map $$ F \in \mathbb{R}^{C \times H \times W} $$ 1. Apply 1×1 convolution with: Number of filters = ( C ) 2. Apply sigmoid activation 3. Multiply element-wise with original input

M_{\text{SAM}}(F) = \sigma(\text{Conv}_{1\times1}(F))

F' = F \odot M_{\text{SAM}}(F)
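A minimal sketch of this simplified SAM (1×1 convolution → sigmoid → element-wise gate); the module name and channel count are illustrative.

```python
import torch
import torch.nn as nn

class SAM(nn.Module):
    """Simplified spatial attention (YOLOv4-style): 1x1 conv -> sigmoid -> elementwise gate."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        attention = torch.sigmoid(self.conv(x))   # M_SAM(F)
        return x * attention                      # F' = F ⊙ M_SAM(F)

sam = SAM(channels=64)
print(sam(torch.randn(1, 64, 13, 13)).shape)
```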

\text{DistancePenalty}(b, M) = \frac{\rho^2(\text{center}_b, \text{center}_M)}{c^2}

\text{DIoU}(b, M) = \text{IoU}(b, M) - \frac{\rho^2(\text{center}_b, \text{center}_M)}{c^2}

\text{DIoU}(b, M) > \tau

- This means: - High overlap alone is not enough - Large center distance reduces suppression score - Here we take a key assumption that, boxes with large center distance likely correspond to different objects, even if they overlap. - ![image.png](https://pimg.mohammadsadiq4950.workers.dev/gist/c041f13a725740112ebe160fcf13b3d7/obsidian-upload-1765910821653.png) ##### YOLO V4 Architecture ![image.png](https://pimg.mohammadsadiq4950.workers.dev/gist/b2201b8f183e7186bb3a499a64a28988/obsidian-upload-1765913111183.png) #### Approach 10: _YOLO V5_ ##### YOLO V5 Architecture ![image.png](https://pimg.mohammadsadiq4950.workers.dev/gist/4511eb5f21d49c3486d9d92272e1f84a/obsidian-upload-1766811848488.png) ##### Matching Targets and Prediction _Predictions_: Like YOLOv3/v4, for each grid cell and each anchor, YOLOv5 predicts: `(5 + C) values` - 4 box parameters → bounding box transformation - 1 objectness score → probability that an object exists - C class probabilities _Bounding Box Parameterization_: 1. Center coordinates (x, y) - YOLOv5 uses the same modified formulation as YOLOv4 to reduce grid sensitivity

$$
\begin{aligned}
b_x &= (2 \cdot \sigma(t_x) - 0.5) + c_x \\
b_y &= (2 \cdot \sigma(t_y) - 0.5) + c_y
\end{aligned}
$$

- σ: sigmoid, c_x, c_y: top-left coordinates of the grid cell, Range of offsets: (-0.5, 1.5) - This allows prediction of box centers near or across grid boundaries, without requiring sigmoid outputs to be exactly 0 or 1 2. Width and Height (Major change in YOLOv5) - Unlike YOLOv3/v4 (where width & height were unbounded), YOLOv5 bounds them using sigmoid

$$
\begin{aligned}
b_w &= a_w \cdot (2 \cdot \sigma(t_w))^2 \\
b_h &= a_h \cdot (2 \cdot \sigma(t_h))^2
\end{aligned}
$$

\max\left(
\frac{w_{gt}}{w_a}, \frac{w_a}{w_{gt}},
\frac{h_{gt}}{h_a}, \frac{h_a}{h_{gt}}
\right) < t

- (t) = anchor threshold (default 4) - An anchor is matched if its dimensions are within (1/4, 4) of the GT box Implications of Ratio-based Matching 1. IoU is irrelevant for matching → Anchor can be matched even if it doesn’t overlap the GT box 2. Multiple anchors can match the same ground truth 3. One anchor can be matched to multiple ground truths 4. Small objects naturally match small anchors on high-resolution feature maps This improves recall, especially for small objects ##### Grid Cell Assignment in YOLOv5 - Traditional YOLO: Only the grid cell containing the GT center is responsible - YOLOv5: Multiple Responsible Grid Cells: Because center offsets can now lie in (-0.5, 1.5), adjacent grid cells can also predict the same GT box - How it works: 1. Find the main grid cell containing the GT center 2. Divide that cell into four quadrants 3. Depending on where the GT center lies: Include up to 2 adjacent grid cells 4. Grid cells are included only if they exist (not at borders) - This results in: - 1 to 3 grid cells responsible for the same GT box - Each with multiple valid anchors - Improves learning near grid boundaries and stabilizes training _Summary_: For each ground truth box: 1. Select anchors that satisfy width–height ratio constraint 2. Find center grid cell 3. Add adjacent grid cells based on center location 4. For each (grid cell, anchor) pair: - Assign the same ground truth box - Train box regression, objectness, and class prediction ##### YOLO V5 Loss YOLOv5 uses a multi-part loss computed over predictions from multiple detection layers. The total loss is a combination of: 1. _Localization (Bounding Box) Loss_: - YOLOv5 uses Complete IoU (CIoU) loss for localization - Predicted box formulation: - For a grid cell at location (c_x, c_y) and anchor (a_w, a_h): - Center coordinates

$$
\begin{aligned}
b_x &= (2 \cdot \sigma(t_x) - 0.5) + c_x \\
b_y &= (2 \cdot \sigma(t_y) - 0.5) + c_y
\end{aligned}
$$

- Width and height:

$$
\begin{aligned}
b_w &= a_w \cdot (2 \cdot \sigma(t_w))^2 \\
b_h &= a_h \cdot (2 \cdot \sigma(t_h))^2
\end{aligned}
$$

$$
\mathcal{L}_{\text{box}} = 1 - \text{CIoU}(B_{pred}, B_{gt})
$$

2. _Objectness Loss_: the objectness target is not binary; for a matched (responsible) prediction it is the IoU of the predicted box with its ground truth, and 0 otherwise

$$
y_{obj} =
\begin{cases}
\text{IoU}(B_{pred}, B_{gt}) & \text{for matched predictions} \\
0 & \text{otherwise}
\end{cases}
$$

$$
\mathcal{L}_{obj} = \text{BCE}(\sigma(p_{obj}), y_{obj})
$$

3. _Classification Loss_: one-vs-all with BCE per class

$$
y_c =
\begin{cases}
1 & \text{for the ground truth class} \\
0 & \text{for all other classes}
\end{cases}
$$

$$
\mathcal{L}_{cls} = \sum_{c=1}^{C} \text{BCE}(\sigma(p_c), y_c)
$$

4. _Total Loss_:

$$
\mathcal{L}_{total} = \lambda_{box} \mathcal{L}_{box}^{total} + \lambda_{obj} \mathcal{L}_{obj}^{total} + \lambda_{cls} \mathcal{L}_{cls}^{total}
$$
- lambda_{box}, lambda_{obj}, lambda_{cls} are predefined hyperparameters - The final loss is scaled by batch size (as done in the YOLOv5 repo) | Aspect | YOLOv3/v4 | YOLOv5 | | ------------------- | ------------------ | ---------------- | | Box loss | MSE / IoU variants | CIoU | | Width/height | Exponential | Bounded sigmoid² | | Objectness target | Binary (0/1) | IoU-based | | Anchor matching | IoU-based | Ratio-based | | Grid responsibility | Single cell | Multiple cells | ##### Augmentations in YOLOv5 YOLOv5 uses a rich set of augmentations defined in the COCO hyperparameter file. Most augmentations are similar to earlier YOLO versions, with a few notable additions 1. Copy-Paste Augmentation - Object segments and labels are copied from one image - They are pasted onto another image - Generates new training samples with more object diversity - Helps improve robustness, especially for rare objects 2. MixUp Augmentation - Two random images are linearly combined - Corresponding labels are also linearly combined - Encourages smoother decision boundaries - Acts as a strong regularizer 3. Mosaic Augmentation - Combines 4 or sometimes 9 images into a single image - Improves: - Multi-scale object detection - Context understanding - Particularly useful for small objects - Same idea as YOLOv4 Mosaic ##### Hyperparameter Evolution (Genetic Algorithm) YOLOv5 does automatic hyperparameter tuning using a genetic evolution algorithm _Process_: 1. Start with an initial estimate of hyperparameters 2. Train the model and compute a fitness score: Typically mAP 3. Mutate hyperparameters across generations _Key Parameters_: - Mutation probability - How often a parameter is modified - Mutation variance - How much the parameter is modified - This process is repeated for many generations to find optimal values ##### AutoAnchor in YOLOv5 Before training, YOLOv5 checks whether the provided anchors are suitable for the dataset 1. Anchor Checking - Extracts width and height of all ground-truth boxes - Randomly scales boxes to simulate data diversity 2. Anchor Fitness Metrics 1. Anchors Above Threshold - Average number of anchors per target - That satisfy width–height ratio constraints 2. Best Possible Recall (BPR) - Fraction of ground-truth boxes - For which at least one anchor satisfies the ratio constraint - If, Best Possible Recall > 0.98 → Anchors are acceptable ##### Anchor Evolution If anchors are poor: 1. Run K-Means to get initial anchors 2. Apply genetic evolution over ~1000 generations 3. Fitness score is based on anchor-to-target width - height ratio #### Note: ##### IOU (Intersection Over Union) It's used in evaluating bounding box localization accuracy in object detection, measuring the overlap between predicted and ground truth boxes

\text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}}

The value ranges from 0 (no overlap) to 1 (perfect match). In production, a perfect IOU of 1 is extremely rare; values above 0.5 are typically considered acceptable detections

```python
def get_iou(det, gt):
    det_x1, det_y1, det_x2, det_y2 = det
    gt_x1, gt_y1, gt_x2, gt_y2 = gt

    # Coordinates of the intersection rectangle
    x_left = max(det_x1, gt_x1)
    y_top = max(det_y1, gt_y1)
    x_right = min(det_x2, gt_x2)
    y_bottom = min(det_y2, gt_y2)

    # No overlap
    if x_right < x_left or y_bottom < y_top:
        return 0.0

    area_intersection = (x_right - x_left) * (y_bottom - y_top)
    det_area = (det_x2 - det_x1) * (det_y2 - det_y1)
    gt_area = (gt_x2 - gt_x1) * (gt_y2 - gt_y1)
    # 1E-6 guards against division by zero
    area_union = float(det_area + gt_area - area_intersection + 1E-6)
    iou = area_intersection / area_union
    return iou
```

##### NMS (Non-Maximum Suppression)
It eliminates redundant overlapping bounding boxes from object detectors, ensuring each object is detected exactly once
The algorithm is as follows:
1. Sort all predicted boxes by confidence score (descending)
2. Select the highest-confidence box as a detection
3. Calculate IOU between this box and all remaining boxes
4. Suppress (discard) boxes where IOU > threshold (typically 0.5)
5. Repeat until all boxes are processed

```python
def nms(dets, nms_threshold=0.5):
    # dets: [ [x1, y1, x2, y2, score], ... ]
    # Sort detections by confidence score
    sorted_dets = sorted(dets, key=lambda k: -k[-1])

    # List of detections that we will return
    keep_dets = []
    while len(sorted_dets) > 0:
        keep_dets.append(sorted_dets[0])
        # Remove the highest-confidence box
        # and remove all boxes that have high overlap with it
        sorted_dets = [
            box for box in sorted_dets[1:]
            if get_iou(sorted_dets[0][:-1], box[:-1]) < nms_threshold
        ]
    return keep_dets
```

The NMS threshold is the primary hyperparameter controlling suppression aggressiveness. Lower thresholds (0.3-0.4) suppress more boxes, reducing false positives but risking missed detections in crowded scenes. Higher thresholds (0.6-0.7) allow closer boxes, useful for dense object scenarios

##### Recall and Precision
![Pasted image 20251115140002.png](https://pimg.mohammadsadiq4950.workers.dev/gist/c6042e68438b6b01103c244d9b4b56d0/obsidian-upload-1766338233804.png)
Given a set of predictions by a model for the "person" class and the ground truth boxes for the same class, plotting the two in a `Venn diagram` gives three regions:
1. `True Positive` - overlap between predictions and ground truth; here the model prediction is correct
2. `False Positive` - regions the model predicted that are not in the ground truth
3. `False Negative` - ground truth boxes that the model did not predict; the area apart from the two above
Ideally we want both Precision and Recall to be high
- If we have high precision and low recall, the predicted objects are correct but we miss a lot of valid objects
- If we have low precision and high recall, we get a lot of results, but also a lot of false positives

##### Average Precision (AP)
It is the primary metric for evaluating object detection models, measuring both localization accuracy (via IOU) and classification confidence across all recall levels
To understand this we need the _Precision-Recall Curve_: a graph that plots _Recall on the X-Axis_ and _Precision on the Y-Axis_ for different confidence thresholds
Average Precision is the area under the Precision-Recall curve. For each class we follow this algorithm:
1. Sort detections by confidence score (predicted by the model)
2. Calculate precision and recall at each confidence threshold
3. Plot the PR curve
4. Compute AP as the area under this curve (interpolated or using 11-point sampling)
The result represents a weighted mean of precisions at each threshold, using the recall increase as the weight. Values range from 0 to 1, where AP = 1.0 indicates perfect detection

##### Mean Average Precision (mAP)
mAP is simply the arithmetic mean of AP scores across all object classes. For example, if your model achieves AP of 0.827 for cars, 0.679 for motorcycles, and 0.982 for bicycles, mAP = (0.827 + 0.679 + 0.982) / 3 = 0.829
There are two kinds of notation to represent mAP:
- `mAP@0.5`: Uses an IOU threshold of 0.5 to determine true positives, the PASCAL VOC standard
- `mAP@0.5:0.05:0.95`: COCO metric averaging mAP across IOU thresholds from 0.5 to 0.95 in 0.05 increments. This is significantly harder and more discriminative, and penalizes loose bounding boxes

Here is an example of calculating mAP for the car and person classes
![[Pasted image 20251115143039.png]]
- `GT Boxes`: 5 ground truth objects - 1 person in red, 4 people in the group (with checkmarks), plus 2 cars
- `Prediction Boxes`: Model detections with confidence scores: 0.9, 0.8, 0.78, 0.92 (cars), 0.72, 0.83, 0.91, 0.77 (people), and 0.85, 0.65 (cars)

The table shows `cumulative metrics` as you decrease the confidence threshold from 0.91 → 0.72 (sorted):
1. `Pred/GT Comparison` (predicted boxes): List of each predicted bounding box, sorted by confidence score. Beside each, you compare it to each ground truth (GT) object. This comparison asks: is the prediction a True Positive (TP) or a False Positive (FP)?
    - `TP`: Prediction matches a GT box and passes the IOU threshold (e.g., IOU > 0.5)
    - `FP`: Prediction does not sufficiently match any remaining GT, or is a duplicate detection
2. `TP/FP Indicator`: The next column is an explicit label, either TP (1) or FP (0), for each prediction after comparing with GT
3. `Cumulative TP`: As you move down the predictions (from highest to lowest confidence), this column tallies how many TPs have been found so far
4. `Cumulative FP`: This column counts total FPs discovered so far as you add each prediction
5. `Precision`:

\text{Precision} = \frac{\text{Cumulative TP}}{\text{Cumulative TP} + \text{Cumulative FP}}

6. `Recall`:
\text{Recall} = \frac{\text{Cumulative TP}}{\text{Total\ Ground\ Truth\ (GT)}}

This shows, at each step, what fraction of all GT objects have been detected so far

After we get the AP for each class, we plot its Precision-Recall curve, and the average of the areas under the curves of the different classes is the Mean Average Precision (mAP). The actual computation of AP can be done with multiple approaches; here are two of them
- `Area under Curve` (AUC):
    - Treat the Precision-Recall curve as a series of steps: at each distinct recall level, draw a rectangle whose height is the precision at that point
    - The area is the sum of the heights of all these rectangles, each weighted by how much recall increased at that step:
$$
AP = \sum_{i=1}^{n} (r_i - r_{i-1}) \cdot p_i
$$
    - Where `p_i` and `r_i` are the precision and recall at the `i`-th threshold
    - This approach is used by scikit-learn and the COCO mAP standard; it's direct and reflects the empirical performance across all recall levels
- Interpolation Methods:
    - Instead of just using the raw precision at each recall point, 11-point interpolation (PASCAL VOC, classic method) records precision at 11 recall levels (0, 0.1, …, 1.0)
    - At each recall value, take the highest precision found for that or any higher recall threshold
    - AP is then averaged over these values:
$$
AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1.0\}} \text{InterpolatedPrecision}(r)
$$
    - This makes the curve "flat" between points, which can slightly inflate scores for models with irregular curves

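Before the full implementation that follows, here's a tiny hand-rolled walk-through (purely illustrative TP/FP labels, not taken from the figure above) showing how cumulative TP/FP counts become a PR curve and then an AP value with the area method:

```python
import numpy as np

# Toy example (illustrative numbers only): 4 detections already sorted by
# confidence, labelled TP/FP against 3 ground-truth boxes of one class.
tp = np.array([1, 0, 1, 1])              # 1 = true positive, 0 = false positive
num_gts = 3

cum_tp = np.cumsum(tp)                   # [1, 1, 2, 3]
cum_fp = np.cumsum(1 - tp)               # [0, 1, 1, 1]
precision = cum_tp / (cum_tp + cum_fp)   # ≈ [1.0, 0.5, 0.67, 0.75]
recall = cum_tp / num_gts                # ≈ [0.33, 0.33, 0.67, 1.0]

# Area method: pad, take the precision envelope, sum over recall steps
r = np.concatenate(([0.0], recall, [1.0]))
p = np.concatenate(([0.0], precision, [0.0]))
for i in range(len(p) - 1, 0, -1):
    p[i - 1] = max(p[i - 1], p[i])
steps = np.where(r[1:] != r[:-1])[0]
ap = np.sum((r[steps + 1] - r[steps]) * p[steps + 1])
print(round(ap, 3))                      # prints 0.833
```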
Code for calculating the area under the Precision-Recall curve for a set of classes using both methods (AUC and Interpolation):

```python
import numpy as np

def compute_map(det_boxes, gt_boxes, iou_threshold=0.5, method='area'):
    # Collect all ground-truth class labels
    gt_labels = {cls_key for im_gt in gt_boxes for cls_key in im_gt.keys()}
    aps = []
    for label in gt_labels:
        print(f"Computing AP for {label}")
        # Collect all detections for this class
        cls_dets = [
            [im_idx, det]
            for im_idx, im_dets in enumerate(det_boxes)
            for det in im_dets.get(label, [])
        ]
        # Sort by confidence (descending)
        cls_dets = sorted(cls_dets, key=lambda k: -k[1][-1])

        # Track matched GT boxes
        gt_matched = [
            [False] * len(im_gts.get(label, []))
            for im_gts in gt_boxes
        ]
        # Total GT count for this class
        num_gts = sum(len(im_gts.get(label, [])) for im_gts in gt_boxes)

        tp = [0] * len(cls_dets)
        fp = [0] * len(cls_dets)

        # Loop over detections
        for det_idx, (im_idx, det_pred) in enumerate(cls_dets):
            im_gts = gt_boxes[im_idx].get(label, [])
            max_iou_found = -1
            max_iou_gt_idx = -1
            # Find the best-matching GT box
            for gt_idx, gt_box in enumerate(im_gts):
                iou = get_iou(det_pred[:-1], gt_box)
                if iou > max_iou_found:
                    max_iou_found = iou
                    max_iou_gt_idx = gt_idx
            # Apply matching rules
            if (
                max_iou_found < iou_threshold
                or max_iou_gt_idx == -1
                or gt_matched[im_idx][max_iou_gt_idx]
            ):
                fp[det_idx] = 1
            else:
                tp[det_idx] = 1
                gt_matched[im_idx][max_iou_gt_idx] = True

        # Accumulate TP, FP
        tp = np.cumsum(tp)
        fp = np.cumsum(fp)

        eps = np.finfo(np.float32).eps
        recalls = tp / np.maximum(num_gts, eps)
        precisions = tp / np.maximum((tp + fp), eps)

        # -----------------------------
        # AREA METHOD (Pascal VOC 2010+)
        # -----------------------------
        if method == "area":
            recalls = np.concatenate(([0.0], recalls, [1.0]))
            precisions = np.concatenate(([0.0], precisions, [0.0]))

            # Precision envelope
            for i in range(len(precisions) - 1, 0, -1):
                precisions[i - 1] = np.maximum(precisions[i - 1], precisions[i])

            # Points where recall changes
            i = np.where(recalls[1:] != recalls[:-1])[0]
            # Actual AP = sum over recall steps
            ap = np.sum((recalls[i + 1] - recalls[i]) * precisions[i + 1])
        # -----------------------------
        # 11-POINT INTERPOLATED METHOD
        # -----------------------------
        elif method == "interp":
            ap = 0.0
            for interp_r in np.arange(0, 1.0001, 0.1):
                precision_at_r = precisions[recalls >= interp_r]
                max_prec = precision_at_r.max() if precision_at_r.size > 0 else 0.0
                ap += max_prec
            ap /= 11.0
        else:
            raise ValueError("Method can only be 'area' or 'interp'")

        print(f"AP for class {label} with threshold {iou_threshold:.2f} = {ap:.4f}")
        # Compute for all classes and append them
        aps.append(ap)

    mean_ap = sum(aps) / len(aps)
    return mean_ap
```

##### ROI (Region of Interest) Pooling Layer
ROI Pooling is the layer in Fast R-CNN that converts variable-size region proposals into fixed H×W features by max-pooling bins from a shared convolutional feature map, enabling a single conv pass per image and per-ROI fully connected heads for classification and box regression. It exists to make detection fast and end-to-end by avoiding thousands of conv forward passes while still feeding fixed-size vectors to FC layers
- Runs the backbone CNN once per image, then slices features per proposal, which cuts inference from seconds to sub-second on VOC
- Produces fixed-size tensors for each ROI so FC classification and bounding box regression heads can be shared across proposals of different original sizes
How it works?
- Inputs: a conv feature map and a set of ROIs defined in image coordinates, plus the backbone’s stride to map those ROIs into feature‑map coordinates - Quantize the mapped ROI coordinates to integers (rounding) so they index discrete feature‑map cells; this may include some pixels outside the ROI and exclude some inside due to rounding - Divide the quantized ROI window into an H×W grid of bins (e.g., 7×7), and in each bin take channel‑wise max pooling to produce a fixed H×W×C output per ROI that is then flattened for the heads - Example mapping: a 400×400 image with stride 40 yields a 10×10 feature map; an ROI mapped to non‑integer cell boundaries is rounded, then split into equal bins like 2×2 or 7×7 before per‑bin max pooling Given an ROI on the feature map with top-left coordinate (x, y), height (roi_{height}), and width (roi_{width}), ROI pooling divides this window into (pool_h) rows and (pool_w) columns (e.g., 7 x 7) Vertical bin boundaries:

$$
h_{\text{start}} = \left\lfloor \frac{j \cdot roi_{height}}{pool_h} \right\rfloor
\qquad
h_{\text{end}} = \left\lceil \frac{(j+1) \cdot roi_{height}}{pool_h} \right\rceil
$$

Horizontal bin boundaries:

$$
w_{\text{start}} = \left\lfloor \frac{i \cdot roi_{width}}{pool_w} \right\rfloor
\qquad
w_{\text{end}} = \left\lceil \frac{(i+1) \cdot roi_{width}}{pool_w} \right\rceil
$$

Channel-wise max pooling inside each bin then gives every ROI a fixed output shape

$$
(pool_h,\; pool_w,\; C)
$$

even though the original ROI may have any spatial size (a minimal sketch of this bin pooling follows).
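How the bin arithmetic above turns into code: a minimal NumPy sketch for a single ROI on a single feature map (the function name `roi_pool` and the toy sizes are made up for illustration; real implementations such as `torchvision.ops.roi_pool` handle batches of ROIs and more edge cases):

```python
import numpy as np

def roi_pool(feat, roi, stride=16, pool_h=7, pool_w=7):
    """Max-pool one ROI (x1, y1, x2, y2 in image coords) from a C x H x W feature map."""
    C, H, W = feat.shape
    # Map ROI to feature-map coordinates and quantize (round) to integer cells
    x1, y1, x2, y2 = [int(round(c / stride)) for c in roi]
    x1, y1 = max(x1, 0), max(y1, 0)
    x2, y2 = min(max(x2, x1 + 1), W), min(max(y2, y1 + 1), H)
    roi_w, roi_h = x2 - x1, y2 - y1

    out = np.zeros((C, pool_h, pool_w), dtype=feat.dtype)
    for j in range(pool_h):                               # vertical bins
        h_start = y1 + (j * roi_h) // pool_h              # floor
        h_end = y1 + -(-(j + 1) * roi_h // pool_h)        # ceil
        h_end = max(h_end, h_start + 1)                   # keep bins non-empty
        for i in range(pool_w):                           # horizontal bins
            w_start = x1 + (i * roi_w) // pool_w
            w_end = x1 + -(-(i + 1) * roi_w // pool_w)
            w_end = max(w_end, w_start + 1)
            # Channel-wise max over the bin
            out[:, j, i] = feat[:, h_start:h_end, w_start:w_end].max(axis=(1, 2))
    return out

# Toy usage: 512-channel 10x10 feature map (stride 16 -> 160x160 image), one ROI, 2x2 bins
feat = np.random.rand(512, 10, 10)
pooled = roi_pool(feat, roi=(32, 16, 144, 128), pool_h=2, pool_w=2)
print(pooled.shape)   # (512, 2, 2)
```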
##### RPN (Region Proposal Network)
The job of the RPN is to return the regions that possibly contain the objects we need. To do this, we take the input image, pass it through a CNN that outputs a feature map, and perform a sliding-window operation on this feature map, predicting for each window whether it contains an object or not. Because the CNN features are shared, this operation is not that costly

But there is a problem: the sliding window has a fixed size while target objects come in different sizes. So we use _Anchor Boxes_: at a given location we make multiple predictions using multiple reference boxes of different sizes

Even then, an object may not fit any anchor exactly, so we let an anchor predict a part of the object. For example, if the box covers only the tire of a car, we still count it as positive, since it covers part of the expected object (the car); such boxes are refined into tight boxes by the Bounding Box Regressor layer (discussed above) that comes later
![[Pasted image 20251119170320.png]]
After passing the feature map through an activation function (a mathematical function in a neural network that determines a neuron's output by introducing non-linearity into the weighted sum of its inputs), we use 1x1 convolutions to generate the predictions, and for this we use two such convolution layers
1. One has 2k output channels, which are the foreground and background score predictions for each of the k anchor boxes at each window location; objectness is implemented as a two-class SoftMax, hence 2k values per location
2. The other 1x1 convolution has 4k channels; these are the four transformation parameters that the network predicts - the transformation each of the k reference boxes must undergo to better fit the underlying object - so, as with objectness, we get 4k values at each location of the feature map
![[Pasted image 20251119170910.png]]
Now when we put the two architectures together, we can see the flow: input image passed through CNN -> feature maps generated and given to RPN -> the generated proposals are passed to the detection-specific part of the network, which uses ROI pooling on the feature maps to produce detection predictions for the proposals generated by the RPN
![[Pasted image 20251119172118.png]]
For training the anchor boxes we use [[13. Object Detection#iou-intersection-over-union|IOU]] to select the best match among the proposed boxes
![[Pasted image 20251119172724.png]]
_RPN Training_: The RPN generates ~20,000 anchor boxes across the image (for a 1000×600 image with stride 16 and 9 anchors per location). Each anchor needs a ground truth label
Assignment rules:
1. Positive anchors (foreground):
    - IOU with any ground truth box > 0.7
    - OR the anchor with the highest IOU with a ground truth box (even if < 0.7). This rule ensures every ground truth object has at least one positive anchor, preventing training samples where no anchor matches small/oddly-shaped objects
2. Negative anchors (background):
    - IOU with all ground truth boxes < 0.3
3. Ignored anchors:
    - 0.3 ≤ IOU ≤ 0.7
    - These are not used in training (neither positive nor negative)
This creates a clear separation: high overlap = object, low overlap = background. The middle zone is ambiguous and excluded

Mini-batch Sampling: From all positive and negative anchors:
- Sample N images (typically N=1 in the video)
- From each image, sample 256 anchors total:
    - 128 positive (foreground)
    - 128 negative (background)
- If fewer than 128 positives exist, pad with negatives to reach 256 total
This 1:1 ratio prevents class imbalance (background anchors vastly outnumber foreground)

RPN Loss Function:

$$
\mathcal{L} = \frac{1}{N_{\text{cls}}} \sum_i L_{\text{cls}}(p_i, p_i^*) \;+\; \lambda \, \frac{1}{N_{\text{reg}}} \sum_i p_i^* \, L_{\text{reg}}(t_i, t_i^*)
$$

- Classification term (log loss over object vs background):

$$
L_{\text{cls}}(p_i, p_i^*) = -\log(p_{i,\text{correct}})
$$

- Regression term (Smooth L1 over the four box parameters):

$$
L_{\text{reg}}(t_i, t_i^*) = \sum_{j \in \{x, y, w, h\}} \text{SmoothL1}(t_{i,j} - t_{i,j}^*)
$$

$$
\text{SmoothL1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}
$$

- The regression loss is applied only to positive anchors:

$$
p_i^* = 1 \quad \Rightarrow \quad \text{apply reg loss}
$$

- Regression targets relative to the anchor (x_a, y_a, w_a, h_a):

$$
t_x^* = \frac{x_{gt} - x_a}{w_a}, \quad t_y^* = \frac{y_{gt} - y_a}{h_a}
$$

$$
t_w^* = \log\left(\frac{w_{gt}}{w_a}\right), \quad t_h^* = \log\left(\frac{h_{gt}}{h_a}\right)
$$

- Decoding predicted deltas back into a box:

$$
x' = t_x \cdot w_a + x_a, \quad y' = t_y \cdot h_a + y_a
$$

$$
w' = w_a \cdot e^{t_w}, \quad h' = h_a \cdot e^{t_h}
$$

Detection head (Fast R-CNN) loss - for a proposal with predicted class scores p, true class u, predicted box t^u and ground truth box v:

$$
\mathcal{L}(p, u, t^{u}, v) = L_{\text{cls}}(p, u) + \lambda \,[u \ge 1]\, L_{\text{loc}}(t^{u}, v)
$$

$$
L_{\text{cls}}(p, u) = -\log(p_u)
$$

$$
L_{\text{loc}}(t^{u}, v) = \sum_{i \in \{x, y, w, h\}} \text{SmoothL1}(t^{u}_i - v_i)
$$

$$
\text{SmoothL1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}
$$

$$
[u \ge 1] = \begin{cases} 1, & u > 0 \\ 0, & u = 0 \end{cases}
$$

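The delta encoding/decoding defined by the formulas above can be sketched directly (plain Python; `encode_deltas`/`decode_deltas` are illustrative names, not from any particular library; boxes are given as center-x, center-y, width, height):

```python
import math

def encode_deltas(gt, anchor):
    """Targets (t_x*, t_y*, t_w*, t_h*) for a GT box relative to an anchor."""
    xg, yg, wg, hg = gt
    xa, ya, wa, ha = anchor
    tx = (xg - xa) / wa
    ty = (yg - ya) / ha
    tw = math.log(wg / wa)
    th = math.log(hg / ha)
    return tx, ty, tw, th

def decode_deltas(deltas, anchor):
    """Apply predicted deltas (t_x, t_y, t_w, t_h) to an anchor to recover a box."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    x = tx * wa + xa
    y = ty * ha + ya
    w = wa * math.exp(tw)
    h = ha * math.exp(th)
    return x, y, w, h

# Round-trip check: decoding the encoded targets recovers the GT box
anchor = (100.0, 100.0, 64.0, 128.0)
gt = (110.0, 90.0, 80.0, 100.0)
print([round(v, 4) for v in decode_deltas(encode_deltas(gt, anchor), anchor)])
# -> [110.0, 90.0, 80.0, 100.0]
```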
_Joint Training Strategy_: - Forward Pass Flow 1. Shared conv layers: Image → feature map (one pass) 2. RPN branch: - 3×3 conv → 1×1 conv (objectness + box deltas) - Generate ~20,000 anchor predictions - Apply NMS, select top-2000 proposals 3. Detection branch: - ROI pooling on top-2000 proposals using shared feature map - FC layers → classification (K+1) + regression (4K) Loss Computation: For RPN: - Sample 256 anchors (128 pos, 128 neg) - Compute RPN loss: L_rpn For Detection: - Sample 64 proposals from RPN output (16 pos, 48 neg) - Compute detection loss: L_det Total loss: L_total = L_rpn + L_det Backward Pass: - RPN-specific layers: gradients from L_rpn - Detection-specific layers: gradients from L_det - Shared conv layers: gradients from both L_rpn and L_det This joint optimization teaches shared features useful for both proposal generation and object classification _4-Step Alternating Training_ - Step 1: Train RPN - Initialize from ImageNet pretrained model - Train RPN layers only (conv layers also fine-tuned) - Use RPN loss with anchor assignments - Output: RPN weights W_rpn(1) - Step 2: Train Fast R-CNN (Separate Network) - Initialize separate Fast R-CNN from ImageNet - Generate proposals using RPN from Step 1 - Train detection layers on these proposals - Output: Detection weights W_det(2) (including separate conv layers) At this stage, RPN and Fast R-CNN have separate conv layers (no sharing yet). This is the "unshared" baseline - Step 3: Fine-tune RPN (Shared Conv) - Take conv layers from Step 2's Fast R-CNN (detection-tuned) - Initialize RPN with these shared conv layers - Freeze conv layers, fine-tune only RPN-specific layers (3×3, 1×1 conv) - Output: RPN weights WRPN(3)W_{RPN}^{(3)}WRPN(3) with shared conv - Step 4: Fine-tune Fast R-CNN (Shared Conv) - Keep shared conv layers from Step 3 frozen - Generate proposals using RPN from Step 3 - Fine-tune only detection-specific layers (FC, classification, regression heads) - Output: Final Faster R-CNN model Result: Shared conv layers + RPN layers + detection layers all trained, but in alternating fashion _Summary_: - RPN training: 256 anchors per batch (50% pos), IoU thresholds 0.7/0.3, loss has cls + reg with λ=10 - Detection training: 64 proposals per batch (25% pos), IoU thresholds 0.5/0.1, loss has cls + reg with λ=1 - 4-step alternating: Train RPN → train detection (separate) → fine-tune RPN (shared conv, frozen) → fine-tune detection (shared conv, frozen) - Why it works: Shared conv layers learn features useful for both objectness (RPN) and classification (detection). Two-stage refinement (anchor → proposal → detection box) is superior to one-stage ##### DETR(Detection Transformer) It's a transformer based object detection model. This model will be given input as image and the output will be bounding boxes wrt class predictions and there confidence scores _Overview_: ![image.png](https://pimg.mohammadsadiq4950.workers.dev/gist/9d32f5cd45b9bd80c3396ed492a5a279/obsidian-upload-1766065351772.png) 1. The input image is first passed through a backbone, which is a pre-trained CNN like ResNet-50 or ResNet-101 (trained on ImageNet). The last pooling and classification layers are removed to get a feature map that captures semantic information from different image regions. These grid cells are flattened into a sequence and passed to a transformer encoder 2. The encoder is made of multiple self-attention and feed-forward layers with residual connections. 
After encoding, each grid cell representation contains contextual information from the whole image 3. The decoder takes a fixed number of object queries as input. These object queries are just randomly initialized embeddings. The decoder refines these query embeddings and turns each one into an object prediction. Each query predicts exactly one bounding box and one class, so the number of predictions equals the number of queries 4. The transformer decoder layers are similar to encoder layers but also include cross-attention, where query embeddings attend to the encoder’s image features. After all decoder layers, the final query representations are passed through two MLPs to predict class labels and bounding boxes. Some queries predict actual objects like cars or persons, while the remaining ones predict the background class 5. Since there are as many predictions as queries, the key question during training is how to assign ground-truth targets to these predictions ? Unlike anchor-based models like Faster R-CNN or SSD, DETR does not rely on anchor overlap. Instead, it evaluates predictions using a cost based on class probability and box distance (including IOU-based measures) 6. In anchor-based models, multiple anchors can match the same target, leading to duplicate predictions and the need for NMS. DETR enforces a one-to-one mapping: each target box is matched to only one predicted box, and unmatched predictions are assigned the background class 7. This one-to-one mapping reduces duplicate boxes and encourages each query to predict a different object or background if no object exists 8. DETR treats object detection as a set prediction problem. The model predicts a set of boxes, and a matching algorithm assigns predictions to ground-truth boxes uniquely and optimally. Each possible prediction-target pair has a matching cost, computed using box distance and class probability distance. The exact cost calculation is not important here, just that lower cost means a better match 9. Among all valid one-to-one assignments, DETR selects the assignment with the minimum total cost. Predictions not matched to any target are treated as background 10. During training, matched predictions learn to predict the correct class and box coordinates, while unmatched predictions learn to predict the background class 11. Conceptually, this is similar to earlier models where predictions are mapped to targets or background. The main difference is that DETR changes the matching strategy and removes anchors and NMS entirely _Backbone_: ![image.png](https://pimg.mohammadsadiq4950.workers.dev/gist/e74e65a427d7a801314b2fea43ca0c43/obsidian-upload-1766065951936.png) - DETR uses a pre-trained CNN backbone, usually ResNet-50 or ResNet-101, which is first trained on the ImageNet classification task - To use it as a feature extractor, the final pooling and fully connected layers are removed. This gives a feature map of size `C × feat_h × feat_w`, where `C` is the number of output channels from the last convolution layer. Since the overall stride is 32, an input image of size `640 × 640` produces a feature map of size `20 × 20` - To match the transformer’s hidden dimension, a projection layer is added. 
This is a `1 × 1` convolution that converts the backbone output channels to the transformer dimension - After this projection, the final feature map has shape `D_model × feat_h × feat_w`, where `D_model` is the transformer hidden size _DETR Encoder_: ![image.png](https://pimg.mohammadsadiq4950.workers.dev/gist/a3af66dab3ce59b95492f923836260c4/obsidian-upload-1766066978601.png) - The transformer encoder takes a sequence as input. So the backbone feature map of shape `D_model × feat_h × feat_w` is flattened by collapsing the spatial dimensions, giving a sequence of length `feat_h × feat_w`, where each element is a `D_model`-dimensional feature vector - This sequence is passed to the encoder, which is a stack of transformer encoder layers. The input to the first encoder layer is the projected backbone features, and for the remaining layers, the input is the output of the previous encoder layer - Each encoder layer follows the standard transformer structure: - Layer normalization - Self-attention - Residual connection - Layer normalization - Feed-forward MLP - Residual connection - The encoder operates on the sequence in a permutation-invariant way, so it needs spatial information. To handle this, 2D positional encodings are added - DETR uses sinusoidal positional encoding, extended to 2D. A `D/2`-dimensional encoding is created for the height coordinate and another `D/2`-dimensional encoding for the width coordinate, and they are concatenated to form a `D_model`-dimensional positional encoding for each spatial location - Unlike standard transformers where positional encoding is added only at the input, DETR adds positional information at every self-attention layer, specifically to the query and key tensors. The authors found this gives slightly better results - Through self-attention, the encoder allows any part of the image to attend to any other part. This helps in understanding the full extent of objects and in capturing contextual information. The final output of the encoder is a sequence with the same shape as the input, but with features that are more suitable for the object detection task _DETR Decoder_: - The decoder is made of a stack of decoder layers. The first decoder layer takes the object queries as input, and each following layer takes the output of the previous decoder layer. Each decoder layer contains: - Layer normalization - Self-attention - Residual connection - Layer normalization - Cross-attention - Residual connection - Layer normalization - Feed-forward MLP - Residual connection - `Multi Head Attention(MHA)` - ![image.png](https://pimg.mohammadsadiq4950.workers.dev/gist/aef835c8c1e9a939cb2e616a9fd4515b/obsidian-upload-1766067266635.png) - In the self-attention block, each object query attends to all other queries. This allows the model to reason about relationships between different objects, such as relative positions or co-occurrence - For self-attention, Q, K, and V are all derived from the object queries - Instead of initializing queries as random embeddings, DETR initializes all queries as zeros - A set of learnable embeddings, called output position encodings, is added to the query slots only inside the attention layers before computing Q and K. These embeddings do not represent spatial position. 
The name is borrowed from language models and can be confusing - All query slots start identical, and the only thing that differentiates them during attention is these output position embeddings - This design is similar to the encoder, where spatial position information is added only at the attention layers rather than at the input - `Multi Head Cross-Attention (MHCA)` - ![image.png](https://pimg.mohammadsadiq4950.workers.dev/gist/71605b810c79043425edb87affbfa94b/obsidian-upload-1766067322576.png) - The cross-attention layer allows object queries to attend to image features from the encoder, giving them access to the full image context - In cross-attention: - Q comes from the query slots - K and V come from the encoder output (image features) - Before computing attention: - The 2D sinusoidal positional encoding is added to the encoder image features before computing K - The output position embeddings are added to the query slots before computing Q - This helps the decoder combine object-level reasoning with global and local image information - `Decoder Output`: - After passing through all decoder layers, the decoder outputs n query representations, one for each query slot - Each output query is independently decoded into a prediction using two shared MLPs: - One MLP predicts class probabilities - Another MLP predicts bounding box coordinates `(cx, cy, width, height)` - Box coordinates are normalized between 0 and 1 with respect to the input image - Since all queries are processed in parallel and MLP weights are shared, the model produces n object predictions in parallel, one from each query - After a forward pass, DETR outputs n predicted boxes - During training, each prediction must be assigned a target, which can be a ground-truth box, or background - The key constraint: - Each prediction → only one target - Each target → only one prediction - This is a one-to-one assignment problem, which is solved using the [[13. Object Detection#hungarian-matching-algorithm|Hungarian Matching Algorithm]] _Matching Strategy and Cost of DETR_: Prediction-Target Matching: - Just like the worker-task example, DETR assigns predicted boxes to target boxes - Constraints: - One predicted box → one target - One target → one predicted box - Sometimes: - Number of predicted boxes > number of targets - Number of targets < number of predictions Making the Cost Matrix Square: - To apply Hungarian matching, we need a square cost matrix. - DETR solves this by: - Fixing the number of predictions using a fixed number of object queries - Example: - Max objects per image = 100 - Number of queries = 100 - Model always predicts 100 boxes - If an image has fewer targets: - Extra targets are treated as dummy background boxes - Cost of assigning any prediction to a background box = 0 - This gives a square matrix and allows Hungarian matching - In practice, this dummy background handling is abstracted away by the library Cost Function for Matching: - To apply Hungarian matching, we must define the cost of assigning a predicted box to a target box. - The total cost has two parts: 1. Classification Cost: - We look at the probability predicted by the model for the target class. - Example: - Target class = person - Predicted probability for person = `p_person` - We want: - High probability → low cost - Low probability → high cost $$ \mathcal{L}_{cls} = 1 - p_{\text{target}} $$ - Since adding a constant does not change optimal assignment, we can simplify to: $$ \mathcal{L}_{cls} = - p_{\text{target}} $$ 2. 
Localization Cost:
    - Localization cost measures how close the predicted box is to the target box
    - L1 Distance (Box Regression) - box format: (c_x, c_y, w, h)
$$
\mathcal{L}_{L1} = \| b - \hat{b} \|_1
$$
    - Issue: the same relative error gives different L1 values for small vs large boxes
    - Generalized IOU (GIoU): to handle scale differences, DETR also uses Generalized IOU. We want:
        - High overlap → low cost
        - Low overlap → high cost

$$
\mathcal{L}_{giou} = - \text{GIoU}(b, \hat{b})
$$

Total matching cost for prediction i and target j:

$$
\mathcal{C}_{i,j} = \lambda_{cls} \mathcal{L}_{cls} + \lambda_{L1} \mathcal{L}_{L1} + \lambda_{giou} \mathcal{L}_{giou}
$$

Once the matching is fixed, the training loss for a matched prediction-target pair uses:

$$
\mathcal{L}_{cls} = \text{CE}(y_i, \hat{p}_i)
$$

$$
\mathcal{L}_{L1} = \text{SmoothL1}(b_i, \hat{b}_i)
$$

$$
\mathcal{L}_{giou} = 1 - \text{GIoU}(b_i, \hat{b}_i)
$$

$$
\mathcal{L}_{DETR} = \lambda_{cls} \mathcal{L}_{cls} + \lambda_{L1} \mathcal{L}_{L1} + \lambda_{giou} \mathcal{L}_{giou}
$$
- Where: `λ` values control the relative importance of each term - Each predicted box is assigned either: - a ground-truth target box, or - the background class _Auxiliary Losses in DETR_: #Q Why Auxiliary Losses Are Used #A - Using only the final decoder layer can make training slow to converge - To improve convergence and mAP: DETR applies losses at all decoder layers, not just the last one. Auxiliary Loss Mechanism - Each decoder layer outputs object query representations. - These outputs are passed through the same shared MLPs: - One for class prediction - One for bounding box prediction - For each decoder layer: 1. Predicted boxes are generated 2. Hungarian matching is applied independently 3. Classification and localization losses are computed Auxiliary Loss Formula - If there are `L` decoder layers:

$$
\mathcal{L}_{aux} = \sum_{l=1}^{L} \left( \lambda_{cls} \mathcal{L}_{cls}^{(l)} + \lambda_{L1} \mathcal{L}_{L1}^{(l)} + \lambda_{giou} \mathcal{L}_{giou}^{(l)} \right)
$$

$$
\mathcal{L}_{total} = \mathcal{L}_{DETR}^{(final)} + \mathcal{L}_{aux}
$$
- Training: - Uses losses from all decoder layers - Helps earlier layers learn meaningful representations - Inference: - Uses only the final decoder layer outputs - Earlier layer outputs are ignored _Connecting the Dots_: - Hungarian matching → assigns predictions to targets - DETR loss → trains predictions using those assignments - Auxiliary loss → stabilizes training and improves convergence - No anchors, no NMS, clean one-to-one prediction setup ##### Hungarian Matching Algorithm _Intuition_: Worker - Task Problem - Imagine n workers and n tasks - Every worker can do every task, but with different costs - Each worker can do only one task, and each task can be assigned to only one worker - Cost Matrix - Rows → workers - Columns → tasks - Values → cost ``` Task1 Task2 Task3 Task4 WorkerA 10 7 8 2 WorkerB 6 4 5 3 WorkerC 7 6 9 5 WorkerD 8 5 6 4 ``` - Goal: assign tasks to workers with minimum total cost - This is exactly the same as assigning predicted boxes to target boxes in DETR - Key Idea Behind the Algorithm: - Adding or subtracting a constant value from a row or column: - Changes the total cost - Does NOT change the optimal assignment - The algorithm uses this idea to create zeros in the matrix - A zero cost means a very good assignment _Hungarian Algorithm_ - Step 1: Row Reduction - Subtract the minimum value in each row from all elements of that row - Example (Row A minimum = 2): - Before: 10 7 8 2 - After : 8 5 6 0 - This keeps the optimal assignment unchanged - Goal: introduce zeros - Step 2: Column Reduction - Subtract the minimum value in each column from all elements of that column - After row + column reduction, we may get something like: ``` T1 T2 T3 T4 A 8 4 6 0 B 0 3 5 2 C 0 7 1 7 D 2 0 0 1 ``` - Step 3: Cover All Zeros with Minimum Lines - Draw the minimum number of horizontal and vertical lines needed to cover all zeros ``` T1 T2 T3 T4 A 8 4 6 0 ← line B 0 3 5 2 C 0 7 1 7 D 2 0 0 1 ← line ↑ ↑ line line ``` - Count the number of lines used - Step 4: Check for Solution - If number of lines = n, then: - A zero-cost assignment exists - We are done - If number of lines < n, then: - We need to create more zeros - In this example: - Lines used = 3, Required = 4 - Thus not done yet - Step 5: Find Minimum Uncovered Value - Look at all entries not crossed by any line - Find the smallest value among them - Uncovered region (example): - 1, 2, 3 ... - Minimum = 1 - Step 6: Update the Matrix - Subtract this minimum value from: all uncovered elements - Add the same value to: elements covered twice (intersection of two lines) - This operation: - Creates new zeros - Keeps all values non-negative - Preserves the optimal assignment - Step 7: Repeat - Go back to Step 3 - Repeat until: number of covering lines = n - Eventually we get: ``` T1 T2 T3 T4 A 8 4 6 0 B 0 3 5 2 C 2 7 0 6 D 2 0 1 0 ``` - Now we can assign: A → T4, B → T1, C → T3, D → T2 - 4 zero-cost assignments _Connection to [[13. Object Detection#detrdetection-transformer|DETR]]_: - Rows → predicted boxes - Columns → ground-truth boxes - Cost → combination of: - class probability loss - bounding box distance loss - Hungarian matching ensures: - one-to-one mapping - no duplicate predictions - no need for NMS
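To tie the worker-task example back to DETR, here is a hedged sketch that builds a toy cost matrix (classification + L1 + GIoU terms with made-up λ weights, boxes and probabilities) and solves it with `scipy.optimize.linear_sum_assignment`, which solves the same minimum-cost one-to-one assignment problem:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def giou(box_a, box_b):
    """Generalized IoU for boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest box enclosing both
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area

# Toy setup (illustrative numbers): 3 queries, 2 targets, 2 classes + background (last column)
pred_probs = np.array([[0.7, 0.2, 0.1],    # query 0: mostly class 0
                       [0.1, 0.1, 0.8],    # query 1: mostly background
                       [0.2, 0.7, 0.1]])   # query 2: mostly class 1
pred_boxes = [(10, 10, 50, 50), (60, 60, 90, 90), (100, 20, 160, 80)]
tgt_classes = [0, 1]
tgt_boxes = [(12, 8, 52, 48), (98, 22, 158, 78)]
w_cls, w_l1, w_giou = 1.0, 0.05, 2.0       # made-up weights

cost = np.zeros((len(pred_boxes), len(tgt_boxes)))
for i, (pb, probs) in enumerate(zip(pred_boxes, pred_probs)):
    for j, (tb, tc) in enumerate(zip(tgt_boxes, tgt_classes)):
        c_cls = -probs[tc]                              # high class probability -> low cost
        c_l1 = sum(abs(p - t) for p, t in zip(pb, tb))  # L1 box distance
        c_giou = -giou(pb, tb)                          # high overlap -> low cost
        cost[i, j] = w_cls * c_cls + w_l1 * c_l1 + w_giou * c_giou

rows, cols = linear_sum_assignment(cost)   # one-to-one assignment, minimum total cost
print(list(zip(rows, cols)))               # matched (query, target) pairs
# Queries not appearing in `rows` are trained to predict the background class
```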