ResNet (2015), by Kaiming He et al. from Microsoft Research, introduced the concept of residual learning and “skip connections”, enabling the successful training of very deep neural networks (up to 152 layers) and sparking a revolution in deep learning for computer vision. ResNet won the ILSVRC 2015 ImageNet classification competition with a top-5 error of 3.57%, surpassing human-level performance.
Before ResNet, making neural networks deeper did not consistently improve accuracy; beyond a certain number of layers, performance saturated or actually degraded. This issue went beyond overfitting and was tied to optimization difficulties (primarily vanishing gradients).
The Key Idea is best understood by contrasting two kinds of blocks:
- Vanilla (Plain) Block: a traditional block with two stacked layers. Its output is
( a^{[l+2]} = g(z^{[l+2]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]}) ), with ( a^{[l+1]} = g(W^{[l+1]} a^{[l]} + b^{[l+1]}) ),
where:
- ( a^{[l]} ) is the input,
- ( W^{[l+2]} ) and ( b^{[l+2]} ) are the weights and biases of the second layer,
- ( g ) is the activation function (e.g., ReLU).
- Residual Block (ResNet Block): instead of learning the mapping ( H(x) ) directly, learn the residual ( F(x) = H(x) - x ). The block computes
( x_{l+1} = f(x_l + F(x_l, W_l)) ),
where:
- ( x_l ) = input to the ( l )-th layer,
- ( F(x_l, W_l) ) = residual function (“the change to make”),
- ( f ) = activation function (usually ReLU).
The “skip connection” allows the input ( x_l ) to be added directly to the output of the stacked layers.
If the stacked layers fail to learn anything useful ( F(x) = 0 ), the network simply passes the input forward, so at least an identity mapping is always possible. A minimal code sketch of such a block follows.
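A minimal sketch of this block in PyTorch (illustrative only: BasicResidualBlock and its layer sizes are not the authors' code, and the batch normalization after each convolution follows the common recipe):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two stacked 3x3 convolutions plus an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                                  # x_l
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))               # F(x_l, W_l)
        out = out + identity                          # skip connection: F(x) + x
        return self.relu(out)                         # x_{l+1} = f(x_l + F(x_l, W_l))

# Example: y = BasicResidualBlock(64)(torch.randn(1, 64, 56, 56))  # shape is preserved
```

If the convolutions learn nothing useful (weights near zero, so ( F(x) \approx 0 )), the block approximately passes its input through unchanged, which is exactly the identity fallback described above.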
Why Residual Connections Work
- Mitigate vanishing/exploding gradients: Directly pass gradients to earlier layers during backpropagation, alleviating optimization hurdles in very deep networks (a short derivation follows this list)
- Easier Optimization: Learning residuals is empirically easier than learning the full mapping. Blocks can focus on finer corrections, allowing the training of networks of hundreds or even thousands of layers
- Ensemble-like Behavior: The structure behaves like an ensemble of many shallow subnetworks, improving generalization
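To make the gradient claim concrete, here is the standard one-block derivation (a sketch, with the post-addition activation ( f ) omitted for simplicity, as in the usual identity-mapping analysis):

```latex
% Backpropagation through a residual block x_{l+1} = x_l + F(x_l, W_l)
\frac{\partial \mathcal{L}}{\partial x_l}
  = \frac{\partial \mathcal{L}}{\partial x_{l+1}}
    \frac{\partial x_{l+1}}{\partial x_l}
  = \frac{\partial \mathcal{L}}{\partial x_{l+1}}
    \left( 1 + \frac{\partial F(x_l, W_l)}{\partial x_l} \right)
```

The “1” contributed by the skip connection carries the upstream gradient to earlier layers unchanged, so the gradient does not vanish even when ( \partial F / \partial x_l ) is small.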
Block Variants
- Basic Block (ResNet-34 and below)
- Two stacked 3×3 convolution layers,
- Simple skip (identity) connection
- Bottleneck Block (ResNet-50, 101, 152)
- Three layers per block: 1×1 → 3×3 → 1×1 convolutions,
- First 1×1 reduces dimensions (bottleneck), middle 3×3 processes spatial info, last 1×1 restores dimensions
- Enables deeper models with less computation (a minimal sketch follows this list)
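A minimal PyTorch sketch of the bottleneck variant (illustrative only: the 4× channel expansion is the convention used in ResNet-50/101/152, and this class assumes the input already has the expanded channel count so the identity addition is valid):

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """1x1 reduce -> 3x3 process -> 1x1 restore, plus an identity skip."""
    expansion = 4  # output channels = mid_channels * expansion (ResNet-50 convention)

    def __init__(self, in_channels, mid_channels):
        super().__init__()
        out_channels = mid_channels * self.expansion
        # 1x1 conv shrinks the channel dimension (the "bottleneck")
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        # 3x3 conv does the spatial processing at the reduced width
        self.conv = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        # 1x1 conv restores the channel dimension
        self.restore = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.restore(out))
        # Identity skip; assumes in_channels == mid_channels * expansion.
        return self.relu(out + x)

# Example: y = BottleneckBlock(256, 64)(torch.randn(1, 256, 56, 56))  # 256 -> 64 -> 64 -> 256 channels
```

In this example the 3×3 convolution operates on 64 channels instead of 256, which is where most of the compute savings come from.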
Architecture Details
- ResNet-34: Stacked basic blocks
- ResNet-50/101/152: Stacked bottleneck blocks, e.g., ResNet-50 = 49 Conv + 1 FC = 50 layers
- No extra parameters are introduced by the skip connection when the input and output dimensions match; when they change, a 1×1 projection is applied on the shortcut
- Downsampling is performed via stride-2 convolutions in certain blocks (see the sketch below)
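When a block halves the spatial resolution and/or changes the channel count, the raw identity addition no longer has matching shapes; a common fix, and the one the projection shortcut in the original architecture corresponds to, is a stride-2 1×1 convolution on the skip path. A minimal sketch with illustrative names:

```python
import torch
import torch.nn as nn

class DownsamplingBlock(nn.Module):
    """Basic block whose first 3x3 conv uses stride 2; the shortcut is
    projected with a stride-2 1x1 conv so shapes still match for the add."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=2, padding=1, bias=False)  # halves H and W
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection shortcut: matches both the halved spatial size and the new channel count.
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=2, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.proj(x))

# Example: (1, 64, 56, 56) -> (1, 128, 28, 28)
# y = DownsamplingBlock(64, 128)(torch.randn(1, 64, 56, 56))
```

Note that this projection shortcut does add parameters, which is why the parameter-free claim above holds only when dimensions match.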
Empirical Performance
| Model | Year | Top-5 Error (ImageNet) | Parameters | Depth |
|---|---|---|---|---|
| VGG-19 | 2014 | 7.3% | 143M | 19 |
| GoogLeNet | 2014 | 6.7% | 6.8M | 22 |
| ResNet-152 | 2015 | 3.57% | 60M | 152 |
- ResNet-152: 8× deeper but lower complexity than VGG-19
- Won ILSVRC 2015 classification, detection, and localization, as well as the COCO detection and segmentation competitions