ResNet (2015), by Kaiming He et al. from Microsoft Research, introduced the concept of residual learning and “skip connections”, enabling the successful training of very deep neural networks (up to 152 layers) and sparking a revolution in deep learning for computer vision. ResNet won the ILSVRC 2015 ImageNet classification competition with a top-5 error of 3.57%, surpassing human-level performance.
Before ResNet, making neural networks deeper did not consistently improve accuracy; beyond a certain number of layers, performance saturated or actually degraded. This issue went beyond overfitting and was tied to optimization difficulties (primarily vanishing gradients).
The Key Idea is best understood by contrasting two kinds of blocks:
- Vanilla (Plain) Block: a traditional block with two stacked layers. Its output is
( a^{[l+2]} = g(z^{[l+2]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]}) ), with ( a^{[l+1]} = g(W^{[l+1]} a^{[l]} + b^{[l+1]}) ),
where:
- ( a^{[l]} ) is the input,
- ( W^{[l+2]} ) and ( b^{[l+2]} ) are the weights and biases of the second layer,
- ( g ) is the activation function (e.g., ReLU).
- Residual Block (ResNet Block): instead of learning the mapping ( H(x) ) directly, learn the residual ( F(x) = H(x) - x ). The block computes
( x_{l+1} = f(x_l + F(x_l, W_l)) ),
where:
- ( x_l ) = input to the ( l )-th layer,
- ( F(x_l, W_l) ) = residual function (“the change to make”),
- ( f ) = activation function (usually ReLU).
The “skip connection” allows the input ( x_l ) to be added directly to the output of the stacked layers.
If the stacked layers fail to learn anything useful ( F(x) = 0 ), the network simply passes the input forward, so at least an identity mapping is always possible. A minimal code sketch of such a block follows.
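A minimal sketch of this block in PyTorch (illustrative only: BasicResidualBlock and its layer sizes are not the authors' code, and the batch normalization after each convolution follows the common recipe):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two stacked 3x3 convolutions plus an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                                  # x_l
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))               # F(x_l, W_l)
        out = out + identity                          # skip connection: F(x) + x
        return self.relu(out)                         # x_{l+1} = f(x_l + F(x_l, W_l))

# Example: y = BasicResidualBlock(64)(torch.randn(1, 64, 56, 56))  # shape is preserved
```

If the convolutions learn nothing useful (weights near zero, so ( F(x) \approx 0 )), the block approximately passes its input through unchanged, which is exactly the identity fallback described above.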
Why Residual Connections Work
- Mitigate vanishing/exploding gradients: Directly pass gradients to earlier layers during backpropagation, alleviating optimization hurdles in very deep networks (a short derivation follows this list)
- Easier Optimization: Learning residuals is empirically easier than learning the full mapping. Blocks can focus on finer corrections, allowing the training of networks of hundreds or even thousands of layers
- Ensemble-like Behavior: The structure behaves like an ensemble of many shallow subnetworks, improving generalization
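To make the gradient claim concrete, here is the standard one-block derivation (a sketch, with the post-addition activation ( f ) omitted for simplicity, as in the usual identity-mapping analysis):

```latex
% Backpropagation through a residual block x_{l+1} = x_l + F(x_l, W_l)
\frac{\partial \mathcal{L}}{\partial x_l}
  = \frac{\partial \mathcal{L}}{\partial x_{l+1}}
    \frac{\partial x_{l+1}}{\partial x_l}
  = \frac{\partial \mathcal{L}}{\partial x_{l+1}}
    \left( 1 + \frac{\partial F(x_l, W_l)}{\partial x_l} \right)
```

The “1” contributed by the skip connection carries the upstream gradient to earlier layers unchanged, so the gradient does not vanish even when ( \partial F / \partial x_l ) is small.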
Block Variants
- Basic Block (ResNet-34 and below)
- Two stacked 3×3 convolution layers,
- Simple skip (identity) connection
- Bottleneck Block (ResNet-50, 101, 152)
- Three layers per block: 1×1 → 3×3 → 1×1 convolutions,
- First 1×1 reduces dimensions (bottleneck), middle 3×3 processes spatial info, last 1×1 restores dimensions
- Enables deeper models with less computation (a minimal sketch follows this list)
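A minimal PyTorch sketch of the bottleneck variant (illustrative only: the 4× channel expansion is the convention used in ResNet-50/101/152, and this class assumes the input already has the expanded channel count so the identity addition is valid):

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """1x1 reduce -> 3x3 process -> 1x1 restore, plus an identity skip."""
    expansion = 4  # output channels = mid_channels * expansion (ResNet-50 convention)

    def __init__(self, in_channels, mid_channels):
        super().__init__()
        out_channels = mid_channels * self.expansion
        # 1x1 conv shrinks the channel dimension (the "bottleneck")
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        # 3x3 conv does the spatial processing at the reduced width
        self.conv = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        # 1x1 conv restores the channel dimension
        self.restore = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.restore(out))
        # Identity skip; assumes in_channels == mid_channels * expansion.
        return self.relu(out + x)

# Example: y = BottleneckBlock(256, 64)(torch.randn(1, 256, 56, 56))  # 256 -> 64 -> 64 -> 256 channels
```

In this example the 3×3 convolution operates on 64 channels instead of 256, which is where most of the compute savings come from.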
Architecture Details
- ResNet-34: Stacked basic blocks
- ResNet-50/101/152: Stacked bottleneck blocks, e.g., ResNet-50 = 49 Conv + 1 FC = 50 layers
- No extra parameters are introduced by the skip connection when the input and output dimensions match; when they change, a 1×1 projection is applied on the shortcut
- Downsampling is performed via stride-2 convolutions in certain blocks (see the sketch below)
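When a block halves the spatial resolution and/or changes the channel count, the raw identity addition no longer has matching shapes; a common fix, and the one the projection shortcut in the original architecture corresponds to, is a stride-2 1×1 convolution on the skip path. A minimal sketch with illustrative names:

```python
import torch
import torch.nn as nn

class DownsamplingBlock(nn.Module):
    """Basic block whose first 3x3 conv uses stride 2; the shortcut is
    projected with a stride-2 1x1 conv so shapes still match for the add."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=2, padding=1, bias=False)  # halves H and W
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection shortcut: matches both the halved spatial size and the new channel count.
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=2, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.proj(x))

# Example: (1, 64, 56, 56) -> (1, 128, 28, 28)
# y = DownsamplingBlock(64, 128)(torch.randn(1, 64, 56, 56))
```

Note that this projection shortcut does add parameters, which is why the parameter-free claim above holds only when dimensions match.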
Empirical Performance
| Model | Year | Top-5 Error (ImageNet) | Parameters | Depth |
|---|---|---|---|---|
| VGG-19 | 2014 | 7.3% | 143M | 19 |
| GoogLeNet | 2014 | 6.7% | 6.8M | 22 |
| ResNet-152 | 2015 | 3.57% | 60M | 152 |
- ResNet-152: 8× deeper but lower complexity than VGG-19
- Won ILSVRC 2015 classification, detection, and localization, as well as the COCO detection and segmentation competitions