GoogLeNet (also known as Inception v1) by Christian Szegedy et al., Google (2014) - Won 1st place in the ImageNet Challenge 2014 (ILSVRC14) with a 6.67% top-5 error rate. The name “Inception” is a nod to the “we need to go deeper” meme from the movie Inception
Between 2014 and 2016, deep learning faced a central question: “How can CNN performance improve without exploding compute and parameter costs?” Three architectures emerged:
- VGGNet (2014): Simplicity and depth - 138M parameters
- GoogLeNet/Inception (2014): Multi-scale parallelism - 6.8M parameters
- SqueezeNet (2016): Parameter efficiency
Core Design Philosophy: “Why use one filter type when you can use all of them at once?” Apply multiple convolution filters of different sizes in parallel within each module
- Build deep networks that are computationally efficient
- Use multi-scale feature extraction instead of choosing one filter size
- Dramatically reduce parameters compared to VGG (roughly 20x fewer: 6.8M vs 138M)
- Inspired by Network in Network (Lin et al., 2013) concept
Architecture
Architecture Details:
- 22 layers deep (counting only layers with parameters)
- 27 layers if pooling layers included
- About 100 independent building blocks total
Parameter Count: Only 6.8 million parameters vs VGG’s 138 million - showing that efficiency matters more than sheer size.
Three-Part Structure:
- Stem (data ingestion): Initial conv layers downsample the input images
- Body (data processing): Stacked Inception modules perform bulk processing
- Head (prediction): Global average pooling + FC + SoftMax output
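A minimal PyTorch sketch of this stem/body/head layout, for orientation only: the stem filter counts follow the original paper, the body is a placeholder for the stack of Inception modules sketched further below, and the final linear layer is sized to the placeholder rather than the real network’s 1024 channels.

```python
import torch
import torch.nn as nn

class GoogLeNetSkeleton(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.stem = nn.Sequential(                      # data ingestion: downsample the 224x224 input
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Conv2d(64, 64, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 192, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        self.body = nn.Identity()                       # placeholder: stacked Inception modules go here
        self.head = nn.Sequential(                      # prediction: GAP + FC (softmax left to the loss)
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Dropout(0.4),
            nn.Linear(192, num_classes),                # 192 only because the placeholder body keeps channels unchanged
        )

    def forward(self, x):
        return self.head(self.body(self.stem(x)))

logits = GoogLeNetSkeleton()(torch.randn(1, 3, 224, 224))   # -> shape (1, 1000)
```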
Four Parallel Branches: Each Inception module processes input through 4 parallel paths simultaneously:
- Branch 1: 1×1 convolution
- Branch 2: 1×1 convolution → 3×3 convolution
- Branch 3: 1×1 convolution → 5×5 convolution
- Branch 4: 3×3 max pooling → 1×1 convolution
All outputs are concatenated depth-wise (along channel dimension)
Purpose:
- Capture features at multiple scales simultaneously
- Let the network decide which filter size is optimal through training
- Increase representational power without dramatic computation increase
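As a concrete illustration, here is a minimal PyTorch sketch of a single Inception module with the four branches above; the channel counts in the usage line follow the paper’s Inception 3a configuration, while details such as ReLU placement are a reasonable reading of the design rather than a verified reimplementation.

```python
import torch
import torch.nn as nn

# One Inception module: four parallel branches whose outputs are concatenated
# along the channel dimension. Channel counts are constructor arguments; the
# original network uses different values per module.
class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(                       # 1x1 convolution
            nn.Conv2d(in_ch, c1, kernel_size=1), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(                       # 1x1 reduction -> 3x3 convolution
            nn.Conv2d(in_ch, c3_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_reduce, c3, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(                       # 1x1 reduction -> 5x5 convolution
            nn.Conv2d(in_ch, c5_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_reduce, c5, kernel_size=5, padding=2), nn.ReLU(inplace=True))
        self.branch4 = nn.Sequential(                       # 3x3 max pool -> 1x1 projection
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Depth-wise (channel) concatenation of all four branch outputs
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)

# Inception 3a channel split: 64 + 128 + 32 + 32 = 256 output channels
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
out = block(torch.randn(1, 192, 28, 28))                    # -> shape (1, 256, 28, 28)
```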
Revolutionary Techniques:
- 1×1 Convolutions (Bottleneck Layers)
Purpose: Dimensionality reduction before expensive convolutions
Example computation savings:
- Without 1×1: (14×14×48) × (5×5×480) = 112.9M operations
- With 1×1 reduction: (14×14×16) × (1×1×480) + (14×14×48) × (5×5×16) = 5.3M operations
Result: ~20x reduction in computation without sacrificing performance
Additional benefits:
- Adds non-linearity (ReLU activation applied after)
- Dual-purpose: dimension reduction + feature mixing
- Maintains spatial dimensions while compressing channels
- Enables deeper networks without parameter explosion
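These multiply counts can be reproduced with a few lines of plain Python, using the same shapes as in the example above:

```python
# 14x14 output, 480 input channels, 48 output channels, 5x5 kernel;
# the bottleneck first reduces the input to 16 channels with a 1x1 conv.
H = W = 14
in_ch, out_ch, reduce_ch, k = 480, 48, 16, 5

without_bottleneck = (H * W * out_ch) * (k * k * in_ch)
with_bottleneck = (H * W * reduce_ch) * (1 * 1 * in_ch) + (H * W * out_ch) * (k * k * reduce_ch)

print(f"without 1x1: {without_bottleneck / 1e6:.1f}M multiplies")   # ~112.9M
print(f"with 1x1:    {with_bottleneck / 1e6:.1f}M multiplies")      # ~5.3M
print(f"reduction:   {without_bottleneck / with_bottleneck:.1f}x")  # ~21x
```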
- Global Average Pooling
Replaces the traditional fully connected layers at the end of the network:
- Converts 7×7 feature maps to 1×1 by averaging each channel
- Zero additional trainable parameters
- Reduces overfitting dramatically
- Improved top-1 accuracy by ~0.6%
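A short PyTorch sketch of the resulting head, assuming the 7×7×1024 feature maps GoogLeNet produces before its classifier:

```python
import torch
import torch.nn as nn

# Global average pooling collapses each 7x7 feature map to a single value,
# so the classifier needs only one small linear layer (1024 -> 1000).
features = torch.randn(1, 1024, 7, 7)           # final feature maps
pooled = nn.AdaptiveAvgPool2d(1)(features)      # -> (1, 1024, 1, 1), no trainable parameters
logits = nn.Linear(1024, 1000)(pooled.flatten(1))

# Parameter comparison (ignoring biases):
# GAP + linear:       1024 * 1000            ~= 1.0M weights
# flatten + linear:   (7 * 7 * 1024) * 1000  ~= 50.2M weights
```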
- Auxiliary Classifiers
Two intermediate classification branches, attached at roughly 1/3 and 2/3 of the network depth (on top of the Inception 4a and 4d outputs), combat the vanishing gradient problem
Structure of each auxiliary classifier:
- Average pooling (5×5, stride 3)
- 1×1 convolution (128 filters + ReLU)
- Fully connected (1024 units + ReLU)
- Dropout (rate = 0.7)
- SoftMax output (1000 classes)
Loss function: each auxiliary loss is added to the total loss with a discount weight of 0.3
Important: auxiliary classifiers are only used during training and are removed at inference
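A sketch of one auxiliary head and the combined training loss, following the structure listed above; the 4×4 spatial size after pooling assumes the 14×14 feature maps the auxiliary branches attach to, and the 0.3 discount weight is the value reported in the paper.

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    def __init__(self, in_ch, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)    # 14x14 -> 4x4
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)     # 1x1 conv, 128 filters
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)              # fully connected, 1024 units
        self.dropout = nn.Dropout(0.7)
        self.fc2 = nn.Linear(1024, num_classes)              # softmax applied by the loss

    def forward(self, x):
        x = torch.relu(self.conv(self.pool(x)))
        x = torch.relu(self.fc1(x.flatten(1)))
        return self.fc2(self.dropout(x))

# Training-time loss: auxiliary losses added with a 0.3 discount weight;
# at inference only the main classifier is kept.
criterion = nn.CrossEntropyLoss()
def total_loss(main_logits, aux1_logits, aux2_logits, targets):
    return (criterion(main_logits, targets)
            + 0.3 * criterion(aux1_logits, targets)
            + 0.3 * criterion(aux2_logits, targets))
```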
- Local Response Normalization (LRN)
Applied early in the network to normalize feature maps and improve generalization
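For reference, PyTorch exposes LRN directly; the hyperparameters below are the library’s defaults (AlexNet-style), not values taken from the GoogLeNet paper.

```python
import torch
import torch.nn as nn

# Normalizes each activation across neighboring channels at the same spatial position
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=1.0)
normalized = lrn(torch.randn(1, 64, 56, 56))
```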
Performance Comparison Table
| Model | Year | Place | Top-5 Error | Parameters |
|---|---|---|---|---|
| AlexNet (SuperVision) | 2012 | 1st | 15.3% | ~60M |
| Clarifai | 2013 | 1st | 11.2% | - |
| VGG | 2014 | 2nd | 7.32% | 138M |
| GoogLeNet | 2014 | 1st | 6.67% | 6.8M |
Parallel Multi-Scale Processing:
Traditional CNN: Choose one filter size (3×3 OR 5×5 OR 7×7)
GoogLeNet approach: Use ALL filter sizes simultaneously in parallel
Advantages:
- Network learns which scales matter most
- Captures both fine details (1×1, 3×3) and broader context (5×5)
- More robust feature extraction
- Better performance with less computation
Q What Made It “Go Deeper” Successfully? #A
- Smart dimensionality reduction via 1×1 convolutions kept parameters low
- Auxiliary classifiers solved vanishing gradient in middle layers
- Global average pooling eliminated parameter-heavy FC layers
- Multi-scale processing improved feature quality without adding depth