GoogLeNet (retroactively called Inception v1) by Christian Szegedy et al., Google (2014) - won 1st place in the ImageNet Challenge 2014 (ILSVRC14) with a 6.67% top-5 error rate. The name “Inception” comes from the “we need to go deeper” meme derived from the movie Inception
Between 2014 and 2016, deep learning faced a central question: “How do you improve CNN performance without exploding compute and parameter costs?” Three architectures emerged:

  • VGGNet (2014): Simplicity and depth - 138M parameters
  • GoogLeNet/Inception (2014): Multi-scale parallelism - 6.8M parameters
  • SqueezeNet (2016): Parameter efficiency

Core Design Philosophy: “Why use one filter type when you can use all of them at once?” Apply multiple convolution filters of different sizes in parallel within each module

  • Build deep networks that are computationally efficient
  • Use multi-scale feature extraction instead of choosing one filter size
  • Dramatically reduce parameters compared to VGG (roughly 20x fewer: 6.8M vs 138M)
  • Inspired by the Network in Network concept (Lin et al., 2013)

Architecture

Architecture Details:

  • 22 layers deep (counting only layers with parameters)
  • 27 layers if pooling layers included
  • About 100 independent building blocks total

Parameter Count: Only 6.8 million parameters vs VGG’s 138 million - proving that size doesn’t matter, efficiency does.

Three-Part Structure

  1. Stem (data ingestion): Initial conv layers downsample the input image
  2. Body (data processing): Stacked Inception modules perform bulk processing
  3. Head (prediction): Global average pooling + FC + SoftMax output
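
To make the three-part layout concrete, here is a minimal PyTorch-style sketch (an illustration, not the paper’s reference code): the stem channel counts follow the published GoogLeNet layer table, the Inception body is left as a placeholder, and the head is global average pooling plus dropout and a fully connected layer.

```python
import torch.nn as nn

# Stem ("data ingestion"): 224×224×3 input -> 28×28×192 feature maps,
# following the layer table of the GoogLeNet paper
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2, ceil_mode=True),
    nn.LocalResponseNorm(5),
    nn.Conv2d(64, 64, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 192, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(5),
    nn.MaxPool2d(3, stride=2, ceil_mode=True),
)

# Body ("data processing"): 9 stacked Inception modules, which take the
# 28×28×192 stem output down to 7×7×1024 (placeholder here; see the
# Inception module sketch further below)
body = nn.Identity()

# Head ("prediction"): global average pooling + dropout (40%) + FC;
# softmax is applied by the training loss (nn.CrossEntropyLoss)
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(0.4),
    nn.Linear(1024, 1000),
)
```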

Four Parallel Branches: Each Inception module processes input through 4 parallel paths simultaneously:

  • Branch 1: 1×1 convolution
  • Branch 2: 1×1 convolution → 3×3 convolution
  • Branch 3: 1×1 convolution → 5×5 convolution
  • Branch 4: 3×3 max pooling → 1×1 convolution

All outputs are concatenated depth-wise (along channel dimension)

Purpose:

  • Capture features at multiple scales simultaneously
  • Let the network decide which filter size is optimal through training
  • Increase representational power without a dramatic increase in computation
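
A minimal PyTorch sketch of one Inception module (illustrative, not the official implementation): the constructor arguments are the per-branch channel counts, and the instantiation at the bottom uses the counts the paper lists for the first module, inception (3a).

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches whose outputs are concatenated channel-wise."""
    def __init__(self, in_ch, b1, b2_red, b2, b3_red, b3, b4):
        super().__init__()
        self.branch1 = nn.Sequential(                       # 1×1 convolution
            nn.Conv2d(in_ch, b1, 1), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(                       # 1×1 reduce -> 3×3 conv
            nn.Conv2d(in_ch, b2_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(b2_red, b2, 3, padding=1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(                       # 1×1 reduce -> 5×5 conv
            nn.Conv2d(in_ch, b3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(b3_red, b3, 5, padding=2), nn.ReLU(inplace=True))
        self.branch4 = nn.Sequential(                       # 3×3 max pool -> 1×1 conv
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, b4, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Depth-wise concatenation along the channel dimension
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)

# Channel counts of inception (3a): 192 in -> 64 + 128 + 32 + 32 = 256 out
m = InceptionModule(192, b1=64, b2_red=96, b2=128, b3_red=16, b3=32, b4=32)
print(m(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])
```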

Revolutionary Techniques:

  1. 1×1 Convolutions (Bottleneck Layers). Purpose: dimensionality reduction before expensive convolutions. Example computation savings:

    • Without 1×1: (14×14×48) × (5×5×480) = 112.9M operations
    • With 1×1 reduction: (14×14×16) × (1×1×480) + (14×14×48) × (5×5×16) = 5.3M operations

    Result: ~20x reduction in computation without sacrificing performance (a worked check of these numbers appears after this list)

    Additional benefits:

    • Adds non-linearity (ReLU activation applied after)
    • Dual-purpose: dimension reduction + feature mixing
    • Maintains spatial dimensions while compressing channels
    • Enables deeper networks without parameter explosion
  2. Global Average Pooling. Replaces the traditional fully connected layers at the end:

    • Converts 7×7 feature maps to 1×1 by averaging each channel
    • Zero additional trainable parameters
    • Reduces overfitting dramatically
    • Improved top-1 accuracy by ~0.6%
  3. Auxiliary Classifiers. Two intermediate classification branches attached at roughly 1/3 and 2/3 of the network depth to combat the vanishing gradient problem (a sketch appears after this list)

    Structure of each auxiliary classifier:

    • Average pooling (5×5, stride 3)
    • 1×1 convolution (128 filters + ReLU)
    • Fully connected (1024 units + ReLU)
    • Dropout (rate = 0.7)
    • SoftMax output (1000 classes)

    Loss function: the two auxiliary losses are added to the main classifier’s loss with a discount weight of 0.3. Important: auxiliary classifiers are only used during training and are removed at inference

  4. Local Response Normalization (LRN). Applied early in the network to normalize feature maps and improve generalization
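
To make the bottleneck arithmetic from point 1 concrete, the snippet below simply recomputes the multiply counts quoted above (assuming a 14×14 feature map, 480 input channels, 48 output channels for the 5×5 branch, and a 1×1 reduction to 16 channels).

```python
# Multiply count for the 5×5 branch of a mid-network Inception module
H = W = 14
direct     = (H * W * 48) * (5 * 5 * 480)   # 5×5 conv applied to all 480 channels
reduce_1x1 = (H * W * 16) * (1 * 1 * 480)   # 1×1 bottleneck: 480 -> 16 channels
conv_5x5   = (H * W * 48) * (5 * 5 * 16)    # 5×5 conv on the reduced 16 channels
print(f"direct:          {direct / 1e6:.1f}M operations")        # ~112.9M
print(f"with bottleneck: {(reduce_1x1 + conv_5x5) / 1e6:.1f}M")  # ~5.3M (~21x less)
```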
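Likewise, a sketch of one auxiliary classifier matching the structure in point 3 (illustrative; in GoogLeNet the two auxiliary heads attach after inception modules 4a and 4d, whose 14×14 outputs have 512 and 528 channels respectively).

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Training-time-only side head: avg pool -> 1×1 conv -> FC -> dropout -> FC."""
    def __init__(self, in_ch, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(5, stride=3)              # 14×14 -> 4×4
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)   # 128 filters + ReLU
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)            # 1024 units + ReLU
        self.dropout = nn.Dropout(0.7)                     # dropout rate 0.7
        self.fc2 = nn.Linear(1024, num_classes)            # logits; softmax via the loss

    def forward(self, x):
        x = torch.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = self.dropout(torch.relu(self.fc1(x)))
        return self.fc2(x)

# During training the auxiliary losses are added with a 0.3 weight:
#   loss = ce(main_logits, y) + 0.3 * (ce(aux1_logits, y) + ce(aux2_logits, y))
aux1 = AuxClassifier(in_ch=512)   # attached after inception (4a)
aux2 = AuxClassifier(in_ch=528)   # attached after inception (4d)
```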

Performance Comparison Table

Model | Year | Place | Top-5 Error | Parameters
AlexNet (SuperVision) | 2012 | 1st | 15.3% | ~60M
Clarifai | 2013 | 1st | 11.2% | -
VGG | 2014 | 2nd | 7.32% | 138M
GoogLeNet | 2014 | 1st | 6.67% | 6.8M

Parallel Multi-Scale Processing: Traditional CNN: choose one filter size (3×3 OR 5×5 OR 7×7)
GoogLeNet approach: use ALL filter sizes simultaneously in parallel

Advantages:

  • Network learns which scales matter most
  • Captures both fine details (1×1, 3×3) and broader context (5×5)
  • More robust feature extraction
  • Better performance with less computation

Q What Made It “Go Deeper” Successfully? #A

  1. Smart dimensionality reduction via 1×1 convolutions kept parameters low
  2. Auxiliary classifiers solved vanishing gradient in middle layers
  3. Global average pooling eliminated parameter-heavy FC layers
  4. Multi-scale processing improved feature quality without adding depth