SqueezeNet by Forrest Iandola et al., DeepScale, UC Berkeley & Stanford (2016) - Achieved AlexNet-level accuracy with 50x fewer parameters and compressed to <0.5MB model size (510x smaller than AlexNet)
“What if we didn’t need 100+ million parameters to get state-of-the-art accuracy?”
Focus shifted from accuracy alone to efficiency with equivalent accuracy:
- Smaller DNNs require less communication across servers during distributed training
- Smaller DNNs require less bandwidth to export models from cloud to autonomous cars
- Smaller DNNs are more feasible to deploy on FPGAs and hardware with limited memory
Key Innovation:
Introduces the Fire module - a squeeze-and-expand building block that drastically reduces parameters while maintaining accuracy (a minimal sketch follows the design strategies below)
Three Architectural Design Strategies:
- Strategy 1: Replace 3×3 filters with 1×1 filters
- A 1×1 filter has 9x fewer parameters than 3×3
- Strategy 2: Decrease the number of input channels to 3×3 filters
- Use squeeze layers to reduce depth before expensive convolutions
- Strategy 3: Downsample late in the network
- Keep large activation maps longer for higher classification accuracy
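Strategies 1 and 2 come together in the Fire module. Below is a minimal PyTorch sketch (the framework choice and the class name `Fire` are mine, not from the paper); the channel counts in the usage line follow the fire2 row of the table later in this section:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire module: a 1x1 squeeze layer feeding parallel 1x1 and 3x3 expand branches."""
    def __init__(self, in_channels, squeeze, expand1x1, expand3x3):
        super().__init__()
        # Strategy 2: the squeeze layer cuts the channel count seen by the 3x3 filters
        self.squeeze = nn.Conv2d(in_channels, squeeze, kernel_size=1)
        # Strategy 1: half of the expand filters are cheap 1x1 convolutions
        self.expand1x1 = nn.Conv2d(squeeze, expand1x1, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze, expand3x3, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        # Concatenate the two expand branches along the channel dimension
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# fire2-sized module: 16 squeeze channels, 64 + 64 expand channels
fire2 = Fire(in_channels=96, squeeze=16, expand1x1=64, expand3x3=64)
```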
Parameter Calculation:
Standard convolution layer: D×D×M×N parameters (D×D kernel, M input channels, N output channels)
With a Fire module: (1×1×M×s) + (1×1×s×N/2) + (3×3×s×N/2), where s is the number of squeeze filters and the N expand filters are split evenly between the 1×1 and 3×3 branches
Example (D=3, M=256, N=512, SR=0.125, so s = 0.125×512 = 64):
- Standard: 1,179,648 parameters
- Fire module: 180,224 parameters
- Parameter reduction: ~84.7% (the Fire module needs about 15.3% of the standard layer's parameters)
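A quick sanity check of the numbers above in plain Python, assuming SR is defined as squeeze filters divided by total expand filters:

```python
# Worked example: D=3, M=256 input channels, N=512 total expand filters, SR=0.125
D, M, N, SR = 3, 256, 512, 0.125
s = int(SR * N)                                        # 64 squeeze filters

standard = D * D * M * N                               # 1,179,648
fire = (M * s) + (s * N // 2) + (3 * 3 * s * N // 2)   # 16,384 + 16,384 + 147,456 = 180,224

print(standard, fire, f"{1 - fire / standard:.2%}")    # 1179648 180224 84.72%
```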
Squeeze Ratio (SR):
- SqueezeNet uses SR = 0.125, meaning the squeeze layer has 8x fewer channels than the combined expand layers
- Tunable hyperparameter: higher SR = larger model with higher accuracy (gains plateau at larger SR values); lower SR = smaller model
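To make the trade-off concrete, here is a small illustrative sweep over SR for a fire9-sized module (M = 512 input channels, N = 512 total expand filters); the numbers come from the formula above, not from the paper:

```python
# Parameter count of one fire9-sized module as the squeeze ratio varies
M, N = 512, 512
for SR in (0.125, 0.25, 0.5, 0.75):
    s = int(SR * N)
    params = M * s + s * (N // 2) + 9 * s * (N // 2)
    print(f"SR={SR}: squeeze channels={s}, parameters={params:,}")
```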
Architecture
Network Structure: 18 weight layers in total (2 standalone conv layers + 8 Fire modules, each contributing a squeeze layer and an expand layer)
- Layer 1: Regular convolution layer (96 filters, 7×7, stride 2)
- Layers 2-9: Eight Fire modules (fire2 through fire9)
- Layer 10: Regular convolution layer (1000 filters, 1×1)
- Layer 11: Global average pooling → softmax (no learned parameters)
Fire Module Progression: Gradually increase filters per module:
- fire2: 16 squeeze, 64+64 expand
- fire3: 16 squeeze, 64+64 expand
- fire4: 32 squeeze, 128+128 expand
- fire5: 32 squeeze, 128+128 expand
- fire6: 48 squeeze, 192+192 expand
- fire7: 48 squeeze, 192+192 expand
- fire8: 64 squeeze, 256+256 expand
- fire9: 64 squeeze, 256+256 expand
Pooling Strategy (Delayed Downsampling): Max-pooling (3×3, stride 2) applied only after:
- conv1 (layer 1)
- fire4 (layer 4)
- fire8 (layer 8)
Result: Large activation maps maintained throughout most of the network
- No fully connected layers (inspired by Network-in-Network)
- Global average pooling instead of FC layers
- Dropout (p=0.5) after fire9 to reduce overfitting
- ReLU activation applied after all squeeze and expand layers
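Putting the pieces together, here is a sketch of the full macroarchitecture in PyTorch, reusing the imports and the `Fire` class from the sketch above. Layer sizes and pooling positions follow the tables in this section; treat it as an illustration rather than a drop-in copy of the released model (padding choices, for instance, may differ slightly):

```python
class SqueezeNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2),    # conv1
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),        # pool after conv1
            Fire(96, 16, 64, 64),                         # fire2
            Fire(128, 16, 64, 64),                        # fire3
            Fire(128, 32, 128, 128),                      # fire4
            nn.MaxPool2d(kernel_size=3, stride=2),        # pool after fire4
            Fire(256, 32, 128, 128),                      # fire5
            Fire(256, 48, 192, 192),                      # fire6
            Fire(384, 48, 192, 192),                      # fire7
            Fire(384, 64, 256, 256),                      # fire8
            nn.MaxPool2d(kernel_size=3, stride=2),        # pool after fire8
            Fire(512, 64, 256, 256),                      # fire9
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),                            # dropout after fire9
            nn.Conv2d(512, num_classes, kernel_size=1),   # conv10, 1x1 filters
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                      # global average pooling (no FC layers)
        )

    def forward(self, x):
        x = self.classifier(self.features(x))
        return torch.flatten(x, 1)   # raw logits; apply softmax for class probabilities

# Quick shape check on an ImageNet-sized input
print(SqueezeNet()(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])
```

The shape check at the end confirms the channel flow implied by the Fire module table: each module's input channels equal the previous module's combined expand channels, and the classifier maps fire9's 512 channels to 1000 class scores.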