SqueezeNet by Forrest Iandola et al., DeepScale, UC Berkeley & Stanford (2016) - Achieved AlexNet-level accuracy with 50x fewer parameters and compressed to <0.5MB model size (510x smaller than AlexNet)
“What if we didn’t need 100+ million parameters to get state-of-the-art accuracy?”
Focus shifted from accuracy alone to efficiency with equivalent accuracy:
- Smaller DNNs require less communication across servers during distributed training
- Smaller DNNs require less bandwidth to export models from cloud to autonomous cars
- Smaller DNNs are more feasible to deploy on FPGAs and hardware with limited memory
Key Innovation:
Introduces the Fire module - a squeeze-and-expand building block that drastically reduces parameters while maintaining accuracy (a minimal sketch follows the design strategies below)
Three Architectural Design Strategies:
- Strategy 1: Replace 3×3 filters with 1×1 filters
- A 1×1 filter has 9x fewer parameters than 3×3
- Strategy 2: Decrease the number of input channels to 3×3 filters
- Use squeeze layers to reduce depth before expensive convolutions
- Strategy 3: Downsample late in the network
- Keep large activation maps longer for higher classification accuracy
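Strategies 1 and 2 come together in the Fire module. Below is a minimal PyTorch sketch (the framework choice and the class name `Fire` are mine, not from the paper); the channel counts in the usage line follow the fire2 row of the table later in this section:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire module: a 1x1 squeeze layer feeding parallel 1x1 and 3x3 expand branches."""
    def __init__(self, in_channels, squeeze, expand1x1, expand3x3):
        super().__init__()
        # Strategy 2: the squeeze layer cuts the channel count seen by the 3x3 filters
        self.squeeze = nn.Conv2d(in_channels, squeeze, kernel_size=1)
        # Strategy 1: half of the expand filters are cheap 1x1 convolutions
        self.expand1x1 = nn.Conv2d(squeeze, expand1x1, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze, expand3x3, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        # Concatenate the two expand branches along the channel dimension
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# fire2-sized module: 16 squeeze channels, 64 + 64 expand channels
fire2 = Fire(in_channels=96, squeeze=16, expand1x1=64, expand3x3=64)
```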
Parameter Calculation:
Standard convolution layer: D×D×M×N parameters (D×D kernel, M input channels, N output channels)
With a Fire module: (1×1×M×s) + (1×1×s×N/2) + (3×3×s×N/2), where s is the number of squeeze filters and the N expand filters are split evenly between the 1×1 and 3×3 branches
Example (D=3, M=256, N=512, SR=0.125, so s = 0.125×512 = 64):
- Standard: 1,179,648 parameters
- Fire module: 180,224 parameters
- Parameter reduction: ~84.7% (the Fire module needs about 15.3% of the standard layer's parameters)
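A quick sanity check of the numbers above in plain Python, assuming SR is defined as squeeze filters divided by total expand filters:

```python
# Worked example: D=3, M=256 input channels, N=512 total expand filters, SR=0.125
D, M, N, SR = 3, 256, 512, 0.125
s = int(SR * N)                                        # 64 squeeze filters

standard = D * D * M * N                               # 1,179,648
fire = (M * s) + (s * N // 2) + (3 * 3 * s * N // 2)   # 16,384 + 16,384 + 147,456 = 180,224

print(standard, fire, f"{1 - fire / standard:.2%}")    # 1179648 180224 84.72%
```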
Squeeze Ratio (SR):
- SqueezeNet uses SR = 0.125, meaning the squeeze layer has 8x fewer channels than the combined expand layers
- Tunable hyperparameter: higher SR = larger model with higher accuracy (gains plateau at larger SR values); lower SR = smaller model
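To make the trade-off concrete, here is a small illustrative sweep over SR for a fire9-sized module (M = 512 input channels, N = 512 total expand filters); the numbers come from the formula above, not from the paper:

```python
# Parameter count of one fire9-sized module as the squeeze ratio varies
M, N = 512, 512
for SR in (0.125, 0.25, 0.5, 0.75):
    s = int(SR * N)
    params = M * s + s * (N // 2) + 9 * s * (N // 2)
    print(f"SR={SR}: squeeze channels={s}, parameters={params:,}")
```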
Architecture
Network Structure: 18 weight layers in total (2 standalone conv layers + 8 Fire modules, each contributing a squeeze layer and an expand layer)
- Layer 1: Regular convolution layer (96 filters, 7×7, stride 2)
- Layers 2-9: Eight Fire modules (fire2 through fire9)
- Layer 10: Regular convolution layer (1000 filters, 1×1)
- Layer 11: Global average pooling → softmax (no learned parameters)
Fire Module Progression: Gradually increase filters per module:
- fire2: 16 squeeze, 64+64 expand
- fire3: 16 squeeze, 64+64 expand
- fire4: 32 squeeze, 128+128 expand
- fire5: 32 squeeze, 128+128 expand
- fire6: 48 squeeze, 192+192 expand
- fire7: 48 squeeze, 192+192 expand
- fire8: 64 squeeze, 256+256 expand
- fire9: 64 squeeze, 256+256 expand
Pooling Strategy (Delayed Downsampling): Max-pooling (3×3, stride 2) applied only after:
- conv1 (layer 1)
- fire4 (layer 4)
- fire8 (layer 8)
Result: Large activation maps maintained throughout most of the network
- No fully connected layers (inspired by Network-in-Network)
- Global average pooling instead of FC layers
- Dropout (p=0.5) after fire9 to reduce overfitting
- ReLU activation applied after all squeeze and expand layers
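Putting the pieces together, here is a sketch of the full macroarchitecture in PyTorch, reusing the imports and the `Fire` class from the sketch above. Layer sizes and pooling positions follow the tables in this section; treat it as an illustration rather than a drop-in copy of the released model (padding choices, for instance, may differ slightly):

```python
class SqueezeNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2),    # conv1
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),        # pool after conv1
            Fire(96, 16, 64, 64),                         # fire2
            Fire(128, 16, 64, 64),                        # fire3
            Fire(128, 32, 128, 128),                      # fire4
            nn.MaxPool2d(kernel_size=3, stride=2),        # pool after fire4
            Fire(256, 32, 128, 128),                      # fire5
            Fire(256, 48, 192, 192),                      # fire6
            Fire(384, 48, 192, 192),                      # fire7
            Fire(384, 64, 256, 256),                      # fire8
            nn.MaxPool2d(kernel_size=3, stride=2),        # pool after fire8
            Fire(512, 64, 256, 256),                      # fire9
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),                            # dropout after fire9
            nn.Conv2d(512, num_classes, kernel_size=1),   # conv10, 1x1 filters
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                      # global average pooling (no FC layers)
        )

    def forward(self, x):
        x = self.classifier(self.features(x))
        return torch.flatten(x, 1)   # raw logits; apply softmax for class probabilities

# Quick shape check on an ImageNet-sized input
print(SqueezeNet()(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])
```

The shape check at the end confirms the channel flow implied by the Fire module table: each module's input channels equal the previous module's combined expand channels, and the classifier maps fire9's 512 channels to 1000 class scores.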