EfficientNet (2019) by Mingxing Tan and Quoc V. Le, Google Brain — set a new state of the art for the accuracy-efficiency trade-off on ImageNet. A single baseline model (EfficientNet-B0) is scaled uniformly to create a family of 8 models (B0 through B7). EfficientNet-B7 achieves 84.3% ImageNet top-1 accuracy while being 8.4x smaller and 6.1x faster at inference than the previous best ConvNet (GPipe, 557M parameters vs EfficientNet-B7’s 66M)
Before EfficientNet, researchers scaled CNNs by increasing one dimension at a time: depth (more layers), width (more channels), or resolution (larger inputs). EfficientNet answered: “What if we scale all three dimensions together using a principled formula?” The insight: these dimensions are interdependent and must be balanced for optimal accuracy and efficiency
Traditional Scaling (Pick One Dimension)
- Depth scaling only (ResNet-18 → ResNet-200): Vanishing gradients, harder to train
- Width scaling only (Wide ResNets): Quadratic compute increase with minimal accuracy gain
- Resolution scaling only: Expensive (pixel count grows quadratically with input size) without a deeper/wider architecture to exploit the extra detail
- Result: Diminishing returns, suboptimal accuracy-efficiency trade-off
The Key Insight: depth, width, and resolution are interdependent:
- Bigger images → network needs more layers (larger receptive field)
- More layers → network needs more channels (process details better)
- These must grow in balanced proportion for efficiency
Compound Scaling - The Core Innovation
Instead of scaling one dimension, scale all three simultaneously using fixed ratios controlled by a single coefficient φ (phi):
d' = d · α^φ (new depth)
w' = w · β^φ (new width)
r' = r · γ^φ (new resolution)
Constraint (ensures balanced growth):
α · β² · γ² ≈ 2
(Total FLOPs scale roughly as depth · width² · resolution², so this constraint makes each unit increase in φ cost about 2x the FLOPs)
Intuition: If you have 2x more computational budget, increase depth by α^1, width by β^1, and resolution by γ^1, where α, β, γ are discovered constants
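Below is a minimal Python sketch of this rule under the paper's exponents; the function `compound_scale`, its defaults, and the rounding of resolutions are illustrative assumptions, not the official implementation:

```python
# Minimal sketch of compound scaling. ALPHA/BETA/GAMMA are the paper's
# grid-searched exponents; the rest of this snippet is an illustrative assumption.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution exponents

def compound_scale(phi, base_resolution=224):
    """Return (depth mult, width mult, input resolution, ~FLOPs mult) for a given phi."""
    depth_mult = ALPHA ** phi                    # times more layers
    width_mult = BETA ** phi                     # times more channels
    res_mult = GAMMA ** phi                      # times larger input side
    resolution = int(round(base_resolution * res_mult))
    flops_mult = depth_mult * width_mult**2 * res_mult**2  # ~2**phi by the constraint
    return depth_mult, width_mult, resolution, flops_mult

for phi in range(8):  # B0 .. B7
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, input {r}px, ~{f:.1f}x FLOPs")
```

Note that the released B1-B7 configurations adjust these values by hand (e.g. B1 trains at 240px rather than 224 · 1.15 ≈ 258px), so the printed numbers only approximate the table further below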
Finding Optimal Exponents (α, β, γ)
Step 1: Fix φ = 1, assume 2x more resources, and grid search for the best α, β, γ (see the sketch after Step 2):
- Found empirically: α = 1.2, β = 1.1, γ = 1.15
- Verify: 1.2 × 1.1² × 1.15² ≈ 2 ✓
Step 2: Keep α, β, γ constant, vary φ to create B0 through B7:
- φ = 0 → EfficientNet-B0 (baseline)
- φ = 1 → B1 (2x resources)
- φ = 2 → B2 (4x resources)
- … and so on
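Step 1's grid search can be sketched as follows; `train_and_evaluate` is a hypothetical stand-in for training a scaled copy of the baseline and returning its validation accuracy, and the candidate grid and FLOPs tolerance are assumptions for illustration:

```python
# Sketch of Step 1: small grid search for alpha, beta, gamma at phi = 1.
# `train_and_evaluate` is hypothetical: it would train a scaled EfficientNet-B0
# and return validation accuracy. The candidate values and 0.1 tolerance are
# illustrative assumptions, not the paper's exact search settings.
def grid_search(train_and_evaluate, step=0.05):
    best = None
    candidates = [1.0 + step * i for i in range(9)]  # 1.00, 1.05, ..., 1.40
    for alpha in candidates:
        for beta in candidates:
            for gamma in candidates:
                # Keep only combinations that roughly double FLOPs (the constraint).
                if abs(alpha * beta**2 * gamma**2 - 2.0) > 0.1:
                    continue
                acc = train_and_evaluate(alpha, beta, gamma)  # the expensive part
                if best is None or acc > best[0]:
                    best = (acc, alpha, beta, gamma)
    return best  # the paper reports alpha=1.2, beta=1.1, gamma=1.15 as the winner
```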
EfficientNet-B0 - The Baseline
EfficientNet-B0 uses MBConv (Mobile Inverted Bottleneck) blocks as its core building unit, inspired by MobileNetV2, with added Squeeze-and-Excitation (SE) attention
MBConv Block Structure (a PyTorch sketch follows the SE details below):
Step 1: Expand (increase channels)
x → Conv(1×1) → BN → Swish → x_expanded
Step 2: Depthwise Conv (spatial filtering per channel)
x_expanded → DWConv(3×3 or 5×5) → BN → Swish → x_dw
Step 3: Squeeze-and-Excitation (channel attention)
x_dw → GlobalAvgPool → FC → Swish → FC → Sigmoid → scale_weights
x_dw * scale_weights → x_se
Step 4: Project (reduce channels with linear activation)
x_se → Conv(1×1) → BN → x_project
Step 5: Skip Connection (residual if input/output match)
output = x_project + x (if dims match, else just x_project)
Squeeze-and-Excitation (SE) Block Details
Purpose: Learn per-channel importance weights to adaptively recalibrate features
Mechanism:
- Squeeze (global context): Global average pooling reduces spatial dimensions to one value per channel
s_c = (1/(H×W)) · Σ_{i,j} x_{i,j}^c
- Excitation (channel relationships): Two FC layers learn channel interdependencies
z_c = Sigmoid(FC2(Swish(FC1(s)))) ∈ [0,1]
- Scale (recalibrate features):
x_c^scaled = z_c · x_c
Result: Each channel is weighted by its learned importance, improving feature quality with minimal parameter and compute overhead
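A minimal PyTorch sketch of one MBConv block with SE, mirroring the five steps above; the class name, argument defaults, and layer sizes are illustrative assumptions rather than the official EfficientNet code:

```python
# Minimal sketch of an MBConv block with Squeeze-and-Excitation (illustrative,
# not the official implementation).
import torch
import torch.nn as nn

class MBConv(nn.Module):
    def __init__(self, in_ch, out_ch, expand_ratio=6, kernel_size=3, stride=1, se_ratio=0.25):
        super().__init__()
        mid_ch = in_ch * expand_ratio
        self.use_skip = (stride == 1 and in_ch == out_ch)

        # Step 1: Expand channels with a 1x1 conv (skipped when expand_ratio == 1)
        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(),  # Swish
        ) if expand_ratio != 1 else nn.Identity()

        # Step 2: Depthwise conv (spatial filtering, one filter per channel)
        self.dwconv = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, kernel_size, stride, kernel_size // 2,
                      groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(),
        )

        # Step 3: Squeeze-and-Excitation (channel attention)
        se_ch = max(1, int(in_ch * se_ratio))
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # squeeze: global average pool
            nn.Conv2d(mid_ch, se_ch, 1),      # FC1 (as a 1x1 conv)
            nn.SiLU(),
            nn.Conv2d(se_ch, mid_ch, 1),      # FC2
            nn.Sigmoid(),                     # per-channel weights in [0, 1]
        )

        # Step 4: Project back down with a linear (no activation) 1x1 conv
        self.project = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.expand(x)
        out = self.dwconv(out)
        out = out * self.se(out)              # scale features by channel weights
        out = self.project(out)
        # Step 5: residual skip connection when shapes match
        return out + x if self.use_skip else out
```

For example, `MBConv(32, 16, expand_ratio=1)` roughly corresponds to B0's first MBConv stage, which projects the stem's 32 channels down to 16 without expansion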
Scaling B0 to Create B1-B7:
| Model | φ | Depth (α^φ) | Width (β^φ) | Resolution (≈ 224·γ^φ) | Parameters | FLOPs | Top-1 Acc |
|---|---|---|---|---|---|---|---|
| B0 | 0 | 1.0 | 1.0 | 224 | 5.3M | 0.39B | 77.1% |
| B1 | 1 | 1.2 | 1.1 | 240 | 7.8M | 0.71B | 79.1% |
| B2 | 2 | 1.44 | 1.21 | 260 | 9.2M | 1.03B | 80.1% |
| B3 | 3 | 1.73 | 1.33 | 300 | 12M | 1.87B | 81.6% |
| B4 | 4 | 2.07 | 1.47 | 380 | 19M | 4.2B | 82.9% |
| B5 | 5 | 2.48 | 1.62 | 456 | 30M | 9.9B | 83.6% |
| B6 | 6 | 2.98 | 1.78 | 528 | 43M | 19B | 84.0% |
| B7 | 7 | 3.58 | 1.97 | 600 | 66M | 37B | 84.3% |
Q: Why Does Compound Scaling Work?
A:
1. Receptive Field Growth - Receptive field grows with both depth and resolution:
RF_new = RF_old + (num_new_layers · stride_growth) · resolution_factor
Larger images need deeper networks to capture long-range context; shallow networks on high-resolution images waste compute (see the simplified sketch below)
2. Channel Growth Logic - With more layers and larger feature maps, the network benefits from proportionally more channels to avoid information bottlenecks:
optimal_width ∝ √(depth) · √(resolution)
This explains β ≈ 1.1 (modest width increase) but γ ≈ 1.15 (sharper resolution increase)
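As a toy illustration of point 1 (a simplified example assuming a plain stack of stride-1 3×3 convolutions, not EfficientNet's real architecture), the receptive field grows only linearly with depth, which is why larger inputs call for proportionally deeper networks:

```python
# Toy illustration: receptive field of a stack of identical conv layers.
# Assumes stride-1 3x3 convolutions with no downsampling, unlike a real network.
def receptive_field(num_layers, kernel_size=3, stride=1):
    rf, jump = 1, 1                      # start from a single output pixel
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump   # each layer widens the field
        jump *= stride                   # strides would compound the growth
    return rf

for layers in (10, 20, 40):
    print(f"{layers} layers -> {receptive_field(layers)}px receptive field")
# 10 layers -> 21px, 20 layers -> 41px, 40 layers -> 81px
```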
Summary: The Three Scaling Rules
- Fixed Exponents (α=1.2, β=1.1, γ=1.15) balance growth across depth, width, and resolution, found by a small grid search
- Compound Coefficient φ uniformly controls all three dimensions for any resource budget
- Constraint (α·β²·γ² ≈ 2) ensures that each unit increase in φ corresponds to roughly 2x the FLOPs, making scaling to any resource budget predictable
Result: Spend computational budget efficiently across all dimensions simultaneously, not just one. This unlocks a family of models from one baseline that dominate at every accuracy-efficiency point