EfficientNet (2019) by Mingxing Tan and Quoc V. Le, Google Brain — set a new state of the art for the accuracy-efficiency trade-off on ImageNet. A single baseline model (EfficientNet-B0) is scaled uniformly to create a family of 8 models (B0 through B7). EfficientNet-B7 achieves 84.3% ImageNet top-1 accuracy while being 8.4x smaller and 6.1x faster at inference than the previous best ConvNet (GPipe, 557M parameters vs EfficientNet-B7’s 66M)
Before EfficientNet, researchers scaled CNNs by increasing one dimension at a time: depth (more layers), width (more channels), or resolution (larger inputs). EfficientNet asked: “What if we scale all three dimensions together using a principled formula?” The insight: these dimensions are interdependent and must be balanced for optimal accuracy and efficiency

Traditional Scaling (Pick One Dimension)

  • Depth scaling only (ResNet-18 → ResNet-200): Vanishing gradients, harder to train
  • Width scaling only (Wide ResNets): Quadratic compute increase with minimal accuracy gain
  • Resolution scaling only: Expensive (pixel count grows quadratically with input side length) without the depth and width to exploit the extra detail
  • Result: Diminishing returns, suboptimal accuracy-efficiency trade-off

The Key Insight: depth, width, and resolution are interdependent:

  • Bigger images → network needs more layers (larger receptive field)
  • More layers → network needs more channels (process details better)
  • These must grow in balanced proportion for efficiency

Compound Scaling - The Core Innovation

Instead of scaling one dimension, scale all three simultaneously using fixed ratios controlled by a single coefficient φ (phi):

d' = d · α^φ       (new depth)
w' = w · β^φ       (new width)   
r' = r · γ^φ       (new resolution)

Constraint (ensures balanced growth):

α · β² · γ² ≈ 2

Intuition: If you have 2x more computational budget (φ = 1), increase depth by a factor of α, width by β, and resolution by γ; doubling the budget again (φ = 2) squares each factor. The constants α, β, γ are discovered empirically
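As a quick sketch, the three rules are a few lines of Python (the helper name compound_scale is ours, not from the paper; note that the published input resolutions in the table further down are hand-rounded rather than exactly 224·γ^φ):

```python
# A minimal sketch of compound scaling; the helper name is illustrative.
def compound_scale(depth, width, resolution, phi, alpha=1.2, beta=1.1, gamma=1.15):
    return (
        depth * alpha ** phi,       # d' = d · α^φ
        width * beta ** phi,        # w' = w · β^φ
        resolution * gamma ** phi,  # r' = r · γ^φ
    )

# Example: relative depth/width multipliers and input size for φ = 2
print(compound_scale(1.0, 1.0, 224, phi=2))  # ≈ (1.44, 1.21, 296.24)
```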

Finding Optimal Exponents (α, β, γ)

Step 1: Fix φ = 1, assume 2x more resources, grid search for best α, β, γ:

  • Found empirically: α = 1.2, β = 1.1, γ = 1.15
  • Verify: 1.2 × 1.1² × 1.15² ≈ 1.92 ≈ 2 ✓
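A toy version of that Step-1 search (our construction, not the paper’s code) just enumerates candidate triples and keeps those satisfying the constraint; the paper additionally trains B0 with each surviving triple and keeps the most accurate one:

```python
import itertools

# Enumerate (alpha, beta, gamma) candidates and keep those whose FLOPs
# multiplier alpha * beta^2 * gamma^2 is close to 2.
candidates = [round(1.0 + 0.05 * i, 2) for i in range(9)]  # 1.00, 1.05, ..., 1.40

def flops_multiplier(alpha, beta, gamma):
    return alpha * beta**2 * gamma**2

feasible = [
    (a, b, g)
    for a, b, g in itertools.product(candidates, repeat=3)
    if abs(flops_multiplier(a, b, g) - 2.0) < 0.1
]
print(len(feasible), "triples satisfy the constraint")
print(flops_multiplier(1.2, 1.1, 1.15))  # the paper's pick: ≈ 1.92
```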

Step 2: Keep α, β, γ constant, vary φ to create B0 through B7:

  • φ = 0 → EfficientNet-B0 (baseline)
  • φ = 1 → B1 (2x resources)
  • φ = 2 → B2 (4x resources)
  • … and so on
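Holding α, β, γ fixed and sweeping φ generates the whole family’s multipliers (a sketch; the values match the Depth/Width columns of the table below, while the published resolutions are rounded/adjusted rather than exactly 224·γ^φ):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15

# Sweep the compound coefficient to get each model's scaling multipliers
for phi in range(8):  # B0 .. B7
    print(f"B{phi}: depth ×{alpha**phi:.2f}, width ×{beta**phi:.2f}, "
          f"resolution ×{gamma**phi:.2f}")
```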

EfficientNet-B0 - The Baseline

EfficientNet-B0 uses MBConv (Mobile Inverted Bottleneck) blocks as its core building unit, inspired by MobileNetV2 with added Squeeze-and-Excitation (SE) attention
MBConv Block Structure:

Step 1: Expand (increase channels)

x → Conv(1×1) → BN → Swish → x_expanded

Step 2: Depthwise Conv (spatial filtering per channel)

x_expanded → DWConv(3×3 or 5×5) → BN → Swish → x_dw

Step 3: Squeeze-and-Excitation (channel attention)

x_dw → GlobalAvgPool → FC → Swish → FC → Sigmoid → scale_weights
x_dw * scale_weights → x_se

Step 4: Project (reduce channels with linear activation)

x_se → Conv(1×1) → BN → x_project

Step 5: Skip Connection (residual if input/output match)

output = x_project + x   (if dims match, else just x_project)
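Putting the five steps together, here is a minimal PyTorch sketch of an MBConv block (the expansion ratio, kernel size, and SE reduction ratio are illustrative defaults; the real EfficientNet also applies stochastic depth on the residual branch):

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    def __init__(self, in_ch, out_ch, expand_ratio=6, kernel_size=3, stride=1, se_ratio=0.25):
        super().__init__()
        mid_ch = in_ch * expand_ratio
        self.use_residual = stride == 1 and in_ch == out_ch

        # Step 1: expand channels with a 1x1 conv (identity when expand_ratio == 1)
        self.expand = nn.Identity() if expand_ratio == 1 else nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(),  # Swish
        )

        # Step 2: depthwise conv filters each channel spatially
        self.dwconv = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, kernel_size, stride,
                      padding=kernel_size // 2, groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(),
        )

        # Step 3: squeeze-and-excitation channel attention
        se_ch = max(1, int(in_ch * se_ratio))
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid_ch, se_ch, 1), nn.SiLU(),
            nn.Conv2d(se_ch, mid_ch, 1), nn.Sigmoid(),
        )

        # Step 4: project back down with a linear (no activation) 1x1 conv
        self.project = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.expand(x)
        out = self.dwconv(out)
        out = out * self.se(out)  # per-channel recalibration
        out = self.project(out)
        # Step 5: residual connection when input/output shapes match
        return x + out if self.use_residual else out

block = MBConv(16, 24, expand_ratio=6, kernel_size=3, stride=2)
print(block(torch.randn(1, 16, 112, 112)).shape)  # torch.Size([1, 24, 56, 56])
```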

Squeeze-and-Excitation (SE) Block Details

Purpose: Learn per-channel importance weights to adaptively recalibrate features

Mechanism:

  1. Squeeze (global context): Global average pooling reduces spatial dimensions to one value per channel
s_c = (1/(H×W)) · Σ_{i,j} x_{i,j}^c
  2. Excitation (channel relationships): Two FC layers learn channel interdependencies, with a Sigmoid producing weights in [0,1]
z = Sigmoid(FC2(Swish(FC1(s)))),  z_c ∈ [0,1]
  3. Scale (recalibrate features):
x_c^scaled = z_c · x_c

Result: Each channel is weighted by a learned importance score, improving feature quality at negligible parameter and compute cost
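The same three steps as a standalone sketch (the reduction ratio of 4 is illustrative, not the paper’s exact value):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Standalone SE block: squeeze (pool), excite (two FCs), scale (broadcast multiply)
class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                                 # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))                            # squeeze: s_c = mean over H, W
        z = torch.sigmoid(self.fc2(F.silu(self.fc1(s))))  # excite: z_c in [0, 1]
        return x * z[:, :, None, None]                    # scale: z_c · x_c

x = torch.randn(2, 64, 32, 32)
print(SqueezeExcite(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```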

Scaling B0 to Create B1-B7:

| Model | φ | Depth (α^φ) | Width (β^φ) | Resolution (γ^φ) | Parameters | FLOPs | Top-1 Acc |
|-------|---|-------------|-------------|------------------|------------|-------|-----------|
| B0    | 0 | 1.00        | 1.00        | 224              | 5.3M       | 0.39B | 77.1%     |
| B1    | 1 | 1.20        | 1.10        | 240              | 7.8M       | 0.71B | 79.1%     |
| B2    | 2 | 1.44        | 1.21        | 260              | 9.2M       | 1.03B | 80.1%     |
| B3    | 3 | 1.73        | 1.33        | 300              | 12M        | 1.87B | 81.6%     |
| B4    | 4 | 2.07        | 1.46        | 380              | 19M        | 4.2B  | 82.9%     |
| B5    | 5 | 2.49        | 1.61        | 456              | 30M        | 9.9B  | 83.6%     |
| B6    | 6 | 2.99        | 1.77        | 528              | 43M        | 19B   | 84.0%     |
| B7    | 7 | 3.58        | 1.95        | 600              | 66M        | 37B   | 84.3%     |
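To experiment with the family directly, pretrained checkpoints ship with torchvision (a sketch assuming torchvision ≥ 0.13, where the weights-enum API below exists):

```python
import torch
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

# Load a pretrained B0 and run one image at its native 224x224 resolution
model = efficientnet_b0(weights=EfficientNet_B0_Weights.IMAGENET1K_V1).eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000])
print(sum(p.numel() for p in model.parameters()) / 1e6)  # ≈ 5.3M parameters
```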
Q: Why does compound scaling work?
A:
  1. Receptive Field Growth - Receptive field grows with both depth and resolution:
RF_new = RF_old + (num_new_layers · stride_growth) · resolution_factor

Larger images need deeper networks to capture long-range context; shallow networks on high-res images waste compute

  2. Channel Growth Logic - With more layers and larger feature maps, the network benefits from proportionally more channels to avoid information bottlenecks:

optimal_width ∝ √(depth) · √(resolution)

This explains β ≈ 1.1 (modest width increase) but γ ≈ 1.15 (sharper resolution increase)

Summary: The Three Scaling Rules

  1. Fixed Exponents (α=1.2, β=1.1, γ=1.15) balance growth across depth, width, and resolution, based on a small grid search
  2. Compound Coefficient φ uniformly controls all three dimensions for any resource budget
  3. Constraint (α·β²·γ² ≈ 2) ensures 2x resources yield predictable scaling

Result: Spend the computational budget efficiently across all dimensions simultaneously, not just one. This unlocks a family of models, grown from a single baseline, that dominates prior ConvNets across the entire accuracy-efficiency curve