EfficientNet (2019) by Mingxing Tan and Quoc V. Le, Google Brain — set a new state of the art for the accuracy-efficiency trade-off on ImageNet. A single baseline model (EfficientNet-B0) is scaled uniformly to create a family of 8 models (B0 through B7). EfficientNet-B7 achieves 84.3% ImageNet top-1 accuracy while being 8.4x smaller and 6.1x faster at inference than the previous best ConvNet (GPipe, 557M parameters vs EfficientNet-B7’s 66M)
Before EfficientNet, researchers scaled CNNs by increasing one dimension at a time: depth (more layers), width (more channels), or resolution (larger inputs). EfficientNet answered: “What if we scale all three dimensions together using a principled formula?” The insight: these dimensions are interdependent and must be balanced for optimal accuracy and efficiency
Traditional Scaling (Pick One Dimension)
- Depth scaling only (ResNet-18 → ResNet-200): Vanishing gradients, harder to train
- Width scaling only (Wide ResNets): Quadratic compute increase with minimal accuracy gain
- Resolution scaling only: Expensive (pixel count grows quadratically with input size) without a deeper/wider architecture to exploit the extra detail
- Result: Diminishing returns, suboptimal accuracy-efficiency trade-off
The Key Insight: depth, width, and resolution are interdependent:
- Bigger images → network needs more layers (larger receptive field)
- More layers → network needs more channels (process details better)
- These must grow in balanced proportion for efficiency
Compound Scaling - The Core Innovation
Instead of scaling one dimension, scale all three simultaneously using fixed ratios controlled by a single coefficient φ (phi):
d' = d · α^φ (new depth)
w' = w · β^φ (new width)
r' = r · γ^φ (new resolution)
Constraint (ensures balanced growth):
α · β² · γ² ≈ 2
(Total FLOPs scale roughly as depth · width² · resolution², so this constraint makes each unit increase in φ cost about 2x the FLOPs)
Intuition: If you have 2x more computational budget, increase depth by α^1, width by β^1, and resolution by γ^1, where α, β, γ are discovered constants
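Below is a minimal Python sketch of this rule under the paper's exponents; the function `compound_scale`, its defaults, and the rounding of resolutions are illustrative assumptions, not the official implementation:

```python
# Minimal sketch of compound scaling. ALPHA/BETA/GAMMA are the paper's
# grid-searched exponents; the rest of this snippet is an illustrative assumption.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution exponents

def compound_scale(phi, base_resolution=224):
    """Return (depth mult, width mult, input resolution, ~FLOPs mult) for a given phi."""
    depth_mult = ALPHA ** phi                    # times more layers
    width_mult = BETA ** phi                     # times more channels
    res_mult = GAMMA ** phi                      # times larger input side
    resolution = int(round(base_resolution * res_mult))
    flops_mult = depth_mult * width_mult**2 * res_mult**2  # ~2**phi by the constraint
    return depth_mult, width_mult, resolution, flops_mult

for phi in range(8):  # B0 .. B7
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, input {r}px, ~{f:.1f}x FLOPs")
```

Note that the released B1-B7 configurations adjust these values by hand (e.g. B1 trains at 240px rather than 224 · 1.15 ≈ 258px), so the printed numbers only approximate the table further below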
Finding Optimal Exponents (α, β, γ)
Step 1: Fix φ = 1, assume 2x more resources, and grid search for the best α, β, γ (see the sketch after Step 2):
- Found empirically: α = 1.2, β = 1.1, γ = 1.15
- Verify: 1.2 × 1.1² × 1.15² ≈ 2 ✓
Step 2: Keep α, β, γ constant, vary φ to create B0 through B7:
- φ = 0 → EfficientNet-B0 (baseline)
- φ = 1 → B1 (2x resources)
- φ = 2 → B2 (4x resources)
- … and so on
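Step 1's grid search can be sketched as follows; `train_and_evaluate` is a hypothetical stand-in for training a scaled copy of the baseline and returning its validation accuracy, and the candidate grid and FLOPs tolerance are assumptions for illustration:

```python
# Sketch of Step 1: small grid search for alpha, beta, gamma at phi = 1.
# `train_and_evaluate` is hypothetical: it would train a scaled EfficientNet-B0
# and return validation accuracy. The candidate values and 0.1 tolerance are
# illustrative assumptions, not the paper's exact search settings.
def grid_search(train_and_evaluate, step=0.05):
    best = None
    candidates = [1.0 + step * i for i in range(9)]  # 1.00, 1.05, ..., 1.40
    for alpha in candidates:
        for beta in candidates:
            for gamma in candidates:
                # Keep only combinations that roughly double FLOPs (the constraint).
                if abs(alpha * beta**2 * gamma**2 - 2.0) > 0.1:
                    continue
                acc = train_and_evaluate(alpha, beta, gamma)  # the expensive part
                if best is None or acc > best[0]:
                    best = (acc, alpha, beta, gamma)
    return best  # the paper reports alpha=1.2, beta=1.1, gamma=1.15 as the winner
```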
EfficientNet-B0 - The Baseline
EfficientNet-B0 uses MBConv (Mobile Inverted Bottleneck) blocks as its core building unit, inspired by MobileNetV2, with added Squeeze-and-Excitation (SE) attention
MBConv Block Structure (a PyTorch sketch follows the SE details below):
Step 1: Expand (increase channels)
x → Conv(1×1) → BN → Swish → x_expanded
Step 2: Depthwise Conv (spatial filtering per channel)
x_expanded → DWConv(3×3 or 5×5) → BN → Swish → x_dw
Step 3: Squeeze-and-Excitation (channel attention)
x_dw → GlobalAvgPool → FC → Swish → FC → Sigmoid → scale_weights
x_dw * scale_weights → x_se
Step 4: Project (reduce channels with linear activation)
x_se → Conv(1×1) → BN → x_project
Step 5: Skip Connection (residual if input/output match)
output = x_project + x (if dims match, else just x_project)
Squeeze-and-Excitation (SE) Block Details
Purpose: Learn per-channel importance weights to adaptively recalibrate features
Mechanism:
- Squeeze (global context): Global average pooling reduces spatial dimensions to one value per channel
s_c = (1/(H×W)) · Σ_{i,j} x_{i,j}^c
- Excitation (channel relationships): Two FC layers learn channel interdependencies
z_c = Sigmoid(FC2(Swish(FC1(s)))) ∈ [0,1]
- Scale (recalibrate features):
x_c^scaled = z_c · x_c
Result: Each channel is weighted by its learned importance, improving feature quality with minimal parameter and compute overhead
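A minimal PyTorch sketch of one MBConv block with SE, mirroring the five steps above; the class name, argument defaults, and layer sizes are illustrative assumptions rather than the official EfficientNet code:

```python
# Minimal sketch of an MBConv block with Squeeze-and-Excitation (illustrative,
# not the official implementation).
import torch
import torch.nn as nn

class MBConv(nn.Module):
    def __init__(self, in_ch, out_ch, expand_ratio=6, kernel_size=3, stride=1, se_ratio=0.25):
        super().__init__()
        mid_ch = in_ch * expand_ratio
        self.use_skip = (stride == 1 and in_ch == out_ch)

        # Step 1: Expand channels with a 1x1 conv (skipped when expand_ratio == 1)
        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(),  # Swish
        ) if expand_ratio != 1 else nn.Identity()

        # Step 2: Depthwise conv (spatial filtering, one filter per channel)
        self.dwconv = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, kernel_size, stride, kernel_size // 2,
                      groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(),
        )

        # Step 3: Squeeze-and-Excitation (channel attention)
        se_ch = max(1, int(in_ch * se_ratio))
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # squeeze: global average pool
            nn.Conv2d(mid_ch, se_ch, 1),      # FC1 (as a 1x1 conv)
            nn.SiLU(),
            nn.Conv2d(se_ch, mid_ch, 1),      # FC2
            nn.Sigmoid(),                     # per-channel weights in [0, 1]
        )

        # Step 4: Project back down with a linear (no activation) 1x1 conv
        self.project = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.expand(x)
        out = self.dwconv(out)
        out = out * self.se(out)              # scale features by channel weights
        out = self.project(out)
        # Step 5: residual skip connection when shapes match
        return out + x if self.use_skip else out
```

For example, `MBConv(32, 16, expand_ratio=1)` roughly corresponds to B0's first MBConv stage, which projects the stem's 32 channels down to 16 without expansion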
Scaling B0 to Create B1-B7:
| Model | φ | Depth (α^φ) | Width (β^φ) | Resolution (≈ 224·γ^φ) | Parameters | FLOPs | Top-1 Acc |
|---|---|---|---|---|---|---|---|
| B0 | 0 | 1.0 | 1.0 | 224 | 5.3M | 0.39B | 77.1% |
| B1 | 1 | 1.2 | 1.1 | 240 | 7.8M | 0.71B | 79.1% |
| B2 | 2 | 1.44 | 1.21 | 260 | 9.2M | 1.03B | 80.1% |
| B3 | 3 | 1.73 | 1.33 | 300 | 12M | 1.87B | 81.6% |
| B4 | 4 | 2.07 | 1.47 | 380 | 19M | 4.2B | 82.9% |
| B5 | 5 | 2.48 | 1.62 | 456 | 30M | 9.9B | 83.6% |
| B6 | 6 | 2.98 | 1.78 | 528 | 43M | 19B | 84.0% |
| B7 | 7 | 3.58 | 1.97 | 600 | 66M | 37B | 84.3% |
Q: Why Does Compound Scaling Work?
A:
1. Receptive Field Growth - Receptive field grows with both depth and resolution:
RF_new = RF_old + (num_new_layers · stride_growth) · resolution_factor
Larger images need deeper networks to capture long-range context; shallow networks on high-resolution images waste compute (see the simplified sketch below)
2. Channel Growth Logic - With more layers and larger feature maps, the network benefits from proportionally more channels to avoid information bottlenecks:
optimal_width ∝ √(depth) · √(resolution)
This explains β ≈ 1.1 (modest width increase) but γ ≈ 1.15 (sharper resolution increase)
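As a toy illustration of point 1 (a simplified example assuming a plain stack of stride-1 3×3 convolutions, not EfficientNet's real architecture), the receptive field grows only linearly with depth, which is why larger inputs call for proportionally deeper networks:

```python
# Toy illustration: receptive field of a stack of identical conv layers.
# Assumes stride-1 3x3 convolutions with no downsampling, unlike a real network.
def receptive_field(num_layers, kernel_size=3, stride=1):
    rf, jump = 1, 1                      # start from a single output pixel
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump   # each layer widens the field
        jump *= stride                   # strides would compound the growth
    return rf

for layers in (10, 20, 40):
    print(f"{layers} layers -> {receptive_field(layers)}px receptive field")
# 10 layers -> 21px, 20 layers -> 41px, 40 layers -> 81px
```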
Summary: The Three Scaling Rules
- Fixed Exponents (α=1.2, β=1.1, γ=1.15) balance growth across depth, width, and resolution, found by a small grid search
- Compound Coefficient φ uniformly controls all three dimensions for any resource budget
- Constraint (α·β²·γ² ≈ 2) ensures that each unit increase in φ corresponds to roughly 2x the FLOPs, making scaling to any resource budget predictable
Result: Spend computational budget efficiently across all dimensions simultaneously, not just one. This unlocks a family of models from one baseline that dominate at every accuracy-efficiency point