MobileNet is a family of lightweight CNNs designed by Google for efficient on-device vision, built around depthwise separable convolutions to minimize parameters and computation while retaining strong accuracy on mobile and embedded hardware.
Introduced with V1 and refined in V2 via inverted residuals and linear bottlenecks, MobileNet targets low latency, low power, and small model size without sacrificing much accuracy.
The rise of mobile and edge AI created demand for models that run in real time under tight compute, memory, and energy budgets. Architectures like MobileNet meet these constraints by replacing standard convolutions with more efficient operators and exposing scaling knobs that adapt the network to diverse devices.
MobileNet is commonly deployed via TensorFlow Lite and similar runtimes for on-device inference across phones, IoT, and embedded systems.

Depthwise Separable Convolutions

MobileNet factors a standard convolution into:

  1. a depthwise convolution (per-channel spatial filtering), and
  2. a 1×1 pointwise convolution (channel mixing),

drastically cutting parameters and multiply-accumulates compared to full convolutions. In notation:

$$
\text{DSConv}(x) = \text{Conv}_{1\times1}\big(\text{DWConv}_{D_K\times D_K}(x)\big)
$$

where $\text{DWConv}_{D_K\times D_K}$ applies one $D_K \times D_K$ spatial kernel per input channel and the $1\times1$ convolution mixes channels afterward.
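
As a concrete illustration, here is a minimal Keras sketch of one depthwise separable block. The exact layer ordering (BatchNorm and ReLU6 after each convolution) follows common MobileNet implementations and is an assumption here, not a reference recipe:

```python
import tensorflow as tf

def ds_block(x, out_channels, stride=1):
    # Depthwise: one 3x3 spatial filter per input channel, no channel mixing.
    x = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU(max_value=6.0)(x)
    # Pointwise: 1x1 convolution mixes channels and sets the output width.
    x = tf.keras.layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU(max_value=6.0)(x)

inputs = tf.keras.Input(shape=(224, 224, 32))
outputs = ds_block(inputs, out_channels=64)
# Far fewer parameters than a full 3x3, 32->64 convolution (18,432 weights).
print(tf.keras.Model(inputs, outputs).count_params())
```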

Architecture

MobileNet V1 uses a regular convolution as the first layer, then stacks depthwise separable blocks (a 3×3 depthwise convolution followed by a 1×1 pointwise convolution).
Some depthwise layers use stride 2 to downsample spatially, while pointwise layers adjust channel width. It ends with average pooling and a classification layer.
For a 224×224×3 input, the network produces a 7×7×1024 feature map before pooling, relying on the strided first layer and strided depthwise layers instead of separate pooling layers for downsampling.
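
A quick shape check using the stock Keras implementation (assuming TensorFlow is installed; weights are left uninitialized):

```python
import tensorflow as tf

# Headless MobileNet V1: exposes the feature map before pooling and classification.
backbone = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights=None)
print(backbone.output_shape)  # (None, 7, 7, 1024): 224 is halved five times by strided layers
```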

Efficiency Knobs - α and ρ

MobileNet introduces two scaling hyperparameters:

  • Width multiplier $\alpha$: uniformly thins the channel count at each layer.
  • Resolution multiplier $\rho$: reduces the input resolution to cut compute and memory.

Formally, if a layer has $M$ input channels, $N$ output channels, a $D_K \times D_K$ kernel, and a $D_F \times D_F$ feature map,
then the scaled configuration uses $\alpha M$ and $\alpha N$ channels at resolution $\rho D_F \times \rho D_F$, i.e. the per-layer cost becomes:

$$
D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F
$$

This reduces MACs roughly quadratically in $\alpha$ and $\rho$, and parameters roughly quadratically in $\alpha$.
Because the 1×1 pointwise convolutions dominate the compute, MobileNet spends the majority of its time and parameters in channel-mixing operations that map well to optimized GEMM kernels on mobile hardware, significantly reducing latency and energy.
Runtime on constrained devices tracks the total MAC count closely, so reducing channel counts and resolution directly lowers latency.
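
To make the scaling concrete, here is a small self-contained MAC count for a single depthwise separable layer (the layer sizes are illustrative, not taken from any published configuration):

```python
def ds_macs(dk, m, n, df, alpha=1.0, rho=1.0):
    """MACs of one depthwise separable layer under width/resolution multipliers."""
    m, n = int(alpha * m), int(alpha * n)
    df = int(rho * df)
    depthwise = dk * dk * m * df * df   # per-channel spatial filtering
    pointwise = m * n * df * df         # 1x1 channel mixing
    return depthwise + pointwise

baseline = ds_macs(3, 256, 256, 14)              # ~13.3M MACs
thinner  = ds_macs(3, 256, 256, 14, alpha=0.5)   # roughly 4x fewer (alpha^2 in the 1x1 term)
smaller  = ds_macs(3, 256, 256, 14, rho=0.5)     # roughly 4x fewer (rho^2 everywhere)
print(baseline, thinner, smaller)
```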

MobileNet V2

V2 refines V1 with inverted residuals and linear bottlenecks to improve the accuracy–efficiency trade-off while preserving mobile friendliness across tasks like classification and detection.

  • Inverted residual block:
    expand → depthwise → project, with a residual connection when input/output shapes match.
    This enables gradient flow through a thin bottleneck while computing features in an expanded space. In compact form (see the sketch after this list):
    $$
    y = \text{Conv}_{1\times1}^{\text{linear}}\big(\text{DWConv}_{3\times3}\big(\text{Conv}_{1\times1}^{\text{expand}}(x)\big)\big) + x
    $$
    with the shortcut $+\,x$ used when shapes align.
  • Linear bottleneck:
    the projection uses a linear activation (no nonlinearity) to avoid information loss in low-dimensional spaces.
  • ReLU6:
    a ReLU capped at 6, which keeps activations in a bounded range that behaves well under low-precision arithmetic while maintaining efficient inference on mobile accelerators.
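
The sketch referenced above: a minimal Keras version of one inverted residual block. The expansion factor, kernel size, and layer ordering follow the common pattern and are assumptions here rather than a copy of the reference code:

```python
import tensorflow as tf

def inverted_residual(x, out_channels, expansion=6, stride=1):
    in_channels = x.shape[-1]
    h = tf.keras.layers.Conv2D(expansion * in_channels, 1, use_bias=False)(x)   # expand
    h = tf.keras.layers.BatchNormalization()(h)
    h = tf.keras.layers.ReLU(max_value=6.0)(h)                                  # ReLU6
    h = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = tf.keras.layers.BatchNormalization()(h)
    h = tf.keras.layers.ReLU(max_value=6.0)(h)
    h = tf.keras.layers.Conv2D(out_channels, 1, use_bias=False)(h)              # linear projection
    h = tf.keras.layers.BatchNormalization()(h)                                 # no nonlinearity here
    if stride == 1 and in_channels == out_channels:
        h = tf.keras.layers.Add()([x, h])                                       # shortcut when shapes match
    return h

inputs = tf.keras.Input(shape=(56, 56, 24))
outputs = inverted_residual(inputs, out_channels=24)   # stride 1, matching channels -> shortcut applies
block = tf.keras.Model(inputs, outputs)
```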

Practical Structure (V2)

A V2 network begins with a standard stride-2 convolution, then stacks inverted residual blocks with varying expansion factors and strides to downsample, followed by global average pooling and a final classification layer, targeting a 224×224 input for ImageNet benchmarks.
Shortcut connections are used within blocks when the stride is 1 and the input and output channel counts match, improving gradient flow without extra parameters.
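
The stock Keras MobileNetV2 exposes this structure directly, including the width multiplier; a quick look (weights left uninitialized, the alpha value is just an example):

```python
import tensorflow as tf

# MobileNetV2 at reduced width: alpha=0.75 thins the channel count of every block.
model = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), alpha=0.75, weights=None)
model.summary()  # stride-2 stem conv, 17 inverted residual blocks, final 1x1 conv, pooling, classifier
```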

The Maths

  • Depthwise separable composition:
    $$
    \text{DSConv}(x) = \text{Conv}_{1\times1}\big(\text{DWConv}_{D_K\times D_K}(x)\big)
    $$
  • V2 block summary, with residual connection when the stride is 1 and channel counts match:
    $$
    y = \text{Conv}_{1\times1}^{\text{linear}}\big(\text{DWConv}_{3\times3}\big(\text{Conv}_{1\times1}^{\text{expand}}(x)\big)\big) + x
    $$
  • Width and resolution scaling with multipliers $\alpha$ and $\rho$:
    $$
    D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F
    $$
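
For reference, dividing the depthwise separable cost by that of a standard convolution with the same kernel size and channel counts gives the familiar reduction factor (a standard derivation using the symbols above):

$$
\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}
$$

so a 3×3 kernel alone yields roughly an 8–9× saving when $N$ is large.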

MobileNet variants are broadly available in modern frameworks and widely used for mobile classification and detection — especially via TensorFlow Lite and similar runtimes for efficient on-device inference.
While newer versions (such as V3 and V4) exist, V1 and V2 remain canonical baselines for lightweight vision under tight resource constraints.
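
A minimal sketch of that deployment path, assuming a standard TensorFlow installation (the output file name and quantization choice are illustrative):

```python
import tensorflow as tf

# Convert a Keras MobileNetV2 (pretrained ImageNet weights are downloaded) to a TFLite flatbuffer.
model = tf.keras.applications.MobileNetV2(weights="imagenet")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables default post-training quantization
with open("mobilenet_v2.tflite", "wb") as f:
    f.write(converter.convert())
```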