NASNet (2017), by Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le of Google Brain, was the first major CNN architecture designed by an algorithm rather than by human engineers. Using Neural Architecture Search (NAS) with reinforcement learning, Google trained an RNN controller to discover optimal building blocks (cells) on CIFAR-10, then transferred them to ImageNet. NASNet achieved 82.7% top-1 accuracy on ImageNet, surpassing all hand-designed models of the time while using 28% fewer FLOPs than the previous state of the art.
By 2017, human-designed architectures (VGG, ResNet, Inception, DenseNet) had plateaued, and engineers spent months tweaking layer configurations for marginal gains. Google Brain asked: “Can a neural network design better neural networks than humans?” The answer was NASNet, which showed that AutoML could outperform human expertise in architecture engineering.

Architecture

Traditional Process (Human-Designed)

  1. Expert proposes architecture based on intuition
  2. Train on dataset (weeks of GPU time)
  3. Evaluate performance
  4. Manually tweak layers, connections, hyperparameters
  5. Repeat until marginal improvements stop
  6. Result: Time-consuming, biased by human assumptions, limited exploration

NAS Solution (AI-Designed)

  1. RNN controller generates thousands of candidate architectures
  2. Each candidate trained and evaluated automatically
  3. Reinforcement learning updates controller based on validation accuracy
  4. Controller learns to propose better architectures over time
  5. Result: Explores architectural space far beyond human creativity

How NAS Works
NAS has three core components:

  1. Search Space (what architectures are possible)
    • Defines layer types: conv, pooling, separable conv, identity
    • Connection patterns between layers
    • Number of filters, kernel sizes, skip connections (a code sketch of this encoding follows the list)
  2. Search Strategy (how to explore the space)
    • Reinforcement Learning: RNN controller generates architectures
    • Reward signal: Validation accuracy on target dataset
    • Policy gradient: Updates controller to favor high-accuracy designs
  3. Performance Evaluation (how to rank candidates)
    • Train each generated architecture to convergence
    • Measure validation accuracy as reward
    • Feed reward back to controller for learning
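To make this concrete, here is a minimal Python sketch of how such a cell search space could be encoded and sampled. The names SEARCH_SPACE and random_cell are illustrative, not code from the NASNet paper, and the operation list is only a representative subset of the 13 ops used in the actual search.

```python
import random

# Hypothetical encoding of a NASNet-style cell search space.
# Each of the 5 blocks in a cell picks two inputs, an operation for each,
# and a method for combining the two results.
SEARCH_SPACE = {
    "num_blocks": 5,
    "operations": [
        "sep_conv_3x3", "sep_conv_5x5", "sep_conv_7x7",
        "avg_pool_3x3", "max_pool_3x3",
        "conv_1x1", "conv_3x3",
        "identity",
    ],
    "combine_ops": ["add", "concat"],
}

def random_cell(rng: random.Random, num_prev_states: int = 2) -> list:
    """Sample one cell description: for every block, choose two hidden states
    produced so far, an operation to apply to each, and a combiner."""
    cell = []
    num_states = num_prev_states          # outputs of the two previous cells
    for _ in range(SEARCH_SPACE["num_blocks"]):
        cell.append({
            "input_1": rng.randrange(num_states),
            "input_2": rng.randrange(num_states),
            "op_1": rng.choice(SEARCH_SPACE["operations"]),
            "op_2": rng.choice(SEARCH_SPACE["operations"]),
            "combine": rng.choice(SEARCH_SPACE["combine_ops"]),
        })
        num_states += 1                   # each block adds a new hidden state
    return cell

print(random_cell(random.Random(0)))
```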
The full search loop:
Step 1: The controller RNN samples an architecture A from the search space
Step 2: Train architecture A on CIFAR-10 dataset
Step 3: Evaluate A on validation set → accuracy R
Step 4: Use R as reward to update controller via policy gradient
Step 5: Repeat Steps 1-4 for thousands of iterations
Step 6: Select best-performing architecture as final model
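The loop above can be sketched compactly. In the toy version below the controller is a set of independent softmax logits per decision rather than an RNN, and train_and_evaluate returns a random score instead of actually training a candidate on CIFAR-10; the point is only to show the shape of the REINFORCE-style policy-gradient update, not to reproduce the paper's controller.

```python
import numpy as np

rng = np.random.default_rng(0)
OPS = ["sep_conv_3x3", "sep_conv_5x5", "avg_pool_3x3", "max_pool_3x3", "identity"]

# Toy controller: one softmax over OPS for each of the 10 op decisions
# (5 blocks x 2 operations); input and combiner choices are omitted for brevity.
logits = np.zeros((10, len(OPS)))

def sample_architecture():
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    choices = [int(rng.choice(len(OPS), p=p)) for p in probs]
    return choices, probs

def train_and_evaluate(choices):
    # Stand-in for "train the candidate on CIFAR-10 and measure validation
    # accuracy"; a noisy constant keeps this sketch runnable.
    return float(rng.uniform(0.6, 0.9))

baseline, lr = 0.0, 0.1
for step in range(1000):                          # thousands of iterations in practice
    choices, probs = sample_architecture()        # Step 1: sample architecture A
    reward = train_and_evaluate(choices)          # Steps 2-3: train, get accuracy R
    baseline = 0.95 * baseline + 0.05 * reward    # moving-average baseline
    advantage = reward - baseline
    for i, c in enumerate(choices):               # Step 4: policy-gradient update
        grad = -probs[i]                          # gradient of log p(choice) wrt logits
        grad[c] += 1.0                            # = one_hot(choice) - probs
        logits[i] += lr * advantage * grad
```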

Computational Cost: The original NAS required 800 GPUs running for 28 days to find an optimal architecture; NASNet's cell-based search reduced this to 500 GPUs for 4 days

The Transferability Problem

Training full candidate architectures directly on ImageNet is prohibitively expensive (each candidate costs days of GPU time)
Solution: Search Small, Transfer Large

  1. Search on CIFAR-10 (small 32×32 images, fast to train)
  2. Discover optimal cell structures (not full networks)
  3. Transfer cells to ImageNet by stacking more copies
  4. Each cell has independent parameters when stacked

Insight: Good architectural building blocks (cells) generalize across datasets. NASNet uses two types of cells:

  1. Normal Cell
    • Purpose: Feature processing while maintaining spatial dimensions
    • Input and output have same height and width
    • Stacked repeatedly to increase network depth
    • Example: 224×224 → 224×224
  2. Reduction Cell
    • Purpose: Downsampling to reduce spatial dimensions
    • Output height/width = Input height/width ÷ 2
    • Applied at specific positions (similar to pooling layers)
    • Example: 224×224 → 112×112
The cells are stacked into a full network as follows:

Input (e.g., 331×331×3)
  ↓
Stem Convolutions (initial processing)
  ↓
N × Normal Cell
  ↓
Reduction Cell (downsample)
  ↓
N × Normal Cell
  ↓
Reduction Cell (downsample)
  ↓
N × Normal Cell
  ↓
Global Average Pooling
  ↓
Fully Connected + Softmax

N = number of cell repetitions (larger N = deeper network)
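As a rough sketch, the stack above can be expressed as a small helper that lists the cell sequence for a given N and base filter count F. The doubling of the filter count at each reduction cell follows the paper's convention; the helper itself (nasnet_skeleton) is illustrative, not library code.

```python
def nasnet_skeleton(n_repeats: int, base_filters: int):
    """List the (cell_type, filters) sequence for the ImageNet-style stack:
    stem -> N normal cells -> reduction -> N normal -> reduction -> N normal
    -> global average pooling -> fully connected + softmax."""
    filters = base_filters
    layers = [("stem_conv", filters)]
    for stage in range(3):
        layers += [("normal_cell", filters)] * n_repeats
        if stage < 2:                      # two reduction cells between the stages
            filters *= 2                   # filters double when resolution halves
            layers += [("reduction_cell", filters)]
    layers += [("global_avg_pool", filters), ("fc_softmax", None)]
    return layers

# e.g. a NASNet-Mobile-like stack with N=4 and F=44 (see the stacking strategy below)
for cell_type, f in nasnet_skeleton(4, 44):
    print(cell_type, f)
```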
Inside a Cell

Each cell is a directed acyclic graph (DAG) built from 5 blocks, each performing the following steps:

  • Step 1: Select hidden state h_i from previous layers or current cell
  • Step 2: Select second hidden state h_j from previous layers or current cell
  • Step 3: Choose operation to apply to h_i:
    • 3×3 depthwise separable conv
    • 5×5 depthwise separable conv
    • 3×3 average pooling
    • 3×3 max pooling
    • Identity (skip connection)
    • Others
  • Step 4: Choose operation to apply to h_j (same options as Step 3)
  • Step 5: Combine outputs via:
    • Element-wise addition
    • Concatenation along channel dimension

Result: Each block produces a new hidden state that becomes an input candidate for subsequent blocks
Cell Output:
All hidden states that were never selected as an input to another block are concatenated along the channel dimension to form the final cell output
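A minimal PyTorch-style sketch of one such block, using one concrete pair of operations (a 3×3 depthwise separable conv and an identity branch) combined by addition; a full cell would chain five blocks and concatenate the unused hidden states. The class name NASNetBlock is illustrative, not a reference implementation.

```python
import torch
import torch.nn as nn

class NASNetBlock(nn.Module):
    """One block: take two hidden states, apply an operation to each, combine."""
    def __init__(self, channels: int):
        super().__init__()
        # 3x3 depthwise separable conv = depthwise conv + pointwise 1x1 conv
        self.op_i = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.op_j = nn.Identity()            # skip-connection branch

    def forward(self, h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
        # Steps 3-4: apply the chosen op to each input; Step 5: combine by addition
        return self.op_i(h_i) + self.op_j(h_j)

block = NASNetBlock(32)
h = torch.randn(1, 32, 56, 56)
out = block(h, h)                            # new hidden state, same spatial size
```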

NASNet Variants: Mobile to Large

| Model | Input Size | Parameters | FLOPs | Top-1 Acc | Top-5 Acc | Use Case |
|---|---|---|---|---|---|---|
| NASNet-Mobile | 224×224 | 5.3M | 564M | 74.0% | 91.6% | Mobile devices |
| NASNet-A | 224×224 | – | – | 82.7% | 96.2% | Balanced |
| NASNet-Large | 331×331 | 88.9M | 23.8B | 82.7% | 96.2% | High accuracy |
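For hands-on use, pretrained NASNet variants are available in tf.keras.applications (NASNetMobile and NASNetLarge); the short snippet below loads the ImageNet weights and checks the parameter counts against the table above.

```python
import tensorflow as tf

# Pretrained ImageNet weights; NASNetMobile expects 224x224 inputs and
# NASNetLarge expects 331x331 inputs by default.
mobile = tf.keras.applications.NASNetMobile(weights="imagenet")
large = tf.keras.applications.NASNetLarge(weights="imagenet")

print(f"NASNet-Mobile parameters: {mobile.count_params():,}")   # ~5.3M
print(f"NASNet-Large parameters:  {large.count_params():,}")    # ~88.9M
```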

ScheduledDropPath: Critical Regularization

NASNet cells have many parallel paths connecting layers. Without regularization, models overfit badly

  • Standard DropPath: Randomly drop entire paths during training with fixed probability
  • ScheduledDropPath (NASNet innovation): Drop paths with a probability that increases linearly over the course of training:
drop_prob(epoch) ramps from 0.0 at the start of training to its final value at the end (the paper sets the final path keep probability to 0.7, i.e., a final drop probability of about 0.3)

Why it works:

  • Early training: Keep all paths (learn diverse features)
  • Late training: Aggressively drop paths (force ensemble-like regularization)
  • Result: Significantly better generalization on ImageNet

Without ScheduledDropPath: NASNet overfits
With ScheduledDropPath: State-of-the-art accuracy
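A minimal PyTorch-style sketch of the idea is below: one Bernoulli draw per example decides whether a path's output survives, and the drop probability is ramped linearly over training. The paper applies this per path inside each cell, and the exact schedule endpoints are training hyperparameters, so treat this as illustrative.

```python
import torch

def scheduled_drop_prob(epoch: int, total_epochs: int, final_drop_prob: float) -> float:
    """Linear ramp: 0 at the start of training -> final_drop_prob at the end."""
    return final_drop_prob * epoch / total_epochs

def drop_path(x: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    """Drop an entire path with probability drop_prob, rescaling survivors so the
    expected activation is unchanged (one Bernoulli mask per example in the batch)."""
    if not training or drop_prob == 0.0:
        return x
    keep_prob = 1.0 - drop_prob
    mask = torch.bernoulli(
        torch.full((x.size(0), 1, 1, 1), keep_prob, device=x.device)
    )
    return x / keep_prob * mask
```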

Key Architectural Details

Search Space Specification

  • Operations: 13 different ops (various convs, poolings, identity)
  • Connections: Each block connects to any previous hidden state
  • Search scope: Normal cell structure + Reduction cell structure
  • Fixed: Macro-architecture (how cells stack), only cell internals searched

Cell Stacking Strategy

NASNet-Mobile: N=4 cells per stage, F=44 initial filters
NASNet-Large: N=6 cells per stage, F=168 initial filters

Where N = repetitions, F = base filter count

Summary

NASNet proved three critical points:

  1. Automated architecture search outperforms human experts when given sufficient compute
  2. Transferable cells unlock practical NAS (search small, deploy large)
  3. Reinforcement learning effectively explores architectural search spaces

The Trade-off:

  • Pro: State-of-the-art accuracy with less human effort
  • Con: Enormous computational cost (500 GPUs × 4 days)
  • Resolution: Sparked research into efficient NAS (ENAS, DARTS), making AutoML practical