Building a Shallow Neural Network

A shallow neural network with no activation functions will be implemented to classify images from a simple dataset.
In a shallow network, there is an input layer, one or two hidden layers, and an output layer. A deep network has many hidden layers, enabling progressively more abstract feature learning.

flowchart LR
  Input[Input Layer]
  Hidden1[Hidden Layer 1]
  Hidden2[Hidden Layer 2]
  Output[Output Layer]

  Input --> Hidden1 --> Hidden2 --> Output
  • Reading the Dataset: convert JPEGs to ML-ready arrays (a minimal sketch follows the flowchart below):
flowchart TD
  File[Read JPEG File]
  RGB[Convert to RGB Values]
  Scale[Scale to 0–1]
  Resize[Resize to 224×224×3]
  File --> RGB --> Scale --> Resize	
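A minimal sketch of this pipeline using TensorFlow's image ops; the function name and file path are placeholders, not from the original material:

import tensorflow as tf

def load_image(path):
    data = tf.io.read_file(path)                          # read JPEG file
    img = tf.io.decode_jpeg(data, channels=3)             # convert to RGB values
    img = tf.image.convert_image_dtype(img, tf.float32)   # scale to 0–1
    img = tf.image.resize(img, [224, 224])                # resize to 224×224×3
    return img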
  • NumPy arrays run on CPU
  • TensorFlow tensors can leverage GPU for large computations
import numpy as np
import tensorflow as tf
import time
 
n = 10000
A_np = np.random.rand(n, n)              # float64 arrays on the CPU
B_np = np.random.rand(n, n)

# Cast to float32 tensors so TensorFlow can place them on a GPU if one is available
A_tf = tf.convert_to_tensor(A_np, dtype=tf.float32)
B_tf = tf.convert_to_tensor(B_np, dtype=tf.float32)

# Time the NumPy (CPU) matrix multiply
start = time.time()
C_np = np.dot(A_np, B_np)
numpy_time = time.time() - start

# Time the TensorFlow matrix multiply; .numpy() forces the asynchronously
# dispatched op to finish before the clock stops
start = time.time()
C_tf = tf.matmul(A_tf, B_tf)
_ = C_tf.numpy()
tf_time = time.time() - start

print(f"NumPy: {numpy_time:.2f}s, TensorFlow: {tf_time:.2f}s")

Steps for building a Shallow Neural Network

  1. Initialize weights randomly
  2. Compute logits: Z = XW + b
  3. Apply Softmax for class probabilities
  4. Compute cross-entropy loss
  5. Use gradient descent to update W, b (a NumPy sketch of one training iteration follows the flowchart below)
flowchart LR
  X[Input Image Flattened]
  W[Weights]
  b[Bias]
  Z[Compute Z = X·W + b]
  Softmax[Softmax]
  Loss[Cross-Entropy Loss]
  Update[Gradient Descent]

  X --> Z --> Softmax --> Loss --> Update
  W --> Z
  b --> Z
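The steps and flowchart above can be sketched as one training iteration in NumPy; the feature size, class count, and learning rate are illustrative, and X/Y are assumed to be a batch of flattened images with one-hot labels:

import numpy as np

num_features, num_classes, lr = 224 * 224 * 3, 10, 0.1   # illustrative values

# 1. Initialize weights randomly
W = np.random.randn(num_features, num_classes) * 0.01
b = np.zeros(num_classes)

def train_step(X, Y):
    """One gradient-descent step on a batch X (flattened images) with one-hot labels Y."""
    global W, b
    # 2. Compute logits: Z = XW + b
    Z = X @ W + b
    # 3. Softmax for class probabilities (shifted for numerical stability)
    expZ = np.exp(Z - Z.max(axis=1, keepdims=True))
    P = expZ / expZ.sum(axis=1, keepdims=True)
    # 4. Cross-entropy loss
    loss = -np.mean(np.sum(Y * np.log(P + 1e-9), axis=1))
    # 5. Gradient descent to update W, b (gradient of softmax + cross-entropy is P - Y)
    dZ = (P - Y) / len(X)
    W -= lr * (X.T @ dZ)
    b -= lr * dZ.sum(axis=0)
    return loss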
  • Effect of Depth on Accuracy
    • Shallow: Fast to train, limited feature extraction
    • Deep: Higher accuracy on complex data, requires more compute

Building Deep Neural Networks

A deep neural network has an input layer, many hidden layers, and an output layer. Each extra hidden layer enables learning more abstract features.

flowchart LR
  Input[Input Layer]
  Hidden1[Hidden Layer 1]
  Hidden2[Hidden Layer 2]
  Hidden3[Hidden Layer 3]
  HiddenN[Hidden Layer N]
  Output[Output Layer]

  Input --> Hidden1 --> Hidden2 --> Hidden3 --> HiddenN --> Output
  • Reading and Preprocessing Images
    1. Read JPEG file
    2. Convert to RGB values
    3. Scale to 0–1
    4. Resize to 224×224×3
flowchart TD
  File[Read JPEG File]
  RGB[Convert to RGB Values]
  Scale[Scale to 0–1]
  Resize[Resize to 224×224×3]
  File --> RGB --> Scale --> Resize
  • Layers and Activations
    • Flatten image to 1D vector
    • Dense layers with ReLU activation
    • Final layer with SoftMax
num_classes = 10  # set this to the number of target classes in your dataset

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(224, 224, 3)),       # flatten image to 1D vector
    tf.keras.layers.Dense(128, activation='relu'),             # hidden layer with ReLU
    tf.keras.layers.Dense(64, activation='relu'),              # hidden layer with ReLU
    tf.keras.layers.Dense(num_classes, activation='softmax')   # output layer with Softmax
])
  • Training Steps
    1. Initialize weights randomly
    2. Compute logits: Z = XW + b
    3. Apply ReLU on hidden layers
    4. Apply Softmax for class probabilities
    5. Compute cross-entropy loss
    6. Use gradient descent to update W, b (a compile-and-fit sketch follows the flowchart below)
flowchart LR
  X[Input Vector]
  W[Weights]
  b[Bias]
  Z[Z = X·W + b]
  ReLU[ReLU]
  Softmax[SoftMax]
  Loss[Cross-Entropy]
  Update[Gradient Descent]

  X --> Z --> ReLU --> Softmax --> Loss --> Update
  W --> Z
  b --> Z
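Mapping these steps onto the Keras model defined earlier: compiling with a gradient-descent optimizer and a cross-entropy loss, then calling fit, runs exactly this loop. A hedged sketch, where the SGD learning rate and the train_ds/val_ds dataset names are illustrative assumptions:

# Steps 3–4 (ReLU and Softmax) are already built into the model's layers.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),   # step 6: gradient descent
              loss='sparse_categorical_crossentropy',                  # step 5: cross-entropy loss
              metrics=['accuracy'])

# Each fit() epoch repeats: compute logits, apply activations, compute loss, update W and b.
# history = model.fit(train_ds, validation_data=val_ds, epochs=10)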
  • Effect of Network Depth
    • Shallow: Fast training, limited feature extraction
    • Deep: Better accuracy on complex data, needs more compute and tuning
  • Hardware Considerations
    • NumPy arrays run on CPU
    • TensorFlow tensors can leverage GPU for speed
# Compare CPU vs GPU for large matrix multiply
C_np = np.dot(A_np, B_np)        # CPU only
C_tf = tf.matmul(A_tf, B_tf)     # GPU if available

Deep networks build on shallow ones by adding non-linear hidden layers, which lets them learn hierarchical features effectively.

Overfitting in Neural Networks

Overfitting occurs when a model becomes too specialized to the training data, memorizing its noise and irrelevant details instead of learning generalizable patterns, which leads to poor performance on new, unseen images.
There are four ways to prevent overfitting (a combined Keras sketch follows the list):

  1. Regularization:
    • It adds a penalty term to the loss function
    • A neural network makes predictions on the data; when a prediction is far from reality, the loss is high. The loss can take different forms: in regression it is typically mean squared error, and in classification it is cross-entropy loss. Regularization adds a penalty term to this loss function that discourages large or complex parameter values, forcing the model to stay simpler and more general, so it is less likely to learn noise in the training data and better able to perform well on new, unseen data
    • In L1 and L2 regularization, the penalty added to the loss discourages large weights (L2, for example, adds λ·Σw² to the loss)
    
  2. Dropout:
    • It randomly disables a portion of neurons during each training step, forcing the network to learn more robust, generalized features rather than relying on specific neurons. This randomness makes the network less sensitive to individual neurons, prevents complex co-adaptations, and effectively trains a diverse ensemble of smaller networks, all of which improves generalization on unseen data
      • With probability p, a neuron is dropped (set to zero)
      • With probability 1 − p, it is kept and used normally
    • This is done independently for each neuron and for each mini-batch during training. During inference, all neurons are used, but their outputs are scaled accordingly
  3. Early Stopping:
    • It halts training once performance on a separate validation set begins to degrade, rather than continuing for a fixed number of epochs. This stops the model from learning noisy or irrelevant patterns in the training data and preserves its ability to generalize to unseen data. By monitoring validation loss or accuracy, the system identifies the “sweet spot” where the model performs best on new data, and it can even revert to the best weights seen up to that point
  4. Batch Normalization:
    • It adds noise through mini-batch statistics, making the model less sensitive to specific weight combinations and more robust. It also improves gradient flow, allowing the network to reach a more generalized solution faster and with less reliance on a complex model structure that could easily memorize the training data
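A hedged sketch combining all four techniques in one Keras model; the layer sizes, L2 factor, dropout rate, patience, and the train_ds/val_ds dataset names are illustrative choices, not values from the text:

import tensorflow as tf

num_classes = 10  # illustrative

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(224, 224, 3)),
    tf.keras.layers.Dense(128, kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # 1. L2 penalty on weights
    tf.keras.layers.BatchNormalization(),                                            # 4. normalize with mini-batch statistics
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dropout(0.5),                                                    # 2. drop neurons with p = 0.5 during training
    tf.keras.layers.Dense(num_classes, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# 3. Early stopping: halt when validation loss stops improving and keep the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[early_stop])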

Transfer Learning

Many models are already pretrained on large datasets (LLMs are a familiar example); rather than training again from scratch, transfer learning reuses what a pretrained model learned during training (its embeddings) in our own model.
There are two common approaches to transfer learning:

  1. Fixed Feature Extractor (No Fine-tuning)
    • Load pretrained model without classification head
    • Freeze the base model (initially)
    • Add new classification head
  2. With Fine Tuning
    • Load pretrained model without classification head
    • Unfreeze the base model
    • Recompile with smaller learning rate
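A minimal Keras sketch of both approaches; MobileNetV2 as the base model, the Adam learning rates, and the train_ds dataset name are illustrative assumptions:

import tensorflow as tf

num_classes = 10  # illustrative

# 1. Fixed feature extractor: pretrained base without its classification head, frozen.
base_model = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                               include_top=False,       # drop the original classification head
                                               weights='imagenet')
base_model.trainable = False                                            # freeze the base (initially)

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_classes, activation='softmax')            # new classification head
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# model.fit(train_ds, epochs=5)

# 2. Fine-tuning: unfreeze the base and recompile with a smaller learning rate.
base_model.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),                 # smaller LR for gentle updates
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# model.fit(train_ds, epochs=5)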
Learning Rate Management
  • Learning Rate Rescheduling
    • Problem: Need large LR initially for quick adaptation, then smaller LR for careful fine-tuning
    • Solution: Decay learning rate during training
    • Risk: Too high = oscillation, too low = slow convergence
  • Differential Learning Rates
    • Problem: Base model (already trained) vs new classifier head (needs fast learning)
    • Solution: Different learning rates for different parts:
      • Small LR for pretrained base model (gentle updates)
      • Larger LR for new classifier head (fast learning)
  • Schedulers Available
    • Step decay scheduler
    • Exponential decay scheduler
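A sketch of the two schedulers named above using tf.keras.optimizers.schedules.ExponentialDecay (staircase=True turns it into step decay); the decay numbers are illustrative:

import tensorflow as tf

# Exponential decay: the learning rate shrinks smoothly as training progresses.
exp_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)

# Step decay: the same schedule with staircase=True drops the LR in discrete steps.
step_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.5, staircase=True)

optimizer = tf.keras.optimizers.Adam(learning_rate=exp_schedule)

For differential learning rates, one possible (not the only) approach is a custom training step with two optimizers, one per variable group; the tiny base/head split, layer sizes, and learning rates below are illustrative:

inputs = tf.keras.Input(shape=(32,))                                        # illustrative input size
base_layer = tf.keras.layers.Dense(64, activation='relu', name='base')      # stands in for a pretrained base
head_layer = tf.keras.layers.Dense(10, activation='softmax', name='head')   # new classifier head
model = tf.keras.Model(inputs, head_layer(base_layer(inputs)))

base_opt = tf.keras.optimizers.SGD(learning_rate=1e-4)   # small LR: gentle updates for the base
head_opt = tf.keras.optimizers.SGD(learning_rate=1e-2)   # larger LR: fast learning for the head
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    base_vars, head_vars = base_layer.trainable_variables, head_layer.trainable_variables
    grads = tape.gradient(loss, base_vars + head_vars)
    base_opt.apply_gradients(zip(grads[:len(base_vars)], base_vars))
    head_opt.apply_gradients(zip(grads[len(base_vars):], head_vars))
    return loss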
Decision Framework
  • Freeze vs Fine-tune Decision Based On:
    • Size of your dataset
    • Similarity between your task and pretrained task