Summary - Advanced Deep Learning

Notebook Cell
import torch
import torch.nn as nn
import torchvision
from torchinfo import summary
from torcheval.metrics import MulticlassAccuracy

import numpy as np

import wandb

Network Architecture Design Patterns

Output Size Decrease

Problem: The input image (e.g., 1000x1000) is too large to feed directly into a classifier (MLP), which would result in millions of parameters.

Why: High resolution contains redundant information; we need to condense features.

Solution: Use Pooling (Max/Average) or Strided Convolutions. Strided convolutions are preferred in modern networks as they allow the network to learn how to downsample adaptively.

# Option A: Max Pooling
torch.nn.MaxPool2d(kernel_size=2, stride=2)
# Option B: Strided Convolution (Preferred)
torch.nn.Conv2d(in_channels=64, out_channels=128,
  kernel_size=3, stride=2, padding=1)
Output
Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))

Normalization

Problem: Deep networks are difficult to train and may not converge.

Why: Data distribution shifts as it travels through layers (internal covariate shift), and inputs to later layers are not normalized.

Solution: Batch Normalization. It learns to normalize data (mean 0, std 1) inside the network, enabling higher learning rates and faster training.

torch.nn.Sequential(
  torch.nn.Conv2d(64, 64, kernel_size=3, padding=1),
  torch.nn.BatchNorm2d(64), # Batch Normalization
  torch.nn.ReLU())
Output
Sequential(
  (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU()
)

Residual Connections

Problem: Adding more layers to a deep network (e.g., >30) can degrade performance and prevent convergence.

Why: Gradients vanish during backpropagation, and it is difficult for layers to learn the identity function (doing nothing) if needed.

Solution: Add the input $x$ to the output of the layer $F(x)$, creating a Skip Connection ($y = F(x) + x$). This allows gradients to flow through the network easily.

class ResidualBlock(torch.nn.Module):
  def forward(self, x):
    identity = x
    out = self.conv_layers(x) # F(x)
    return out + identity     # F(x) + x

Output Size Increase

Problem: Generative models (like autoencoders) need to reconstruct an image from a small latent vector.

Solution: Upsampling (repetition) or Transposed Convolutions. Transposed convolutions learn weights to optimally upsample the data.

# Option A: Upsampling
torch.nn.Upsample(scale_factor=2, mode='nearest')
# Option B: Transposed Convolution (Learnable)
torch.nn.ConvTranspose2d(in_channels=64,
  out_channels=32, kernel_size=3, stride=2)
Output
ConvTranspose2d(64, 32, kernel_size=(3, 3), stride=(2, 2))

Channel Number Decrease (Bottleneck)

Problem: Computational cost is too high when convolutional filters operate on tensors with large depth (e.g., 256 channels).

Solution: Bottleneck Layers. Use a $1 \times 1$ convolution to reduce the number of channels (e.g., to 64), perform the expensive $3 \times 3$ convolution, and then scale back up.

bottleneck = torch.nn.Sequential(
  # Compress:
  torch.nn.Conv2d(256, 64, kernel_size=1),
  # Process:
  torch.nn.Conv2d(64, 64, kernel_size=3, padding=1),
  # Expand:
  torch.nn.Conv2d(64, 256, kernel_size=1))

Deep Networks with Fewer Parameters

Problem: Large filters (e.g., $11 \times 11$) have a massive number of parameters ($11 \times 11 = 121$ weights per channel).

Solution: Stack multiple small filters (e.g., $3 \times 3$). Two stacked $3 \times 3$ layers have a receptive field of $5 \times 5$ but fewer parameters and more non-linearities (activation functions), making training easier.

# Replaces one large 5x5 convolution
stack = torch.nn.Sequential(
  torch.nn.Conv2d(64, 64, kernel_size=3, padding=1),
  torch.nn.ReLU(),
  torch.nn.Conv2d(64, 64, kernel_size=3, padding=1),
  torch.nn.ReLU())

Multiple Resolution (Inception)

Problem: It is unclear which kernel size ($1 \times 1$, $3 \times 3$, or $5 \times 5$) is best for a specific feature.

Solution: Inception Modules. Compute convolutions with different kernel sizes in parallel and concatenate the results.

class InceptionModule(torch.nn.Module):
  def forward(self, x):
    p1 = self.conv1x1(x)
    p2 = self.conv3x3(x)
    p3 = self.conv5x5(x)
    # Concatenate along channel dimension:
    return torch.cat([p1, p2, p3], dim=1)

Faster Large Convolutions

Problem: Standard 2D convolutions are computationally expensive.

Solution: Separable Convolutions. Spatial Separation: Replace an $N \times N$ filter with a $1 \times N$ and an $N \times 1$ filter. Depth-wise Separation: Use a single filter per channel (Depth-wise) followed by a $1 \times 1$ filter to mix channels (Point-wise).

separable = torch.nn.Sequential(
  # Depthwise: groups=in_channels => 1 filter per channel
  torch.nn.Conv2d(32, 32, kernel_size=3, groups=32),
  # Pointwise: 1x1 conv to mix channels
  torch.nn.Conv2d(32, 64, kernel_size=1))

Share Features (Multi-Head)

Problem: You need multiple outputs (e.g., move selection AND win probability in chess) from the same input.

Solution: Multi-headed Networks. Use a shared “backbone” to extract features, then split into separate “heads” (MLPs) for different tasks. Gradients from both heads help train the backbone.

class MultiHeadNet(torch.nn.Module):
  def forward(self, x):
    features = self.shared_backbone(x)
    class_out = self.classification_head(features)
    reg_out = self.regression_head(features)
    return class_out, reg_out

Generate Sparse Layers (Dropout)

Problem: Overfitting; the network relies too heavily on specific neurons.

Solution: Dropout. Randomly set the output of some neurons to zero during training. This forces the network to generalize and become robust to missing data.

torch.nn.Dropout(p=0.5)
Output
Dropout(p=0.5, inplace=False)

Constrain Model Parameters

Problem: Overfitting due to large weights.

Solution: L2 Regularization (Weight Decay). Add a penalty term to the loss function proportional to the size of the weights, forcing them towards zero.

Notebook Cell
model = torch.nn.TransformerEncoder(torch.nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6)
optimizer = torch.optim.SGD(
   model.parameters(), lr=0.01, weight_decay=1e-5)

PyTorch Training Pipeline

# Make training reproducible:
seed = 1234
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

# Preprocess data:
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor()])
data_train = torchvision.datasets.MNIST(
    root='data/mnist', download=True,
    transform=transform)
# data_test = ...MNIST(...train=False)
Notebook Cell
data_test = torchvision.datasets.MNIST(root='data/mnist', train=False, download=True, transform=transform)

class SimpleCNN(nn.Module):
  def __init__(self):
    super().__init__()
    self.layers = nn.Sequential(
      nn.Conv2d(1, out_channels=4, kernel_size=3),
      nn.ReLU(),
      nn.MaxPool2d(2, 2),
      nn.Conv2d(4, 8, 3),
      nn.ReLU(),
      nn.MaxPool2d(2, 2),
      nn.Conv2d(8, 8, 3),
      nn.ReLU(),
      nn.Flatten(),
      nn.Linear(72, 120),
      nn.ReLU(),
      nn.Linear(120, 10)
    )

  def forward(self, x):
    return self.layers.forward(x)
model = SimpleCNN()
# Split data into train and validation sets:
len_train = int(0.8 * len(data_train))
len_val = len(data_train) - len_train
data_train_subset, data_val_subset =(
  torch.utils.data.random_split(
    data_train, [len_train, len_val]))

# Construct data loaders for data sets:
BATCH_SIZE = 64
data_train_loader = torch.utils.data.DataLoader(
  dataset=data_train_subset, shuffle=True,
  batch_size=BATCH_SIZE)
data_val_loader = torch.utils.data.DataLoader(
  dataset=data_val_subset, shuffle=False,
  batch_size=BATCH_SIZE)
data_test_loader = torch.utils.data.DataLoader(
  data_test, batch_size=64)

wandb.login()
def train(epochs: int, model, loss_fn, optim,
    metrics, device):
  wandb.init(project="mnist-example",
    config={'epochs': epochs,
        'batch_size':
          data_train_loader.batch_size})
  step_count = 0
  model = model.to(device)
  # Training:
  for epoch in range(epochs):
      model.train()
      metrics.reset()
      for step, (inputs, labels) in\
              enumerate(data_train_loader):
          inputs = inputs.to(device)
          labels = labels.to(device)

          # Zero your gradients for every batch!
          optim.zero_grad()

          outputs = model(inputs)
          _, predicted = torch.max(outputs, 1)

          train_loss = loss_fn(outputs, labels)
          train_loss.backward()
          optim.step()

          metrics.update(predicted, labels)
          train_acc = metrics.compute()

          train_metrics = {
              'train/train_loss': train_loss,
              'train/train_acc': train_acc,
              'train/epoch': epoch}

          step_count += 1
          wandb.log(train_metrics, step=step_count)
      # Validation:
      model.eval()
      metrics.reset()
      val_loss = []
      val_steps = 0
      for step, (inputs, labels) in\
              enumerate(data_val_loader):
          inputs = inputs.to(device)
          labels = labels.to(device)
          with torch.no_grad():
              outputs = model(inputs)
              _, predicted = torch.max(outputs, 1)

              val_loss.append(
                loss_fn(outputs, labels).item())
              metrics.update(predicted, labels)
          val_steps += 1

      val_acc = metrics.compute()
      val_loss_mean = np.mean(val_loss)
      val_metrics = {'val/val_loss': val_loss_mean,
                     'val/val_acc' : val_acc}
      wandb.log(val_metrics, step=step_count)
      print(f"Epoch {epoch:02} ...")
  wandb.finish()
model = SimpleCNN()
metrics = MulticlassAccuracy(num_classes=10)
optim = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()
device = 'cuda' if torch.cuda.is_available() else\
    'mps' if torch.backends.mps.is_available() else 'cpu'
epochs = 10

train(epochs, model, loss_fn, optim, metrics, device)
Output
Epoch 00 ...
Epoch 01 ...
Epoch 02 ...
Epoch 03 ...
Epoch 04 ...
Epoch 05 ...
Epoch 06 ...
Epoch 07 ...
Epoch 08 ...
Epoch 09 ...

MLP: Multi Layer Perceptron

Typical use cases:

  • ‘Simple’ problems where the input is already a feature vector

  • Output layers in a CNN after feature extraction

  • Feature transformation (for example after attention layers)

  • Dimensionality reduction

Each node calculates its output $y$ based on the inputs $x$, the weights $w$ (on the edges), a bias value $b$, and the activation function $\sigma$:

$$y = \sigma \left( \sum_{k=1}^{n} w_k x_k + b \right)$$
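A minimal PyTorch sketch of such a network (layer sizes are illustrative; torch.nn is imported as nn in the first cell):

mlp = nn.Sequential(
  nn.Linear(16, 32),   # each output node computes sigma(w . x + b)
  nn.ReLU(),
  nn.Linear(32, 10))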

CNN: Convolutional Neural Network

class SimpleCNN(nn.Module):
  def __init__(self):
    super().__init__()
    self.layers = nn.Sequential(
      nn.Conv2d(1,out_channels=4,kernel_size=3),
      nn.ReLU(),
      nn.MaxPool2d(2, 2), nn.Conv2d(4, 8, 3),
      nn.ReLU(),
      nn.MaxPool2d(2, 2), nn.Conv2d(8, 8, 3),
      nn.ReLU(), nn.Flatten(),
      nn.Linear(72, 120),nn.ReLU(),nn.Linear(120, 10))
  def forward(self, x): return self.layers.forward(x)
summary(SimpleCNN(), input_size=(64, 1, 28, 28))
Output
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
SimpleCNN                                [64, 10]                  --
├─Sequential: 1-1                        --                        --
│    └─Conv2d: 2-1                       [64, 4, 26, 26]           40
│    └─ReLU: 2-2                         [64, 4, 26, 26]           --
│    └─MaxPool2d: 2-3                    [64, 4, 13, 13]           --
│    └─Conv2d: 2-4                       [64, 8, 11, 11]           296
│    └─ReLU: 2-5                         [64, 8, 11, 11]           --
│    └─MaxPool2d: 2-6                    [64, 8, 5, 5]             --
│    └─Conv2d: 2-7                       [64, 8, 3, 3]             584
│    └─ReLU: 2-8                         [64, 8, 3, 3]             --
│    └─Flatten: 2-9                      [64, 72]                  --
│    └─Linear: 2-10                      [64, 120]                 8,760
│    └─ReLU: 2-11                        [64, 120]                 --
│    └─Linear: 2-12                      [64, 10]                  1,210
==========================================================================================
Total params: 10,890
Trainable params: 10,890
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 5.00
==========================================================================================
Input size (MB): 0.20
Forward/backward pass size (MB): 1.98
Params size (MB): 0.04
Estimated Total Size (MB): 2.23
==========================================================================================

Why CNNs?

In deep learning you usually start with raw data and aim to learn features first; based on these learned features, you then solve the actual task (e.g., classification), for which an MLP can again be used.

Fundamental Design Principles

Data Transformation: Standard CNNs typically decrease spatial resolution (width/height) while increasing the number of channels (depth) to extract higher-level features.

Downsampling: Accomplished via Pooling (Max/Average) or Strided Convolutions. Strided convolutions (e.g., stride = 2) are preferred in modern networks as the weights are learnable, allowing the network to adaptively reduce resolution.

Max Pooling (Downsampling)

$$\begin{array}{ccc} \begin{bmatrix} 6 & 2 & 3 & 2 \\ 1 & 4 & 5 & 1 \\ 1 & 2 & 3 & 4 \\ 1 & 0 & 5 & 6 \end{bmatrix} & \xrightarrow{\text{Max Pooling } (2 \times 2)} & \begin{bmatrix} 6 & 5 \\ 2 & 6 \end{bmatrix} \end{array}$$

Max Unpooling (Upsampling)

Max pooling remembers the position of each maximum; max unpooling places each value back at that position, and zeros fill the non-max positions.

$$\begin{array}{cccc} \xrightarrow{\text{Max Pool } (2 \times 2)} & \begin{bmatrix} 6 & 5 \\ 2 & 6 \end{bmatrix} & \xrightarrow{\text{Max Unpool } (2 \times 2)} & \begin{bmatrix} 6 & 0 & 0 & 0 \\ 0 & 0 & 5 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 6 \end{bmatrix} \end{array}$$
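A minimal sketch of the example above: MaxPool2d with return_indices=True remembers the position of each maximum, and MaxUnpool2d places the values back there.

pool = torch.nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = torch.nn.MaxUnpool2d(2, stride=2)
x = torch.tensor([[6., 2., 3., 2.],
                  [1., 4., 5., 1.],
                  [1., 2., 3., 4.],
                  [1., 0., 5., 6.]]).reshape(1, 1, 4, 4)
pooled, indices = pool(x)           # [[6, 5], [2, 6]]
restored = unpool(pooled, indices)  # maxima back in place, zeros elsewhere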

Average Pooling (Downsampling)

$$\begin{array}{ccc} \begin{bmatrix} 6 & 2 & 3 & 2 \\ 1 & 4 & 5 & 1 \\ 1 & 2 & 3 & 4 \\ 1 & 0 & 5 & 6 \end{bmatrix} & \xrightarrow{\text{Avg Pooling } (2 \times 2)} & \begin{bmatrix} 3.25 & 2.75 \\ 1 & 4.5 \end{bmatrix} \end{array}$$

Nearest Neighbor (Upsampling)

Replicates each value into a 2×2 block ($\times 2$ scale).

$$\begin{array}{ccc} \begin{bmatrix} 6 & 5 \\ 2 & 6 \end{bmatrix} & \xrightarrow{\text{Nearest Neighbour}} & \begin{bmatrix} 6 & 6 & 5 & 5 \\ 6 & 6 & 5 & 5 \\ 2 & 2 & 6 & 6 \\ 2 & 2 & 6 & 6 \end{bmatrix} \end{array}$$

Bed of Nails (Upsampling)

Places each value in the top-left corner of its 2×2 block; zeros fill the remaining positions.

$$\begin{array}{ccc} \begin{bmatrix} 6 & 5 \\ 2 & 6 \end{bmatrix} & \xrightarrow{\text{Bed of Nails}} & \begin{bmatrix} 6 & 0 & 5 & 0 \\ 0 & 0 & 0 & 0 \\ 2 & 0 & 6 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \end{array}$$

Transposed Convolution (Upsampling)

A transposed convolution performs the reverse operation of a standard convolution: it maps each input element to multiple output positions. For each input value, the kernel is multiplied and placed at different output locations.

Transposed convolutions can be understood as matrix multiplication. A standard convolution can be represented as $\mathbf{y} = \mathbf{C} \mathbf{x}$, where $\mathbf{C}$ is the convolution matrix (a Toeplitz matrix). The transposed convolution is $\mathbf{y} = \mathbf{C}^T \mathbf{x}$. The transpose relationship gives these operations their name, though they are not true inverses; they only approximate inversion through backpropagation.

Example: 2x2 Input, 3x3 Kernel, Stride 1

$$\begin{array}{cc} \begin{bmatrix} a & b \\ c & d \\ \end{bmatrix} & \begin{bmatrix} w_1 & w_2 & w_3 \\ w_4 & w_5 & w_6 \\ w_7 & w_8 & w_9 \\ \end{bmatrix} \end{array}$$

Process: Each input value is multiplied by the entire kernel and placed at its output position: $a$ is multiplied by the full kernel and placed starting at output position $(0,0)$, $b$ is multiplied and placed at $(0,1)$ (shifted by the stride), $c$ is multiplied and placed at $(1,0)$, and $d$ is multiplied and placed at $(1,1)$.

Overlapping regions are summed, producing a $4 \times 4$ output:

$$\begin{bmatrix} aw_1 & aw_2+bw_1 & aw_3+bw_2 & bw_3 \\ aw_4+cw_1 & aw_5+bw_4+cw_2+dw_1 & \cdots & \cdots \\ aw_7+cw_4 & \cdots & \cdots & \cdots \\ cw_7 & cw_8+dw_7 & cw_9+dw_8 & dw_9 \end{bmatrix}$$
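A quick shape check of this example (random weights rather than the symbolic w1..w9):

tconv = torch.nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, bias=False)
out = tconv(torch.randn(1, 1, 2, 2))
print(out.shape)  # torch.Size([1, 1, 4, 4])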

Training Stability: Residual Connections

Problem: Vanishing/exploding gradients make training very deep networks (e.g., >30 layers) difficult; adding layers can actually degrade performance.

Solution: Residual (Skip) Connections add the original input $x$ to the output of a layer block $F(x)$, resulting in $y = F(x) + x$. This allows gradients to flow more easily through “shortcuts”.
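A minimal sketch of a full residual block (the channel count and the layers inside F(x) are illustrative):

class ResidualBlock(torch.nn.Module):
  def __init__(self, channels: int):
    super().__init__()
    self.conv_layers = torch.nn.Sequential(      # F(x)
      torch.nn.Conv2d(channels, channels, kernel_size=3, padding=1),
      torch.nn.BatchNorm2d(channels),
      torch.nn.ReLU(),
      torch.nn.Conv2d(channels, channels, kernel_size=3, padding=1),
      torch.nn.BatchNorm2d(channels))

  def forward(self, x):
    return torch.relu(self.conv_layers(x) + x)   # y = F(x) + x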

CNN Formulae

class depthwise_separable_conv(nn.Module):
  def __init__(self, in_ch: int, out_ch: int,
      kernel_size = 3,
      padding = 1, bias=False):
    super(depthwise_separable_conv, self).__init__()
    self.depthwise = nn.Conv2d(in_ch, in_ch,
        kernel_size, padding=padding, groups=in_ch,
        bias=bias)
    self.pointwise = nn.Conv2d(in_ch, out_ch,
        kernel_size=1, bias=bias)
  def forward(self, x):
    return self.pointwise(self.depthwise(x))
def bottleneck(in_ch: int, out_ch: int,
    stride, padding):
    btl_ch = out_ch // 4 # bottleneck channels
    return nn.Sequential(
        nn.BatchNorm2d(in_ch),
        nn.ReLU(True),
        # conv 1x1:
        nn.Conv2d(in_ch, btl_ch, kernel_size=1,
          stride=stride, padding=0),
        # conv 3x3:
        nn.BatchNorm2d(btl_ch),
        nn.ReLU(True),
        nn.Conv2d(btl_ch, btl_ch, kernel_size=3,
          stride=1, padding=padding),
        # conv 1x1:
        nn.BatchNorm2d(btl_ch),
        nn.ReLU(True),
        nn.Conv2d(btl_ch, out_ch, kernel_size=1,
          stride=1, padding=0))

Auto-Encoder (AE)

The Auto-Encoder is a neural network designed to learn efficient data codings in an unsupervised manner. It forces the network to learn the most significant features of the data by compressing it into a lower-dimensional space.

Encoder: Compresses the input $x$ into a small latent vector $z$ using convolutional layers and downsampling (e.g., strided convolutions or pooling).

Decoder: Reconstructs the image $\hat{x}$ from the latent vector $z$ using upsampling techniques like Transposed Convolutions or Upsampling layers.

Objective: Minimize the Reconstruction Loss (typically Mean Squared Error) between the input image and the reconstructed output.

Use Case: Dimensionality reduction, feature extraction, and denoising.

class Encoder(nn.Module):
  def __init__(self, latent_dims):
    super(Encoder, self).__init__()
    self.linear1 = nn.Linear(784, 512)
    self.linear2 = nn.Linear(512, latent_dims)

  def forward(self, x):
    x = torch.flatten(x, start_dim=1) # (batch_size, 784)
    x = nn.functional.relu(self.linear1(x))
    return self.linear2(x) # Output latent representation

class Decoder(nn.Module):
  def __init__(self, latent_dims):
    super(Decoder, self).__init__()
    self.linear1 = nn.Linear(latent_dims, 512)
    self.linear2 = nn.Linear(512, 784)

  def forward(self, z):
    z = nn.functional.relu(self.linear1(z))
    z = torch.sigmoid(self.linear2(z)) # Map to [0, 1]
    return z.reshape((-1, 1, 28, 28)) # (batch_size, 1, 28, 28)

class Autoencoder(nn.Module):
  def __init__(self, latent_dims):
    super(Autoencoder, self).__init__()
    self.encoder = Encoder(latent_dims)
    self.decoder = Decoder(latent_dims)

  def forward(self, x):
    z = self.encoder(x)
    return self.decoder(z) # Reconstructed image

Recurrent Architectures

Recurrent Architectures such as RNNs, LSTMs and GRUs process sequences step-by-step using a hidden state $H$.

Vanishing Gradients: Long sequences involve multiplying weights $W$ many times ($W^n$). If $W < 1$, gradients vanish; if $W > 1$, they explode. LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) use “gates” to mitigate this and preserve long-term dependencies. Even so, these architectures face several technical challenges, particularly when dealing with long sequences:

Sequential Bottleneck: Because these architectures process data token-by-token, they are slower and more computationally expensive compared to parallel architectures like Transformers.

Memory Loss: On very long sequences, it remains difficult to maintain “long-term dependencies,” meaning the model may lose critical information from the start of the sequence by the time it reaches the end.

Fixed-Vector Bottleneck: In sequence-to-sequence tasks, forcing an entire input into a single fixed-length hidden state creates a bottleneck that limits performance on complex data.

RNN: Recurrent Neural Network

$$h_t = \tanh(x_t W_{ih}^T + b_{ih} + h_{t-1} W_{hh}^T + b_{hh})$$

where $h_t$ is the hidden state at time $t$, $x_t$ is the input at time $t$, and $h_{t-1}$ is the hidden state of the previous layer at time $t-1$ or the initial hidden state at time 0.

Concept: Designed for processing sequential data (time series, text, audio) where the order matters and input length varies.

Mechanism: Unlike feed-forward networks, RNNs have loops. They maintain a hidden state ($h$) which acts as a short-term memory. At each time step $t$, the network takes the current input $x_t$ and the previous hidden state $h_{t-1}$ to calculate the output and the new hidden state.

Key Issue (Vanishing Gradients): During backpropagation through time (BPTT), gradients are multiplied repeatedly by the weight matrix ($W^n$). For long sequences, if weights are small, gradients vanish to zero (the network stops learning); if large, they explode. This makes standard RNNs bad at learning long-term dependencies.

LSTM: Long Short-Term Memory

$$\begin{aligned} i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\ f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\ g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\ o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\ c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\ h_t &= o_t \odot \tanh(c_t) \end{aligned}$$

where $h_t$ is the hidden state at time $t$, $c_t$ is the cell state at time $t$, $x_t$ is the input at time $t$, $h_{t-1}$ is the hidden state of the layer at time $t-1$ or the initial hidden state at time 0, and $i_t$, $f_t$, $g_t$, $o_t$ are the input, forget, cell, and output gates, respectively. $\sigma$ is the sigmoid function, and $\odot$ is the Hadamard product (element-wise multiplication).

Solution: Designed specifically to fix the vanishing gradient problem and capture long-term dependencies.

Architecture: Introduces a Cell State ($C$) alongside the Hidden State. The Cell State acts as a “highway” for information to flow unchanged if needed.

Gates: Uses sigmoid activation “gates” to control information flow:

  1. Forget Gate: Decides what to throw away from the cell state.

  2. Input Gate: Decides what new information to store in the cell state.

  3. Output Gate: Decides what to output based on the cell state and input.

GRU: Gated Recurrent Unit

$$\begin{aligned} r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\ z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\ n_t &= \tanh(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})) \\ h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1} \end{aligned}$$

where $h_t$ is the hidden state at time $t$, $x_t$ is the input at time $t$, $h_{t-1}$ is the hidden state of the layer at time $t-1$ or the initial hidden state at time 0, and $r_t$, $z_t$, $n_t$ are the reset, update, and new gates, respectively. $\sigma$ is the sigmoid function, and $\odot$ is the Hadamard product.

Concept: A simplified, more efficient variation of the LSTM. Architecture: Merges the Cell State and Hidden State into a single state. Combines the Forget and Input gates into a single Update Gate. Adds a Reset Gate to decide how much past information to forget.

Comparison: GRUs have fewer parameters than LSTMs, making them faster to train and often performing just as well on smaller datasets, though LSTMs may be more powerful for very complex tasks.
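The built-in PyTorch layers share the same interface, which makes them easy to swap; a minimal sketch (sizes are illustrative):

seq = torch.randn(8, 20, 32)                 # 8 sequences, 20 time steps, 32 features
rnn  = nn.RNN(input_size=32, hidden_size=64, batch_first=True)
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
gru  = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
out, h_n        = rnn(seq)                   # out: (8, 20, 64), h_n: (1, 8, 64)
out, (h_n, c_n) = lstm(seq)                  # LSTM additionally returns the cell state c_n
out, h_n        = gru(seq)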

Transformers

The Attention Mechanism

Attention allows a model to “look back” at all positions of an input sequence simultaneously, solving the bottleneck of single-vector encodings.

The Trinity: Query (what I am looking for), Key (what I have), and Value (the information content):

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

Attention Score: The scaled dot-product between $Q$ and $K$ (with $d_k :=$ dimension of $K$); the $\sqrt{d_k}$ factor keeps the variance near 1, ensuring stable gradients regardless of the dimension $d_k$:

$$E = \frac{QK^T}{\sqrt{d_k}}$$

Attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(E\right)V$$

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^d e^{z_j}}, \quad \text{softmax: } \mathbb{R}^d \to (0,1)^d$$
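A minimal sketch of these formulas as a single function (no masking or multiple heads):

import math

def scaled_dot_product_attention(Q, K, V):
  d_k = Q.shape[-1]
  E = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # scaled attention scores
  return torch.softmax(E, dim=-1) @ V           # weighted sum of the values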

Transformer Architecture and its Goals

  • Minimize computational complexity per layer

  • Minimize the path length between pairs of words to facilitate learning of long-range dependencies

  • Maximise the amount of computation that can be parallelized

Self-Attention: $Q$, $K$, and $V$ all stem from the same input sequence via different learnable linear transforms.

Multi-Head Attention: Uses multiple sets of $(Q, K, V)$ to focus on different aspects of the sequence in parallel.

Positional Encoding: Since self-attention is permutation-invariant (order doesn’t matter), unique sine/cosine vectors are added to input embeddings to inject sequence order.

Decoder Training: So that the decoder cannot cheat during parallel training by looking at future tokens, future positions have to be masked out.
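A minimal sketch of such a causal (look-ahead) mask applied to a score matrix:

L = 5
scores = torch.randn(L, L)                              # e.g. Q K^T / sqrt(d_k)
mask = torch.triu(torch.ones(L, L), diagonal=1).bool()  # True above the diagonal
scores = scores.masked_fill(mask, float('-inf'))        # hide future positions
weights = torch.softmax(scores, dim=-1)                 # each row attends only to the past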

Vision Transformer (ViT): Images are split into patches (e.g., $16 \times 16$), which are flattened and treated as tokens in a sequence for the transformer encoder.

Vision Transformer Architecture

Reinforcement Learning (RL)

Taxonomy of Reinforcement Learning Algorithms

Reinforcement Learning Notation

$A_t$: action at time $t$

$S_t$: state at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$

$R_t$: reward at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$

$\pi$: policy (decision-making rule)

$\pi(s)$: action taken in state $s$ under deterministic policy $\pi$

$\pi(a \mid s)$: probability of taking action $a$ in state $s$ under stochastic policy $\pi$

$G_t$: return following time $t$

$v_\pi(s)$: value of state $s$ under policy $\pi$ (expected return)

$v_*(s)$: value of state $s$ under the optimal policy

$q_\pi(s, a)$: value of taking action $a$ in state $s$ under policy $\pi$

$q_*(s, a)$: value of taking action $a$ in state $s$ under the optimal policy

Reinforcement learning methods specify how the agent’s policy is changed as a result of its experience.

Markov Decision Processes (MDP)

Goal: Maximize the Return ($G_t$), the cumulative (often discounted) reward:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$
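A minimal sketch of this sum for a finite list of future rewards (the discount factor is illustrative):

def discounted_return(rewards, gamma=0.99):
  # rewards = [R_{t+1}, R_{t+2}, ...]
  return sum(gamma ** k * r for k, r in enumerate(rewards))

discounted_return([1.0, 0.0, 2.0])  # 1.0 + 0.99**2 * 2.0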

RL Key Algorithms

GNN: Graph Neural Networks

Deep learning architectures designed to process data represented as graphs (nodes and edges), such as molecules, social networks, or road maps. Unlike standard CNNs which operate on fixed grids, GNNs must handle irregular structures and varying neighborhood sizes.

Key Property: GNNs must be Permutation Invariant. The output must remain the same regardless of the order in which nodes are indexed in the input matrices.

Approaches:

Spectral Methods: Define convolutions in the frequency domain using the Graph Laplacian (e.g., GCN)

Spatial / Message Passing: Nodes aggregate information (“messages”) from their direct neighbors to update their own features (e.g., GraphSAGE).

Adjacency Matrix

The Adjacency Matrix ($A$) is a square $N \times N$ matrix (where $N$ is the number of nodes) that mathematically represents the graph’s connectivity. Calculation steps:

  1. Initialize a matrix of zeros with dimensions $N \times N$.

  2. For every connection (edge) between Node $i$ and Node $j$, set the value at row $i$, column $j$ ($A_{ij}$) to 1.

  3. If the graph is undirected, ensure symmetry: if $A_{ij} = 1$, then $A_{ji} = 1$.

  4. Self-Loops: For many GNNs (like GCN), we add self-loops so a node considers its own features during updates. This means setting the diagonal elements to 1 ($A_{ii} = 1$).

Example: Given a graph with 3 nodes where Node 1 is connected to Node 2, and Node 2 is connected to Node 3 (Note: With self-loops added, the diagonal would become ones):

$$\mathbf{A} = \begin{array}{c|ccc} & N_1 & N_2 & N_3 \\ \hline N_1 & 0 & 1 & 0 \\ N_2 & 1 & 0 & 1 \\ N_3 & 0 & 1 & 0 \\ \end{array} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$
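A minimal sketch of these steps for the example graph (np is imported in the first cell):

edges = [(0, 1), (1, 2)]            # Node 1–Node 2, Node 2–Node 3 (0-indexed)
N = 3
A = np.zeros((N, N), dtype=int)
for i, j in edges:
  A[i, j] = A[j, i] = 1             # undirected graph: keep the matrix symmetric
A_hat = A + np.eye(N, dtype=int)    # optional self-loops for GCN-style updates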

GNN for Graph Classification

Graph classification predicts a label for the entire graph (e.g., “Is this molecule toxic?”) rather than for individual nodes. Architecture:

  1. Input: Adjacency Matrix $A$ and Node Features $X$.

  2. GNN Layers (Message Passing): Several layers of graph convolutions update the embeddings for every node based on their neighbors. This captures local structure.

  3. Readout (Global Pooling): A global aggregation step is required to compress all node embeddings into a single Graph Embedding. Common methods include Global Mean Pooling (average of all nodes) or Global Max Pooling.

  4. MLP Classifier: The resulting single graph vector is passed through standard Dense/Linear layers.

  5. Output: A Softmax layer produces the final class probability.

$$\text{Graph} \xrightarrow{\text{GNN}} \{h_1, h_2, \dots, h_N\} \xrightarrow{\text{Pooling}} h_{\text{graph}} \xrightarrow{\text{MLP}} \text{Class}$$
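A minimal sketch of this pipeline with a single message-passing layer (the aggregation rule A_hat X W is one simple choice):

class GraphClassifier(nn.Module):
  def __init__(self, in_dim: int, hidden: int, n_classes: int):
    super().__init__()
    self.W = nn.Linear(in_dim, hidden)
    self.head = nn.Linear(hidden, n_classes)

  def forward(self, A_hat, X):          # A_hat: (N, N) float with self-loops, X: (N, F)
    H = torch.relu(self.W(A_hat @ X))   # message passing: aggregate neighbor features
    h_graph = H.mean(dim=0)             # readout: global mean pooling over nodes
    return self.head(h_graph)           # class logits (softmax for probabilities)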

Explainable AI Nomenclature

Global vs. Local: Global explanations describe a model’s behavior across an entire dataset, while local explanations justify a specific prediction for a single instance.

Intrinsic vs. Post-hoc: Intrinsic models, also called glassbox or white-box models, are transparent by design (e.g., decision trees), whereas post-hoc methods explain “black-box” models after they have been trained.

Model-Agnostic vs. Model-Specific: Model-agnostic techniques can be applied to any architecture because they only use inputs and outputs, while model-specific methods rely on internal components like convolutional feature maps.

Surrogate Models: These are simple, interpretable models trained to approximate a complex model’s behavior within a localized parameter space.

Attributions and Saliency Maps: These terms refer to visualizations, often heatmaps, that highlight which parts of an input most influenced a specific model decision.

Counterfactuals: These provide “what-if” scenarios by identifying the smallest change needed in an input to flip the model’s output.

XAI Challenges

The primary challenges of XAI stem from the fundamental tension between model complexity and transparency: as models become more accurate, they generally become more opaque.

The Missingness Problem: It is difficult to define “nothingness” for a neural network without biasing the output. For example, using a black baseline (zeros) may be highly meaningful in certain data types, like sketches.

The Correlation Problem: Many methods assume features are independent, which can lead to non-physical explanations that do not respect the data’s true distribution.

Saturation and Vanishing Gradients: In well-trained models, the output can be so certain that small changes to individual pixels result in near-zero gradients, making sensitivity-based explanations noisy.

Computational Expense: Techniques like occlusion require thousands of forward passes, while calculating exact Shapley values is mathematically intractable due to the $2^n$ possible feature combinations.

Resolution vs. Semantics: Model-specific methods like CAM and Grad-CAM face a trade-off where deeper layers provide high-level semantic meaning but suffer from low spatial resolution because the feature maps have shrunk.

Sensitivity to Hyperparameters: Explanations can vary drastically based on user-selected values, such as the kernel width in LIME or the chosen baseline in Integrated Gradients.

Glassbox Models

Glassbox or “white box” models are models that are intrinsically interpretable, not requiring additional techniques to understand their decisions. Common examples include linear models, where weights directly represent feature importance, and decision trees. Although these models often sacrifice accuracy for transparency, they serve as critical components in more advanced XAI methods like LIME and CAM.

XAI Plot Based Methods

Individual Conditional Expectation (ICE or ICP) is a local explanation method that visualizes how a model’s prediction changes for a specific instance as one feature is varied. While Partial Dependence Plots (PDPs) show an overall average effect across a dataset, ICE plots a separate curve for each instance to reveal variations that might be hidden by aggregation.

The procedure involves selecting a feature and sweeping its unique values for a single row while keeping all other feature values constant. This allows researchers to identify heterogeneous effects, such as subsets of data that react differently to feature changes compared to the global average.

PDP: Partial Dependence Plot

First-order approximation of feature dependence; assumes features are uncorrelated.

PDP
  1. Select the feature index to analyze (e.g., $i=2$)

  2. Find the unique values of this feature

  3. Order the values of this feature in ascending order

  4. Set the feature to its first value $i_0$ for all instances and calculate the average prediction

  5. Repeat this for $i_1$ to $i_n$, tracing out a curve

Vectorize by replacing the entire column with a single feature value.

Problems: Examples might not be physical because features are most likely not independent. Also, taking the marginal contribution over the entire dataset is likely to hide interesting cases.
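A minimal sketch of the procedure above for tabular data (assumes a scikit-learn-style model.predict and a NumPy feature matrix):

def partial_dependence(model, X, feature_idx):
  values = np.sort(np.unique(X[:, feature_idx]))
  curve = []
  for v in values:
    X_mod = X.copy()
    X_mod[:, feature_idx] = v          # replace the entire column with one value
    curve.append(model.predict(X_mod).mean())
  return values, np.array(curve)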

ICE: Individual Conditional Expectation Plot

Second-order approximation of feature dependence.

ICE Plot
  1. Select the feature index to analyze (e.g., $i=2$)

  2. Find the unique values of this feature

  3. Order the values of this feature in ascending order

  4. Take a single instance and calculate the black-box model output, varying feature $i$ over all its values to trace out a curve

  5. Repeat for a subset of instances

Vectorize by duplicating the instance across an entire matrix and inserting the column of unique values.

Saliency Mapping

Saliency mapping (also known as heatmap or attribution mapping) identifies which parts of an input were most influential in a model’s specific prediction.

Occlusion and Adaptive Occlusion

Occlusion is a simple, model-agnostic method that identifies critical regions by “blacking out” portions of an input and measuring the change in the model’s output (Adaptive Occlusion attempts to minimize the occluded area while maintaining the same prediction as without occlusion). It produces a saliency map where regions that cause the largest prediction swings are highlighted as important. Its main drawbacks include high computational costs and extreme sensitivity to hyperparameters like kernel size and the chosen baseline value (e.g., black vs. noise).
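A minimal occlusion sketch (patch size and baseline are illustrative; assumes H and W are divisible by the patch size and that the model returns class logits):

def occlusion_map(model, image, target_class, patch=8, baseline=0.0):
  # image: (1, C, H, W); larger score drops mean more important regions
  model.eval()
  with torch.no_grad():
    base_score = model(image)[0, target_class].item()
    _, _, H, W = image.shape
    heatmap = torch.zeros(H // patch, W // patch)
    for i in range(0, H, patch):
      for j in range(0, W, patch):
        occluded = image.clone()
        occluded[:, :, i:i + patch, j:j + patch] = baseline
        heatmap[i // patch, j // patch] = base_score - model(occluded)[0, target_class].item()
  return heatmap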

GAP: Global Average Pooling

Global Average Pooling (GAP) is a dimensionality reduction technique that computes the average value across all spatial dimensions of a feature map, reducing a 3D tensor to a 1D vector. It replaces fully connected layers at the end of convolutional neural networks.

Mathematical Definition: For a feature map with shape $(C, H, W)$ where $C$ is the number of channels, $H$ is height, and $W$ is width:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_c[i, j]$$

where $z_c$ is the output for channel $c$, and $f_c[i,j]$ is the value at position $(i,j)$ in channel $c$.

Output shape: $(C,)$, i.e., a single scalar value per channel.
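A minimal sketch of both the manual computation and the equivalent layer:

feature_map = torch.randn(256, 7, 7)                 # (C, H, W)
z = feature_map.mean(dim=(1, 2))                     # GAP by hand -> shape (256,)
gap = nn.AdaptiveAvgPool2d(1)                        # equivalent layer form
z_layer = gap(feature_map.unsqueeze(0)).flatten(1)   # (1, 256), batch dimension added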

CAM: Class Activation Mapping

CAM is model-specific: it requires a particular architecture (Global Average Pooling (GAP) followed by a linear layer). The convolutional feature maps are processed through GAP to create a linear model just before the final prediction. By multiplying these feature maps by their corresponding class weights and summing them (often followed by a ReLU), the model generates a heatmap of the discriminant regions. While semantic quality improves in deeper layers, the resolution of these explanations tends to decrease as the maps get smaller.
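A minimal CAM sketch, assuming you already have the last conv feature maps and the weights of the linear layer that follows GAP:

def class_activation_map(feature_maps, fc_weights, class_idx):
  # feature_maps: (C, H, W), fc_weights: (num_classes, C)
  cam = torch.einsum('c,chw->hw', fc_weights[class_idx], feature_maps)
  return torch.relu(cam)   # keep only regions that vote for the class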

Grad-CAM

Grad-CAM is more flexible than standard CAM because it uses gradients to weigh the importance of feature maps rather than requiring a specific architecture. Importance is assessed by “wiggling” individual feature maps and measuring the sensitivity of the output.

  • Frees up architecture constraints.

  • No need to compromise accuracy, as the final conv layer can feed into any function, not just a softmax.

  • The final feature-map resolution dictates the granularity of the explanation.

  • Tension between high-level semantics and explanatory resolution.

  • The intensity of final heatmaps cannot be compared between different instances.

  • Still requires a CNN.

LIME: Local Interpretable Model-Agnostic Explanations

LIME is a model-agnostic method that provides “local” explanations for individual instances, working across tabular, text, and image data. It operates by randomly sampling data around a specific prediction and weighting those samples using a proximity measure (typically an exponential kernel $\exp(-D(x,z)^2 / \sigma^2)$) so that points closer to the original input are more important. A simple, interpretable surrogate model (such as a linear model) is then trained to mimic the black-box model’s behavior in that local neighborhood.

Basic Idea of LIME


Upper-left: Data with two features, x1 and x2. Classifier f labels class 1 blue and class 0 grey.

Upper-right: We want to explain the yellow data point. Generate new samples (black dots).

Lower-left: Weigh the dots such that dots closer to the yellow point are more important.

Lower-right: Train a surrogate white box model g to label the new data points such that it behaves like our black box model (locally)

Left: Blackbox-Model, Right: Surrogate Model, Bottom: Enforces locality for predictions: If the distance is large, it should be less relevant if the prediction is different.


Steps of LIME:

  1. We want to explain the black box model f, particularly its decision for the central orange point in the original (unprimed) data space. This point can be a sentence, a row, or an image.

  2. Transform the data point into a binary representation, shown here as the dark blue central point in the primed data space.

  3. Randomly sample data from the binary representation to generate a z’ dataset (other blue points).

  4. Map the newly sampled dataset back to the original image space such that we have an unprimed dataset z.

  5. Train a white box model g on the primed dataset such that it has high fidelity, i.e., behaves the same as the black box model f when labelling the new dataset. The loss function can be seen at the bottom of the plot.

  6. Use a proximity measure that places more weight on points similar to the original data point we want to explain (exponential kernel).
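A minimal sketch of these steps for tabular data, skipping the binary representation (scikit-learn's Ridge is one possible white-box surrogate; kernel width and noise scale are illustrative):

from sklearn.linear_model import Ridge

def lime_tabular(f, x, n_samples=500, sigma=1.0, noise=0.5):
  # f maps a batch of rows to scores; x is a 1D NumPy array
  Z = x + np.random.normal(scale=noise, size=(n_samples, x.shape[0]))  # sample around x
  y = f(Z)                                           # black-box predictions
  D = np.linalg.norm(Z - x, axis=1)
  weights = np.exp(-D ** 2 / sigma ** 2)             # exponential proximity kernel
  g = Ridge(alpha=1.0)
  g.fit(Z, y, sample_weight=weights)                 # local linear surrogate
  return g.coef_                                     # local feature attributions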

Local vs. Global Explanations

Local: Explains a specific instance (e.g., “Why was this loan denied?”).

Global: Summarizes feature importance across the entire dataset.

SHAP: Shapley Values

Based on game theory: how to fairly distribute the “payout” (prediction) among “players” (features).

Axiom of Completeness: The sum of all feature attributions must equal the model’s prediction minus the baseline.

Integrated Gradients: A path-integral method that integrates gradients along a line from a baseline (e.g., black image) to the target image to solve the saturation problem (where gradients go to zero in trained models).
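A minimal Integrated Gradients sketch (the step count is illustrative; the model is assumed to return class logits):

def integrated_gradients(model, x, baseline, target_class, steps=50):
  # x, baseline: (1, C, H, W); average gradients along the path, scale by (x - baseline)
  grads = []
  for alpha in torch.linspace(0, 1, steps):
    point = (baseline + alpha * (x - baseline)).requires_grad_(True)
    score = model(point)[0, target_class]
    grads.append(torch.autograd.grad(score, point)[0])
  return (x - baseline) * torch.stack(grads).mean(dim=0)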

Generative AI

Explicit vs. Implicit Density

Explicit: Models the probability $P(x)$ directly (e.g., PixelRNN, VAE).

Implicit: Learns to sample from the distribution without explicitly calculating $P(x)$ (e.g., GANs).

VAE: Variational Auto-Encoder

The VAE is a generative model that extends the AE structure to estimate the probability density of the data. Instead of mapping an input to a fixed point in the latent space, it maps it to a probability distribution and optimizes the Evidence Lower Bound (ELBO). The Encoder approximates the posterior $q(z|x)$, and the KL divergence forces it close to a Gaussian prior.

Latent Space: The encoder outputs parameters for a distribution (mean $\mu$ and variance $\sigma^2$) rather than a single vector. The system then samples $z$ from this distribution to feed the decoder.

Smoothness: The latent space is continuous, allowing for valid interpolation between points (e.g., morphing one image into another), unlike standard AEs.

Trade-off: VAEs are easy to train and fast to sample from, but often produce blurrier images compared to GANs.

Loss Function (ELBO): The VAE optimizes the Evidence Lower Bound (ELBO), which consists of two competing terms:

  1. Reconstruction Loss: Ensures the output resembles the input (Likelihood).

  2. KL Divergence: A regularization term that forces the learned latent distribution $q(z|x)$ to approximate a standard Gaussian prior $p(z)$ (typically $\mathcal{N}(0,1)$).

$$\mathcal{L} = \mathbb{E}_{q}[\log p(x|z)] - D_{KL}(q(z|x) \,\|\, p(z))$$

Equation Insight: If the encoder models a Gaussian distribution, the KL term has a simple analytic expression involving $\mu$ and $\sigma$:

$$D_{KL}(q(z|x) \,\|\, p(z)) = -\frac{1}{2} \sum_{j=1}^{M} \left( 1 + \ln(\sigma_j^2) - \mu_j^2 - \sigma_j^2 \right)$$
  • Inputs: The encoder network directly outputs the mean ($\mu$) and variance ($\sigma^2$) for each latent dimension $j$.

  • Result: This formula penalizes the network if:

    • $\mu$ diverges from 0 (the $-\mu^2$ term).

    • $\sigma^2$ diverges from 1 (the $1 + \ln(\sigma^2) - \sigma^2$ terms).

The reparameterization trick below is a computational shortcut that makes training Variational Autoencoders (VAEs) efficient.

Reparameterization Trick:

The Problem: In a standard VAE, the encoder outputs the parameters of a distribution (mean $\mu$ and variance $\sigma^2$). The network must then sample a latent vector $z$ from this distribution to feed the decoder. This sampling operation is stochastic (random). Standard backpropagation cannot compute gradients through a random node, meaning the encoder’s weights cannot be updated to minimize the error.

The Solution: The reparameterization trick solves this by expressing the random variable $z$ as a deterministic function of the model parameters and an independent source of noise. Instead of sampling $z$ directly from $\mathcal{N}(\mu, \sigma^2)$, the model:

  1. Samples a noise vector $\epsilon$ from a fixed standard normal distribution: $\epsilon \sim \mathcal{N}(0, 1)$

  2. Calculates $z$ using the deterministic formula:

    $$z = \mu + \sigma \cdot \epsilon$$

The randomness is now contained in $\epsilon$, which is treated as a constant input during backpropagation. The latent vector $z$ is now a differentiable function of $\mu$ and $\sigma$, allowing gradients to flow from the decoder back into the encoder.
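A minimal sketch extending the Encoder from the AE section with the reparameterization trick (layer sizes follow that example):

class VariationalEncoder(nn.Module):
  def __init__(self, latent_dims):
    super().__init__()
    self.linear1 = nn.Linear(784, 512)
    self.mu_head = nn.Linear(512, latent_dims)
    self.logvar_head = nn.Linear(512, latent_dims)

  def forward(self, x):
    h = torch.relu(self.linear1(torch.flatten(x, start_dim=1)))
    mu, logvar = self.mu_head(h), self.logvar_head(h)
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)    # noise from N(0, 1), a constant for backprop
    z = mu + sigma * eps             # differentiable in mu and sigma
    return z, mu, logvar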

GAN: Generative Adversarial Network

The MinMax Game: A Generator creates fake images from noise, and a Discriminator tries to distinguish them from real data.

Minmax Loss:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
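A minimal sketch of one training step of this game, using tiny fully connected stand-ins for the generator and discriminator (the non-saturating generator loss is used, as is common in practice):

z_dim = 16
G = nn.Sequential(nn.Linear(z_dim, 784), nn.Tanh())   # noise -> fake "image"
D = nn.Sequential(nn.Linear(784, 1))                  # image -> real/fake logit
g_optim = torch.optim.Adam(G.parameters(), lr=2e-4)
d_optim = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(64, 784)                            # stand-in for a batch of real data
fake = G(torch.randn(64, z_dim))
# Discriminator: push D(real) -> 1 and D(fake) -> 0
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
d_optim.zero_grad(); d_loss.backward(); d_optim.step()
# Generator: try to make the discriminator label the fakes as real
g_loss = bce(D(fake), torch.ones(64, 1))
g_optim.zero_grad(); g_loss.backward(); g_optim.step()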

WGAN: Uses the Earth Mover (Wasserstein) distance to provide a stable gradient even if the generator and data distributions do not overlap.

CycleGAN: Learns style transfer between two domains without paired training data (e.g., horses to zebras) using a Cycle Consistency Loss ($F(G(x)) \approx x$).

SSL: Self-Supervised Learning

Pretext Tasks

The model solves “fake” tasks to learn a backbone representation.

Examples: Predicting image rotation, solving jigsaw puzzles, or colorization.

Success is measured by performance on downstream tasks (e.g., classification) using only a small amount of labeled data.

SimCLR: Contrastive Learning

Method: Create two augmented versions of the same image (positive pair) and maximize their similarity while minimizing similarity with all other images in the batch (negative pairs).

Order of Operations: Data Augmentation → Encoder → Projection Head → Contrastive Loss.

Projection Head: A small MLP that “absorbs” the contrastive loss, allowing the encoder to maintain general features.
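A minimal sketch of the NT-Xent contrastive loss used by SimCLR (the temperature is illustrative):

import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
  # z1, z2: (N, d) projections of two augmented views of the same N images
  N = z1.shape[0]
  z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)           # (2N, d), unit norm
  sim = z @ z.T / temperature                                  # pairwise cosine similarities
  sim.fill_diagonal_(float('-inf'))                            # a sample is not its own pair
  targets = torch.cat([torch.arange(N) + N, torch.arange(N)])  # positive = the other view
  return F.cross_entropy(sim, targets)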

BYOL: Bootstrap Your Own Latent

Learns representations without negative examples.

Uses two asymmetric networks (Online and Target). The Target weights are an Exponential Moving Average (EMA) of the Online weights, preventing the model from collapsing to a trivial constant output without needing negative pairs.
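A minimal sketch of the EMA update of the target network (tau is illustrative):

@torch.no_grad()
def ema_update(online, target, tau=0.99):
  for p_online, p_target in zip(online.parameters(), target.parameters()):
    p_target.data.mul_(tau).add_((1 - tau) * p_online.data)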