Basic Concepts
Neural network types and when to use them¶
Deep learning offers several standard network types, each with typical application domains.
Key Takeaways¶
Architecture choice depends on input structure:
MLPs for vector features,
CNNs for grid-like data,
RNNs/Transformers for sequences,
GNNs for general graphs.
MLPs are universal function approximators, but dense and parameter-heavy for structured inputs.
CNNs leverage convolution:
Weight sharing, sparse connections, and receptive fields enable efficient representation learning on grids.
A rich set of design patterns makes CNNs more effective and efficient:
Downsampling: pooling, strided convs,
Normalization: batch norm,
Residual connections for deep networks,
Upsampling: interpolation, unpooling, transposed convs,
Bottlenecks and small kernels to reduce parameters,
Multi-scale filters (Inception),
Separable and depthwise separable convs,
Multi-head architectures for multi-task learning,
Sparsity and pruning to reduce cost.
Classic architectures (AlexNet, VGG, GoogLeNet, ResNet) instantiate these ideas in different ways and form the foundation for many modern deep learning models.
Multi-Layer Perceptrons (MLPs / dense networks)
Inputs and outputs are usually fixed-dimensional vectors.
Every neuron in one layer is connected to every neuron in the next.
Typical use cases:
Tabular data (e.g. apartment features → rent),
Final classification head after feature extraction,
Generic feature transformation and dimensionality reduction.
Convolutional Neural Networks (CNNs)
Exploit local structure and translation invariance.
Typical input types:
1D signals: audio, ECG, vibration signals, spectral data,
2D signals: images, game boards, card layouts,
3D data: volumetric medical scans, videos (2D + time), etc.
Standard choice for:
Image classification, detection, segmentation,
Many generative image models,
Any data that live on a grid.
Recurrent Neural Networks (RNNs) and sequence models
Process sequences step by step (text, time series, signals).
Historically used for:
Language modeling (e.g. predicting the next word),
Speech recognition,
Sequential prediction tasks.
(Modern sequence models are often Transformers, covered in later lectures.)
Graph Neural Networks (GNNs)
Neural networks on graph-structured data.
CNNs can be viewed as a special case where the underlying graph is a regular grid.
Typical use cases:
Molecules and materials,
Networks and relational data.
The choice of architecture is mainly driven by:
The structure of the input (vector, sequence, grid, graph),
The symmetries and invariances we want to exploit (e.g. translation invariance for images).
Multi-Layer Perceptrons (MLPs)¶
An MLP (multi-layer perceptron) is a stack of fully connected layers.
For a single neuron with inputs $x_1, \dots, x_n$:
Weights $w_1, \dots, w_n$,
Bias $b$,
Activation function $\sigma$ (e.g. ReLU, sigmoid, tanh),
the output is
$$y = \sigma\!\left(\sum_{i=1}^{n} w_i x_i + b\right).$$
A layer applies this operation to each neuron in parallel; stacking layers gives a deep network:
$$\mathbf{h}^{(l)} = \sigma\!\left(W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right), \qquad \mathbf{h}^{(0)} = \mathbf{x}.$$
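A minimal PyTorch sketch of such a stack (the layer widths and batch size here are arbitrary choices for illustration):

import torch
import torch.nn as nn

# Two hidden layers with ReLU activations; the sizes are illustrative.
mlp = nn.Sequential(
    nn.Linear(16, 64),   # x -> W1 x + b1
    nn.ReLU(),           # sigma
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 1),    # final output, e.g. a regression target
)

x = torch.randn(8, 16)   # batch of 8 feature vectors
y = mlp(x)               # shape: (8, 1)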
Universal approximation theorem¶
The slides recall the universal approximation theorem:
A feed-forward network with a single hidden layer and a suitable non-linear activation function can approximate any continuous function on a compact subset of $\mathbb{R}^n$, given enough hidden units.
Symbolically,
$$\sup_{x \in K}\left|\, f(x) - \sum_{i=1}^{N} v_i\,\sigma\!\left(w_i^\top x + b_i\right) \right| < \varepsilon,$$
where $N$ is large enough so that the approximation error is below $\varepsilon$.
This does not mean that any network is automatically good: the theorem only says that, in principle, the capacity is sufficient, while training and generalization are separate issues.
Typical uses of MLPs¶
Problems where inputs are already meaningful features (e.g. engineered descriptors, small vectors),
Final stages of larger networks:
After CNN feature extraction (for classification),
After attention/Transformer layers,
As generic non-linear feature transforms inside larger architectures.
Convolutional Neural Networks (CNNs)¶
CNNs are built from convolutional layers, which apply learnable filters across the input.
Convolution as an operation¶
Continuous-time convolution of functions $f$ and $g$:
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$
Discrete 1D convolution:
$$(f * g)[n] = \sum_{m} f[m]\, g[n - m]$$
2D convolution for images (with kernel $K$ and image $I$):
$$(I * K)(i, j) = \sum_{m}\sum_{n} I(i - m,\, j - n)\, K(m, n)$$
In many deep learning libraries, the implemented operation is actually cross-correlation (no kernel flip):
$$(I \star K)(i, j) = \sum_{m}\sum_{n} I(i + m,\, j + n)\, K(m, n)$$
Since the kernel is learned, the distinction does not matter in practice.
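A quick sketch with torch.nn.functional showing the kernel-flip distinction (the tensors and kernel are random and purely illustrative):

import torch
import torch.nn.functional as F

img = torch.randn(1, 1, 5, 5)   # (batch, channels, H, W)
k = torch.randn(1, 1, 3, 3)     # a single 3x3 kernel

cross_corr = F.conv2d(img, k)                            # what the library computes
true_conv = F.conv2d(img, torch.flip(k, dims=[2, 3]))    # flipped kernel -> true convolution

# The two results differ for a fixed kernel, but since the kernel is learned,
# the network can absorb the flip, so the distinction does not matter.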
Key properties of CNNs¶
Weight sharing:
The same filter is applied at all spatial positions.
Greatly reduces the number of parameters.
Sparse connectivity:
Each output location depends only on a local neighborhood in the input,
Defined by the kernel size (e.g. $3\times 3$, $5\times 5$).
Receptive field:
The region of the input that can influence a given activation.
Increases with depth: deeper layers see larger and more abstract regions.
Multiple channels¶
For multi-channel inputs (e.g. RGB images), each filter has a separate kernel per input channel; outputs are summed:
Input shape: $C_{\text{in}} \times H \times W$,
Kernel shape: $C_{\text{out}} \times C_{\text{in}} \times k \times k$,
Each output channel is a sum over input channels convolved with its corresponding kernels.
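A shape check in PyTorch, assuming the usual channels-first layout (the channel counts and image size are illustrative):

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(8, 3, 32, 32)   # (batch, C_in, H, W), e.g. RGB images
y = conv(x)                     # (8, 16, 32, 32): one map per output channel

# One 3x3 kernel per (output channel, input channel) pair:
print(conv.weight.shape)        # torch.Size([16, 3, 3, 3])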
CNNs as representation learning¶
CNNs automatically learn feature detectors:
Early layers: edges, colors, simple patterns,
Intermediate layers: textures, parts (eyes, wheels, etc.),
Deeper layers: object-level features.
Pipeline view:
Input → Feature extraction (convolutional stack) → Feature classification (MLP head) → Output
CNNs can also be seen as a special case of graph neural networks, where the underlying graph is the pixel grid.
Controlling spatial resolution: pooling and strided convolutions¶
As data flows through a CNN, we often reduce spatial size while increasing the number of channels.
Why reduce spatial size?¶
To aggregate information from larger regions,
To reduce computational cost and memory usage,
To design encoder–decoder architectures (encoder compresses, decoder expands).
Pooling¶
Max pooling and average pooling operate on local windows (e.g. $2\times 2$, $3\times 3$):
Max pooling:
Takes the maximum in each window.
Emphasizes strong activations and introduces some invariance to small translations.
Average pooling:
Takes the average value.
Smooths activations; less commonly used in modern deep CNNs than max pooling or strided convolutions.
Pooling reduces resolution by the stride of the pooling window (e.g. stride 2 halves height and width).
Strided convolutions¶
Instead of separate pooling layers, we can use strided convolutions:
The filter moves by more than one pixel at a time (stride > 1),
This simultaneously performs feature extraction and downsampling.
Strided convolutions and pooling are both used to control spatial resolution in many architectures.
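Both options side by side in PyTorch (channel counts and sizes are illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

pool = nn.MaxPool2d(kernel_size=2, stride=2)                       # pooling: no parameters
strided = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)    # learned downsampling

print(pool(x).shape)      # torch.Size([1, 16, 16, 16])
print(strided(x).shape)   # torch.Size([1, 32, 16, 16])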
Normalization and batch normalization¶
Deep networks can be hard to train because the distribution of activations in each layer changes during training.
This leads to:
Slower convergence,
Sensitivity to initialization,
More difficult tuning of learning rates.
Batch normalization¶
Batch normalization (BatchNorm) normalizes activations within a mini-batch and then applies a learned affine transform.
For each channel $c$:
Compute mini-batch mean and variance:
$$\mu_c = \frac{1}{m}\sum_{i=1}^{m} x_{i,c}, \qquad \sigma_c^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_{i,c} - \mu_c\right)^2,$$
where $x_{i,c}$ is the input activation and $m$ is the number of elements in the batch for that channel.
Normalize and scale/shift:
$$\hat{x}_{i,c} = \frac{x_{i,c} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}, \qquad y_{i,c} = \gamma_c\,\hat{x}_{i,c} + \beta_c,$$
with learnable parameters $\gamma_c$ and $\beta_c$.
Effects:
Stabilizes and accelerates training,
Often allows higher learning rates,
Acts as a regularizer and can improve generalization.
At inference time, running averages of $\mu_c$ and $\sigma_c^2$ are used instead of batch statistics.
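A sketch comparing the formula with nn.BatchNorm2d, whose weight and bias play the roles of $\gamma_c$ and $\beta_c$ (sizes are illustrative):

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=16)   # one (gamma, beta) pair per channel
x = torch.randn(8, 16, 32, 32)

y = bn(x)  # training mode: normalize with batch statistics, then scale/shift

# Manual computation of the same thing for comparison (eps matches bn.eps):
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + bn.eps)
y_manual = bn.weight.view(1, -1, 1, 1) * x_hat + bn.bias.view(1, -1, 1, 1)

print(torch.allclose(y, y_manual, atol=1e-5))  # True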
Residual connections¶
When networks become very deep, plain stacks of layers are hard to train:
Gradients can vanish or explode,
Adding more layers can hurt performance, even if, in principle, a deeper network should be at least as good.
Residual blocks¶
Residual networks (ResNets) introduce skip connections:
Instead of learning a mapping $H(x)$ directly, the network learns a residual function $F(x)$ such that
$$H(x) = F(x) + x.$$
A residual block computes:
$F(x)$ via a small stack of layers (e.g. conv → BN → ReLU → conv → BN),
Adds the input: $y = F(x) + x$,
Applies a non-linearity (often ReLU) after the addition.
Benefits:
The network can easily learn the identity mapping (just set $F(x) = 0$),
Gradients can propagate more directly through the skip connection,
Enables training of very deep networks (e.g. 50, 101, 152 layers and beyond).
Variants:
Identity skip connections when input and output shapes match,
Projection (e.g. $1\times 1$ conv) in the skip path to match dimensions,
Bottleneck blocks (with $1\times 1$ convs) in deeper ResNets.
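A minimal residual block sketch in PyTorch, following the conv → BN → ReLU → conv → BN pattern above, with a $1\times 1$ projection when shapes change (the layer widths are illustrative):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Residual branch F(x): conv -> BN -> ReLU -> conv -> BN
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Identity skip if shapes match, otherwise a 1x1 projection
        if stride == 1 and in_ch == out_ch:
            self.skip = nn.Identity()
        else:
            self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)

    def forward(self, x):
        # y = ReLU(F(x) + x)
        return torch.relu(self.body(x) + self.skip(x))

block = ResidualBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 16, 16])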
Increasing spatial resolution: upsampling and transposed convolutions¶
Encoder–decoder architectures (e.g. for segmentation, image generation, inpainting) must increase spatial resolution in later stages.
Upsampling¶
Common upsampling methods:
Nearest neighbor:
Each input pixel is replicated into a larger block (e.g. $2\times 2$),
Simple and fast; can produce blocky outputs.
Bed of nails:
Insert zeros between input pixels, then optionally apply a convolution,
Separates the “increase resolution” step from “learned filtering”.
Interpolation-based upsampling (bilinear, bicubic):
Smooth interpolations; often followed by a convolution layer.
Max unpooling¶
If max pooling was used in the encoder:
The positions of the max elements are stored,
Max unpooling puts the pooled value back into its original location in a larger map, filling other entries with zeros,
This can help preserve spatial structure (e.g. edges or boundaries).
Transposed convolutions¶
Transposed convolutions (sometimes called “deconvolutions”) implement learned upsampling:
Conceptually, they invert the shape change of a regular convolution with stride > 1,
Implementation:
Spread each input value into a larger output region via the filter,
Overlapping contributions are summed.
Properties:
Learnable filters, similar to standard convolutions,
Can increase both spatial resolution and adjust number of channels,
Widely used in encoder–decoder CNNs and generative models.
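Shape behaviour of the common upsampling options in PyTorch (channel counts and sizes are illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 32, 16, 16)

up_nn = nn.Upsample(scale_factor=2, mode='nearest')                        # nearest neighbor
up_bi = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)  # interpolation
up_tc = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)                # learned upsampling

print(up_nn(x).shape)  # torch.Size([1, 32, 32, 32])
print(up_bi(x).shape)  # torch.Size([1, 32, 32, 32])
print(up_tc(x).shape)  # torch.Size([1, 16, 32, 32])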
Bottleneck layers and channel reduction¶
Deep CNNs often use many channels, which can make later layers very expensive:
A convolution with kernel size $k \times k$, $C_{\text{in}}$ input channels and $C_{\text{out}}$ output channels has $k^2 \cdot C_{\text{in}} \cdot C_{\text{out}}$ parameters (ignoring biases).
If $C_{\text{in}}$ and $C_{\text{out}}$ are both large, this is costly.
Bottleneck idea¶
Use $1\times 1$ convolutions to reduce (or expand) the number of channels, forming a bottleneck:
Compress channels: $C \to C'$ (with $C' < C$) using a $1\times 1$ conv,
Apply the more expensive $3\times 3$ conv on the reduced $C'$ channels,
Optionally expand back: $C' \to C$ with another $1\times 1$ conv.
Example comparison:
Direct $3\times 3$ conv with $C$ input and output channels: $9\,C^2$ parameters,
Bottleneck using $1\times 1$ and $3\times 3$ convs with intermediate width $C'$: $C\,C' + 9\,C'^2 + C'\,C$ parameters, roughly $1.1\,C^2$ for $C' = C/4$.
So the bottleneck block has far fewer parameters for a similar receptive field.
Bottlenecks are a core component of:
Deeper ResNets (e.g. ResNet-50, ResNet-101),
Inception-style architectures (for dimensionality reduction before expensive convolutions).
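A quick parameter count for assumed widths ($C = 256$ and bottleneck width $C' = 64$, a common ResNet-style choice, not taken from the slides):

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

direct = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)

bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1, bias=False),             # compress channels
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),   # cheaper 3x3 conv
    nn.Conv2d(64, 256, kernel_size=1, bias=False),             # expand back
)

print(n_params(direct))      # 589824  (9 * 256 * 256)
print(n_params(bottleneck))  # 69632   (256*64 + 9*64*64 + 64*256)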
Deeper networks with small kernels¶
Another way to reduce parameters while keeping a large receptive field is to replace large kernels with stacks of small kernels.
Example:
A single $7\times 7$ convolution has $7 \times 7 = 49$ parameters per input–output channel pair.
Three stacked $3\times 3$ convolutions:
Each kernel has $3 \times 3 = 9$ parameters, so the stack uses $27$ per input–output channel pair,
Three layers give a receptive field of $7\times 7$ or larger, depending on padding and stride,
More layers = more non-linearities and representation power.
More generally:
Several small kernels can approximate the effect of a large kernel at lower cost,
Deep stacks of $3\times 3$ convs (as in VGG) have become a common pattern.
Benefits:
Increased depth (more expressiveness),
Fewer parameters,
More regular structure, easier to tune and implement.
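The same comparison checked in PyTorch, reduced to a single input and output channel for clarity:

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# One input and one output channel, no biases, to isolate the kernel sizes
big = nn.Conv2d(1, 1, kernel_size=7, padding=3, bias=False)
stack = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False),
)

print(n_params(big))    # 49
print(n_params(stack))  # 27; with ReLUs inserted between the convs, the stack also adds non-linearity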
Multi-scale filters and Inception-style modules¶
Classical image processing often uses filters at multiple scales to detect different types of features.
In CNNs, using a single kernel size (e.g. $3\times 3$) might miss patterns that are best captured by larger or smaller receptive fields.
Inception modules¶
Inception modules apply multiple filter sizes in parallel and then concatenate the results:
Branches might include:
$1\times 1$ convolutions,
$3\times 3$ convolutions,
$5\times 5$ convolutions,
Max pooling followed by a $1\times 1$ projection.
Outputs from all branches are concatenated along the channel dimension.
To control the number of parameters, Inception uses dimensionality reduction:
$1\times 1$ convolutions before the $3\times 3$ and $5\times 5$ branches to reduce the number of incoming channels,
This acts as a cheap bottleneck before expensive convolutions.
Ideas behind Inception:
Approximate a sparse optimal structure with a dense but efficient module,
Process visual information at multiple scales simultaneously,
Use dimensionality reduction to keep computation manageable.
GoogLeNet (Inception v1) and its successors (Inception v3, etc.) are built by stacking such modules.
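A stripped-down Inception-style module sketch in PyTorch (the branch widths are arbitrary and not GoogLeNet's actual configuration):

import torch
import torch.nn as nn

class MiniInception(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)       # 1x1 branch
        self.b3 = nn.Sequential(                            # 1x1 reduce, then 3x3
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
        )
        self.b5 = nn.Sequential(                            # 1x1 reduce, then 5x5
            nn.Conv2d(in_ch, 8, kernel_size=1),
            nn.Conv2d(8, 16, kernel_size=5, padding=2),
        )
        self.bp = nn.Sequential(                            # pool, then 1x1 projection
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 16, kernel_size=1),
        )

    def forward(self, x):
        # Concatenate branch outputs along the channel dimension
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

m = MiniInception(64)
print(m(torch.randn(1, 64, 28, 28)).shape)  # torch.Size([1, 80, 28, 28])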
Separable filters and depthwise separable convolutions¶
Large 2D kernels (e.g. $7\times 7$) can be decomposed into more efficient forms.
Spatially separable filters¶
A 2D kernel is separable if it can be written as an outer product of two 1D kernels.
Example from the slides:
Smoothing filter:
$$\frac{1}{3}\begin{bmatrix}1\\1\\1\end{bmatrix} \cdot \frac{1}{3}\begin{bmatrix}1 & 1 & 1\end{bmatrix} = \frac{1}{9}\begin{bmatrix}1 & 1 & 1\\ 1 & 1 & 1\\ 1 & 1 & 1\end{bmatrix}$$
Edge filter (e.g. a Sobel-type kernel):
$$\begin{bmatrix}1\\2\\1\end{bmatrix} \cdot \begin{bmatrix}1 & 0 & -1\end{bmatrix} = \begin{bmatrix}1 & 0 & -1\\ 2 & 0 & -2\\ 1 & 0 & -1\end{bmatrix}$$
Instead of a single $7\times 7$ filter (49 parameters), we can use a $7\times 1$ followed by a $1\times 7$ filter (14 parameters), achieving the same effect when the kernel is separable.
Depthwise separable convolutions¶
Standard convs mix spatial and cross-channel correlations in one operation.
Depthwise separable convolutions decouple these steps:
Depthwise convolution:
Convolve each input channel independently with its own spatial kernel (e.g. $3\times 3$),
Produces the same number of channels as input.
Pointwise convolution:
Apply a $1\times 1$ convolution across channels to mix them,
Changes the number of channels (e.g. from $C_{\text{in}}$ to $C_{\text{out}}$).
This is similar in spirit to an “extreme” Inception module but applied per channel first, then mixing.
Benefits:
Significant reduction in computation and parameters,
Widely used in efficient models (e.g. MobileNet, Xception).
Example PyTorch implementation from the slides:

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, nin, nout, kernel_size=3, padding=1, bias=False):
        super().__init__()
        # Depthwise: one spatial filter per input channel (groups=nin)
        self.depthwise = nn.Conv2d(
            nin, nin, kernel_size=kernel_size,
            padding=padding, groups=nin, bias=bias
        )
        # Pointwise: 1x1 convolution that mixes channels (nin -> nout)
        self.pointwise = nn.Conv2d(
            nin, nout, kernel_size=1, bias=bias
        )

    def forward(self, x):
        out = self.depthwise(x)
        out = self.pointwise(out)
        return out

Multi-head networks and shared feature representations¶
Sometimes we want a network to solve multiple tasks simultaneously while sharing most of its computation.
Examples from the slides:
AlphaZero:
Shared CNN backbone → separate heads for policy (next move) and value (position evaluation).
Multi-head physics-informed networks (PINNs):
Shared representation → different heads for different physical quantities or conditions.
Card game models (e.g. Jass):
Shared representation → heads for policy, value, and card distribution.
General pattern¶
A shared stem (backbone) processes the input into a feature representation,
Multiple heads branch off, each with its own layers and loss function.
Advantages:
Parameter sharing:
Reduced total number of parameters,
Faster training and inference.
Multi-task learning:
Tasks can regularize each other,
Shared features can generalize better.
Training:
Each head $k$ has its own loss $L_k$,
The total loss is often a weighted sum:
$$L = \sum_k \lambda_k\, L_k,$$
with task-specific weights $\lambda_k$.
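A minimal sketch of the shared-backbone pattern with two heads and a weighted loss (the sizes, head types, and weights are illustrative, not taken from AlphaZero or the slides):

import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared stem / backbone
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Task-specific heads
        self.policy_head = nn.Linear(32, 10)   # e.g. a distribution over moves/classes
        self.value_head = nn.Linear(32, 1)     # e.g. a scalar evaluation

    def forward(self, x):
        h = self.backbone(x)
        return self.policy_head(h), self.value_head(h)

net = TwoHeadNet()
x = torch.randn(4, 3, 32, 32)
policy, value = net(x)

# Weighted sum of per-head losses, L = sum_k lambda_k * L_k (dummy targets)
loss = 1.0 * nn.functional.cross_entropy(policy, torch.randint(0, 10, (4,))) \
     + 0.5 * nn.functional.mse_loss(value.squeeze(1), torch.randn(4))
loss.backward()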
Sparsity and pruning in deep networks¶
Modern networks can be very large, but in many cases only a fraction of the parameters are truly needed.
Why sparsity?¶
Reduce computational cost and memory footprint,
Enable deployment on constrained hardware,
Potentially improve interpretability.
Inducing sparsity¶
Approaches include:
$L_1$ regularization on weights:
Encourages many weights to be close to zero,
Small-magnitude weights can then be pruned (set exactly to zero).
Structured pruning:
Remove entire filters, channels, or blocks,
Often easier to implement efficiently than unstructured sparsity.
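A small sketch using PyTorch's torch.nn.utils.prune utilities (the pruning amounts are arbitrary examples):

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)

# Unstructured pruning: zero out the 50% smallest-magnitude weights
prune.l1_unstructured(conv, name="weight", amount=0.5)

# Structured pruning: remove 25% of entire filters (dim 0) by L2 norm
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)

sparsity = (conv.weight == 0).float().mean()
print(f"fraction of zero weights: {sparsity:.2f}")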
The slides highlight that:
There is often an Occam’s hill: as sparsity increases, the test error first improves (as redundant parameters are removed), then worsens once too much capacity is pruned.
Key takeaway:
Carefully introduced sparsity can make networks cheaper and sometimes even more accurate, but excessive pruning degrades performance.
Classical CNN architectures and their design ideas¶
Several influential CNN architectures illustrate the design patterns discussed above.
AlexNet¶
Early success on ImageNet (2012).
Architecture:
Large initial $11\times 11$ convolution with stride 4,
Multiple conv and pooling layers,
Two large fully connected layers (4096 units each),
Output layer for 1000 ImageNet classes.
Key ideas at the time:
Use of ReLU activations,
Data augmentation,
Dropout to reduce overfitting,
GPU training.
VGG¶
Uses only $3\times 3$ convolutions throughout the network.
Stacks many such layers to increase depth:
Simple, uniform architecture,
Demonstrated that deeper networks with small kernels perform very well.
Heavy on parameters due to large fully connected layers at the end.
GoogLeNet / Inception¶
Introduced Inception modules with multiple kernel sizes ($1\times 1$, $3\times 3$, $5\times 5$) and pooling in parallel.
Uses $1\times 1$ convolutions for dimensionality reduction.
Overall architecture:
Inception modules stacked with occasional pooling,
Auxiliary classifiers (extra heads) during training to help gradients,
Less reliance on large fully connected layers.
ResNet¶
Introduces residual connections (skip connections) to allow very deep networks.
Variants:
Plain ResNets with basic blocks,
Bottleneck ResNets using $1\times 1$ convs around $3\times 3$ convs.
Eliminates many intermediate pooling layers and relies on strides and residual blocks.
Scales to very deep networks (e.g. 152 or even 1000+ layers).
These architectures combine:
Convolutional feature extraction,
Downsampling and upsampling strategies,
Normalization and residual connections,
Bottleneck and multi-scale modules,
and they motivate many of the modern design choices in current CNN-based models.
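For reference, torchvision ships implementations of these classical architectures, which makes it easy to compare their sizes (a sketch assuming a recent torchvision; weights are left uninitialized here):

import torchvision.models as models

# Reference implementations of the classical architectures discussed above
nets = {
    "AlexNet": models.alexnet(weights=None),
    "VGG-16": models.vgg16(weights=None),
    "GoogLeNet": models.googlenet(weights=None, init_weights=True),
    "ResNet-50": models.resnet50(weights=None),
}

for name, net in nets.items():
    n_params = sum(p.numel() for p in net.parameters())
    print(f"{name}: {n_params / 1e6:.1f} M parameters")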