
Vision Transformers

Vectorized self-attention in the encoder

Self-attention can be written in a compact, vectorized form.

Given a sequence of embeddings stacked in a matrix $X \in \mathbb{R}^{T \times d}$ (each row is a token embedding):

  1. Compute queries, keys, and values:

    $Q = X W^Q, \quad K = X W^K, \quad V = X W^V,$

    where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$.

  2. Compute attention scores:

    $E = Q K^\top \in \mathbb{R}^{T \times T}.$

  3. Apply softmax row-wise:

    $A = \text{softmax}(E).$

  4. Compute the output:

    $\text{Output} = A V.$

This is often written as:

$\text{Attention}(Q,K,V) = \text{softmax}(QK^\top)\,V.$

In scaled dot-product attention, we divide by $\sqrt{d_k}$:

$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V,$

which improves numerical stability when $d_k$ is large.

Self-attention can be viewed as a learned, differentiable key–value lookup where each query selects a weighted combination of values based on similarity to keys.
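
To make the vectorized form concrete, here is a minimal sketch of single-head scaled dot-product self-attention, assuming PyTorch; the random matrices stand in for the learned projections $W^Q, W^K, W^V$:

```python
# Minimal single-head scaled dot-product self-attention (illustrative sketch).
import math
import torch

T, d, d_k = 5, 16, 8                      # sequence length, embedding dim, head dim
X = torch.randn(T, d)                     # rows are token embeddings
W_q, W_k, W_v = (torch.randn(d, d_k) for _ in range(3))   # stand-ins for W^Q, W^K, W^V

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # queries, keys, values: (T, d_k)
E = Q @ K.T / math.sqrt(d_k)              # scaled scores: (T, T)
A = torch.softmax(E, dim=-1)              # row-wise softmax: each row sums to 1
out = A @ V                               # weighted combination of values: (T, d_k)

print(out.shape)                          # torch.Size([5, 8])
```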

ConvNets vs Transformers (conceptual comparison)

The slides highlight high-level differences between convolutional networks and transformers.

Convolutional networks (CNNs)

  • Operate on grid-structured inputs (e.g. images).

  • Use local filters and weight sharing across spatial positions.

  • Implicitly enforce translation invariance:

    • Convolution kernels depend only on relative position within a local neighborhood.

  • Build large receptive fields by:

    • Stacking many layers,

    • Using pooling or strided convolutions to downsample.

Transformers

  • Use self-attention to connect all positions:

    • Any token can attend to any other in one step (global receptive field).

  • Do not have built-in translation invariance:

    • Use positional encodings instead of relative positions in the kernel.

  • Are highly parallelizable across positions.

For images, the question is:

Can we treat an image as a sequence and apply transformers directly, without convolutions?

Vision Transformer (ViT): main idea

The core idea of the Vision Transformer is to treat an image as a sequence of patches and apply a standard transformer encoder for classification.

Conceptually:

  1. Split image into patches:

    • Input image size: $H \times W \times C$.

    • Choose patch size: $P \times P$ (e.g. $16 \times 16$).

    • The image is reshaped into $N = \frac{HW}{P^2}$ patches, each of size $P^2 C$.

  2. Flatten patches:

    • Each patch is flattened into a vector $x_i \in \mathbb{R}^{P^2 C}$.

  3. Linear projection:

    • Each patch vector is mapped to a $D$-dimensional embedding:

      $z_i^0 = E_{\text{patch}} x_i \in \mathbb{R}^D.$
  4. Class token:

    • Prepend a learnable embedding $z_{\text{class}}^0$ to the sequence.

    • Its final representation after the transformer encoder is used as the image representation for classification.

  5. Positional embeddings:

    • Add a learnable 1D positional embedding $p_i$ to each patch (and class) embedding:

      $z_i^0 \leftarrow z_i^0 + p_i.$
  6. Transformer encoder:

    • Apply a standard transformer encoder (stack of multi-head self-attention + MLP blocks) to the sequence $[z_{\text{class}}^0, z_1^0, \dots, z_N^0]$.

  7. Classification head:

    • Take the final class token $z_{\text{class}}^L$ from the top encoder layer.

    • Feed it into an MLP classifier:

      • Often a small MLP with one hidden layer during pretraining,

      • Possibly a single linear layer for fine-tuning.

In short:

ViT = Patches → Linear embeddings + class token → Transformer encoder → MLP head.
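
As an illustration of this pipeline, the following toy-scale sketch (assuming PyTorch) runs steps 1–7 end to end; the image size, patch size, embedding dimension $D$, depth, and head count are small illustrative choices rather than the paper's, and `nn.TransformerEncoder` stands in for the transformer encoder of step 6:

```python
# Toy-scale ViT-style forward pass (illustrative sketch, not the paper's configuration).
import torch
import torch.nn as nn

B, C, H, W = 2, 3, 32, 32                 # batch of small RGB images
P, D, depth, heads, n_classes = 8, 64, 4, 4, 10
N = (H // P) * (W // P)                   # number of patches per image

x = torch.randn(B, C, H, W)

# 1-2. split into P x P patches and flatten each to a vector of size P*P*C
patches = x.unfold(2, P, P).unfold(3, P, P)          # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)

# 3. linear projection to D-dimensional patch embeddings
patch_embed = nn.Linear(C * P * P, D)
tokens = patch_embed(patches)                        # (B, N, D)

# 4-5. prepend a learnable class token and add learnable positional embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, D))
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))
tokens = torch.cat([cls_token.expand(B, -1, -1), tokens], dim=1) + pos_embed

# 6. standard pre-norm transformer encoder
layer = nn.TransformerEncoderLayer(d_model=D, nhead=heads, dim_feedforward=4 * D,
                                   batch_first=True, norm_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=depth)
z = encoder(tokens)                                  # (B, N + 1, D)

# 7. classification head on the final class token
head = nn.Linear(D, n_classes)
logits = head(z[:, 0])                               # (B, n_classes)
print(logits.shape)                                  # torch.Size([2, 10])
```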

Vision Transformer architecture in more detail

The slides summarize ViT with the following components:

  • Patch + position embedding:

    • Image is divided into patches and each is linearly projected to dimension $D$.

    • A learnable class token is prepended.

    • Position embeddings are added to each token (including class token).

  • Transformer encoder (repeated $L$ times):

    • LayerNorm → Multi-head self-attention → residual connection,

    • LayerNorm → MLP (position-wise feed-forward) → residual connection.

  • MLP head:

    • Takes the final representation of the class token,

    • Outputs class logits (bird, ball, car, ...).

Symbolically, for layer $\ell$:

  1. Self-attention sublayer:

    $\tilde{Z}^{(\ell)} = \text{LayerNorm}\big(Z^{(\ell)}\big), \qquad Z^{(\ell)}_{\text{attn}} = Z^{(\ell)} + \text{MultiHeadSelfAttn}\big(\tilde{Z}^{(\ell)}\big).$

  2. Feed-forward sublayer:

    $\hat{Z}^{(\ell)} = \text{LayerNorm}\big(Z^{(\ell)}_{\text{attn}}\big), \qquad Z^{(\ell+1)} = Z^{(\ell)}_{\text{attn}} + \text{FFN}\big(\hat{Z}^{(\ell)}\big).$

The depth $L$, hidden size $D$, MLP size, and number of heads are varied across ViT model variants.
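
A minimal sketch of one such pre-norm block, assuming PyTorch, with `nn.MultiheadAttention` and a two-layer GELU MLP standing in for MultiHeadSelfAttn and FFN:

```python
# One pre-norm encoder block matching the two sublayer equations above (sketch).
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, D: int, heads: int, mlp_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(D)
        self.attn = nn.MultiheadAttention(D, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(D)
        self.ffn = nn.Sequential(nn.Linear(D, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, D))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Z_attn = Z + MultiHeadSelfAttn(LayerNorm(Z))
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]
        # Z_next = Z_attn + FFN(LayerNorm(Z_attn))
        return z + self.ffn(self.norm2(z))

block = EncoderBlock(D=64, heads=4, mlp_dim=256)
print(block(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```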

ViT model variants and sizes

The original ViT paper defines several standard configurations, similar to BERT:

  • ViT-Base:

    • Layers: 12

    • Hidden size $D$: 768

    • MLP size: 3072

    • Attention heads: 12

    • Parameters: $\approx$ 86M

  • ViT-Large:

    • Layers: 24

    • Hidden size $D$: 1024

    • MLP size: 4096

    • Attention heads: 16

    • Parameters: $\approx$ 307M

  • ViT-Huge:

    • Layers: 32

    • Hidden size $D$: 1280

    • MLP size: 5120

    • Attention heads: 16

    • Parameters: $\approx$ 632M
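
As a rough sanity check on these totals: the parameter counts are dominated by the attention projections ($4D^2$ per layer) and the MLP ($2 \cdot D \cdot \text{MLP size}$ per layer). The short Python sketch below, which ignores biases, LayerNorms, embeddings, and the classification head, recovers numbers close to the quoted values:

```python
# Back-of-the-envelope parameter count for the three ViT configurations,
# keeping only the dominant per-layer terms (QKVO projections + MLP).
configs = {
    "ViT-Base":  dict(layers=12, D=768,  mlp=3072),
    "ViT-Large": dict(layers=24, D=1024, mlp=4096),
    "ViT-Huge":  dict(layers=32, D=1280, mlp=5120),
}

for name, c in configs.items():
    per_layer = 4 * c["D"] ** 2 + 2 * c["D"] * c["mlp"]
    total = c["layers"] * per_layer
    print(f"{name}: ~{total / 1e6:.0f}M encoder parameters")
# ViT-Base: ~85M, ViT-Large: ~302M, ViT-Huge: ~629M
```

The remaining parameters come from the patch and position embeddings, biases, LayerNorms, and the classification head.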

Notation like ViT-L/16:

  • “L” refers to the Large configuration,

  • “/16” refers to a patch size of $16 \times 16$,

  • The sequence length is inversely proportional to the square of the patch size:

    • Smaller patches → longer sequences → higher compute cost for attention.
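
As a worked example of this relation (the numbers follow directly from $N = HW/P^2$): a $224 \times 224$ image with $16 \times 16$ patches gives $N = (224/16)^2 = 196$ tokens, ViT-H/14 with $14 \times 14$ patches gives $N = 256$, and halving the patch size to $8 \times 8$ would quadruple $N$ to 784, increasing the quadratic attention cost roughly sixteen-fold.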

Key observation:

  • As in NLP, ViT performance tends to improve with larger models and larger training datasets.

Pretraining and data requirements

The slides discuss ViT performance on image classification benchmarks and its dependence on pretraining data size.

Findings:

  • When trained on mid-sized datasets like ImageNet alone, ViT achieves modest accuracies, often below strong CNN baselines.

  • When pretrained on very large datasets (e.g. JFT-300M, ImageNet-21k) and then fine-tuned, ViT achieves state-of-the-art or competitive performance.

Example summary:

  • ViT-L/16 and ViT-H/14 pretrained on JFT-300M outperform strong CNN baselines (e.g. “BiT” ResNets, EfficientNet-L2) on a variety of datasets:

    • ImageNet,

    • CIFAR-10/100,

    • Oxford Pets,

    • Flowers,

    • VTAB tasks.

Data efficiency:

  • ViT has fewer inductive biases for vision than CNNs:

    • It does not encode translation invariance or locality explicitly.

  • As a result, ViT behaves similarly to language transformers:

    • Requires very large pretraining datasets to generalize well.

    • Benefits strongly from transfer learning: pretrain on massive data, then fine-tune on specific tasks.

Effect of dataset size (qualitative):

  • With small pretraining datasets (e.g. ImageNet-1k), larger ViT models can underperform smaller ones, because they overfit and cannot fully exploit their capacity.

  • As pretraining data grows (ImageNet-21k, JFT-300M), larger models start to dominate and yield higher accuracy.

What does ViT learn?

The slides show several visualizations from the ViT paper:

Patch embedding filters

  • The first linear projection that maps flattened patches to embeddings can be visualized.

  • Applying PCA to the learned $E_{\text{patch}}$ filters and plotting them as images reveals:

    • Many filters look like localized edge or color detectors,

    • Similar to early layers in CNNs.

This indicates that even without explicit convolution, ViT learns patch-level patterns reminiscent of CNN filters.

Attention distance

  • The mean attention distance of each head and layer can be measured (how far, in patch space, a token tends to attend).

  • Observations:

    • Some attention heads in lower layers already attend to distant patches, providing a large receptive field early on.

    • Others focus on nearby patches, capturing local structure.
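
One plausible way to compute this quantity for a single head, given its attention matrix over an $h \times w$ patch grid, is sketched below (assuming PyTorch; the exact protocol in the ViT paper, e.g. how the class token is handled, may differ in details):

```python
# Mean attention distance for one head over an h x w patch grid (sketch).
import torch

def mean_attention_distance(A: torch.Tensor, h: int, w: int) -> torch.Tensor:
    # A: (h*w, h*w) attention weights, rows sum to 1 (class token excluded here).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (h*w, 2)
    dist = torch.cdist(coords, coords)           # pairwise patch-space distances
    return (A * dist).sum(dim=-1).mean()         # expected distance, averaged over queries

A = torch.softmax(torch.randn(196, 196), dim=-1)   # e.g. a 14 x 14 patch grid
print(mean_attention_distance(A, 14, 14))
```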

Analogy:

  • Attention distance is comparable to the receptive field in CNNs, but:

    • Self-attention can access global context in a single layer,

    • CNNs need many layers to build such large receptive fields.

Attention maps

  • By visualizing attention weights from the class token (or from certain heads), we see:

    • The transformer focuses attention on semantically relevant regions of the image,

    • E.g. the object of interest (dog, car, bird) rather than background.

These visualizations support the idea that ViT learns meaningful global and local interactions through attention.

Combining CNNs and attention: motivation (CoAtNet)

Despite the strong performance of ViTs with massive pretraining, the slides note:

  • Transformers in vision often lag behind state-of-the-art CNNs on tasks with:

    • Limited data,

    • Strong inductive biases needed (e.g. local structure, translation invariance).

  • Transformers tend to have larger model capacity, but weaker inductive bias:

    • They may overfit small datasets,

    • Generalization can be worse compared to CNNs trained on the same data.

Idea:

Combine the strengths of convolutions and self-attention in a single architecture.

  • Use convolution to capture local patterns and provide strong inductive bias.

  • Use attention to capture global interactions and long-range dependencies.

CoAtNet is one such hybrid architecture explored in the slides.

Convolution and self-attention: mathematical comparison

The slides compare depthwise convolution and self-attention in a unified notation.

Let $x_i$ denote the input feature at spatial position $i$.

Depthwise convolution

With a local neighborhood $L(i)$ (e.g. a $3 \times 3$ window), depthwise convolution computes:

$y_i = \sum_{j \in L(i)} w_{i-j} \cdot x_j,$

where:

  • $w_{i-j}$ is a learned kernel weight depending only on the relative position $(i-j)$,

  • The kernel is input-independent,

  • The operation is local and translationally invariant.

Self-attention

Let $G$ denote the set of all positions. Self-attention can be written as:

$y_i = \sum_{j \in G} A_{i,j} x_j, \qquad A_{i,j} = \frac{\exp(x_i^\top x_j)}{\sum_{k \in G} \exp(x_i^\top x_k)}.$

Here:

  • The attention weights $A_{i,j}$ depend on the content (features),

  • The operation is global (sum over all positions),

  • No inherent translation invariance (positions must be encoded separately).
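
The contrast between the two aggregation rules can be seen side by side in the short sketch below (assuming PyTorch; the attention branch uses raw dot-product scores as in the formula above, without learned projections or scaling):

```python
# Depthwise convolution (local, input-independent kernel w_{i-j}) versus
# global content-based self-attention weights A_{i,j} over all positions.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C, H, W = 1, 8, 14, 14
x = torch.randn(B, C, H, W)

# Depthwise convolution: groups=C gives one 3x3 kernel per channel,
# applied over the local neighborhood L(i) with shared relative-position weights.
dw = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C, bias=False)
y_conv = dw(x)                                    # (B, C, H, W)

# Global self-attention (simplified: scores are raw dot products x_i^T x_j).
tokens = x.flatten(2).transpose(1, 2)             # (B, N, C), N = H*W positions
scores = tokens @ tokens.transpose(1, 2)          # (B, N, N) content similarity
A = F.softmax(scores, dim=-1)                     # attention weights A_{i,j}
y_attn = A @ tokens                               # each y_i mixes all positions j in G

print(y_conv.shape, y_attn.shape)
```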

Comparison

  • Kernel:

    • Convolution: weights $w_{i-j}$ are fixed after training and do not depend on input.

    • Attention: weights $A_{i,j}$ are input-dependent and can capture complex relations.

  • Receptive field:

    • Convolution: local neighborhood $L(i)$ (small receptive field per layer).

    • Attention: global set $G$ (global receptive field in one layer).

  • Inductive bias:

    • Convolution: relies on relative positions; strong bias for local, translation-invariant features.

    • Attention: relies on learned content similarity; more flexible but with weaker structural bias.

This motivates architectures that combine both operations.

Relative self-attention in CoAtNet

To combine convolutional and attention-like behaviors, CoAtNet uses relative self-attention.

The idea:

  • Modify the attention scores by adding a relative positional kernel $w_{i-j}$:

    $y_i^{\text{pre}} = \sum_{j \in G} \frac{\exp(x_i^\top x_j + w_{i-j})}{\sum_{k \in G} \exp(x_i^\top x_k + w_{i-k})}\, x_j.$

Here:

  • $x_i^\top x_j$ is the content-based similarity (as in standard self-attention).

  • $w_{i-j}$ is a learnable weight depending on the relative position between $i$ and $j$.

  • The softmax is applied over all positions $j \in G$.

Interpretation:

  • If $w_{i-j}$ is large for nearby positions and small for distant ones, attention is biased toward local neighbors, mimicking convolutional behavior.

  • If $w_{i-j}$ is more uniform, attention can remain global.

  • The kernel remains input-independent, encoding structural biases, while the dot-product term incorporates input-dependent interactions.

This relative-attention formulation allows CoAtNet to:

  • Capture complex content-based dependencies,

  • Maintain useful inductive biases from convolutions (via relative positions).
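
A minimal sketch of this relative-attention score, assuming PyTorch and a 1D sequence of positions for clarity (on 2D feature maps, the relative-position table would be indexed by 2D offsets):

```python
# Relative self-attention: content logits x_i^T x_j plus an input-independent
# bias w_{i-j} indexed by the relative position (illustrative 1D sketch).
import torch
import torch.nn.functional as F

N, C = 6, 8                                   # positions and feature dim
x = torch.randn(N, C)
rel_bias = torch.randn(2 * N - 1)             # learnable table w_{i-j}, one entry per offset

idx = torch.arange(N)
offsets = idx[:, None] - idx[None, :]         # matrix of relative positions (i - j)
w = rel_bias[offsets + N - 1]                 # look up w_{i-j} for every pair (i, j)

scores = x @ x.T + w                          # x_i^T x_j + w_{i-j}
A = F.softmax(scores, dim=-1)                 # softmax over all j in G
y = A @ x                                     # y_i^pre = sum_j A_{i,j} x_j

print(y.shape)                                # torch.Size([6, 8])
```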

CoAtNet vertical design: stages and downsampling

Applying global self-attention at the pixel level is computationally prohibitive:

  • Complexity scales as $O(N^2)$ where $N$ is the number of tokens (pixels or patches).

CoAtNet addresses this with a stage-wise design similar to CNNs:

  • Input: $224 \times 224$ image.

  • Stem (S0):

    • Convolutional layers downsample to a coarser grid (e.g. $112 \times 112$).

  • Stages S1–S4:

    • At each stage, spatial resolution is further reduced (e.g. $56 \times 56$, $28 \times 28$, $14 \times 14$, $7 \times 7$),

    • The number of channels is increased.

Within stages:

  • Early stages (higher resolution) use convolutional blocks:

    • Standard or depthwise convs,

    • $1 \times 1$ convs as bottlenecks,

    • Residual connections.

  • Later stages (lower resolution) use relative self-attention blocks and feed-forward networks.

The slides mention that good results (in terms of generalization, capacity, and transferability) were obtained with:

  • Three convolutional blocks/stages, followed by

  • Two transformer blocks/stages.

Global pooling and a fully connected (FC) layer at the end produce classification logits.

This vertical design:

  • Keeps early computations efficient and local via convolutions,

  • Uses attention when the sequence length is reduced enough to make it tractable,

  • Mimics the progressive downsampling seen in ResNets and other CNNs.
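
The layout can be summarized schematically as below (Python; the stage resolutions and block types follow the description above, and the printed token counts illustrate why global attention only becomes affordable in the later stages):

```python
# Schematic CoAtNet-style vertical layout for a 224 x 224 input (sketch).
stages = [
    ("S0 (conv stem)",   112, "convolution"),
    ("S1 (conv blocks)",  56, "convolution"),
    ("S2 (conv blocks)",  28, "convolution"),
    ("S3 (transformer)",  14, "relative self-attention + FFN"),
    ("S4 (transformer)",   7, "relative self-attention + FFN"),
]

for name, res, block in stages:
    n_tokens = res * res
    print(f"{name:18s} {res:3d}x{res:<3d}  tokens = {n_tokens:5d}  ({block})")

# Attention cost scales as O(N^2): at 56x56 there are 3136 tokens, still prohibitive,
# while 14x14 = 196 and 7x7 = 49 tokens are cheap to attend over globally.
```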

CoAtNet results and trade-offs

The slides show comparisons of:

  • Accuracy vs FLOPs,

  • Accuracy vs number of parameters,

for CoAtNet and competing models.

Qualitative conclusions:

  • CoAtNet achieves strong accuracy while maintaining:

    • Competitive or reduced FLOPs compared to pure transformer or pure CNN variants,

    • Good parameter efficiency.

  • By combining convolution and attention:

    • It benefits from convolutional inductive biases on small/medium datasets,

    • It leverages attention to capture global interactions and improve performance on challenging benchmarks.

More broadly, hybrid architectures like CoAtNet illustrate that:

Neither pure CNNs nor pure transformers are optimal for all regimes; combining them can yield better accuracy–efficiency trade-offs.

Summary

  • Self-attention and transformer encoders, originally developed for sequences, can be applied to images by:

    • Splitting images into patches,

    • Embedding patches and adding positional information,

    • Prepending a class token and using a transformer encoder.

  • Vision Transformers (ViT) show that:

    • Pure transformer architectures can achieve state-of-the-art performance on image classification,

    • But they require large-scale pretraining due to weaker inductive biases than CNNs.

  • ViT internal behavior:

    • Patch embedding layers learn filters similar to early CNN layers,

    • Some attention heads attend to distant patches even in lower layers,

    • Attention maps focus on semantically important regions of the image.

  • Convolution vs attention:

    • Convolution uses local, input-independent kernels and strong translation-invariance bias,

    • Self-attention uses global, input-dependent weights but lacks structured biases,

    • Relative self-attention bridges these by adding learnable relative position terms to attention scores.

  • CoAtNet and similar hybrids:

    • Combine convolutional stages for local feature extraction and efficient downsampling,

    • With transformer stages for global, content-based interactions,

    • Achieve strong performance and favorable accuracy–efficiency trade-offs.

These ideas provide a conceptual foundation for modern vision architectures that increasingly integrate both convolution and attention mechanisms.