Vision Transformers: Straight From The Original Research Paper
A deep dive into the Vision Transformer (ViT) architecture that transformed Computer Vision, and a guide to building one from scratch
The Transformer architecture changed natural language processing forever and became the most popular model for language tasks after its introduction in 2017.
At the time, Convolutional Neural Networks (CNNs) were still considered the most effective models for image processing and Computer Vision tasks.
But this was about to change.
In 2021, a team of researchers from Google Brain published their findings, in which they applied a Transformer directly to the sequences of image patches for image classification tasks.
Their method achieved outstanding results on popular image recognition benchmarks compared to the state-of-the-art CNNs, using significantly fewer computational resources for training.
They called their architecture — Vision Transformer (ViT).
Here’s a story where we explore how ViTs work, how they transformed Computer Vision, and learn to build one from scratch directly from the original research paper.
Let’s begin!
But First, What’s So Good With Transformers?
The Transformer has been the dominant architecture behind LLMs since 2017.
What makes this architecture so successful is its Attention mechanism.
Attention allows a model to focus on different parts of the input sequence when making predictions.
The mechanism weighs the importance of each token and captures relationships between tokens regardless of how far apart they are in the sequence.
This helps the model decide which tokens from the input sequence are most relevant to the token being processed.
A variant called Scaled Dot-Product Attention was introduced in the original Transformer architecture.
This is calculated using the following values:
Query (Q): a vector representing the current token that the model is processing.
Key (K): a vector representing each token in the sequence.
Value (V): a vector containing the information associated with each token.
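For reference, here is how Scaled Dot-Product Attention is computed in the original Transformer paper ("Attention Is All You Need"), where d_k is the dimensionality of the Key vectors:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

The dot products between Queries and Keys produce relevance scores, the softmax turns them into weights, and those weights are used to combine the Values.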
This mechanism helps Transformers reach state-of-the-art performance on language tasks.
The question remained — Can they achieve such performance on Computer Vision tasks as well?
Vision Transformers Are Born
In 2021, a team of researchers from Google Brain applied the Transformer architecture directly to the sequences of image patches for image classification tasks.
They called their architecture — Vision Transformer or ViT.
This architecture follows the original Transformer implementation as closely as possible.
The input to a standard Transformer being trained for a language task is a 1D sequence of token embeddings.
Similarly, to train a Transformer on images, each image is first divided into fixed-size, non-overlapping patches.
Each patch is then flattened and linearly projected to create Patch embeddings.
Each patch plays the same role that a ‘token’ plays for a standard Transformer trained on language tasks.
These embeddings, along with Positional embeddings, are fed to the Encoder part of a Transformer, and its outputs are used to classify a given input image.
Let’s understand this process in more detail.
Exploring The Vision Transformer One Step At A Time
1. Creating Patch Embeddings
Say that each input image in the training dataset has dimensions H x W x C, where H, W, and C represent its height, width, and number of channels.
Each image is reshaped into a sequence of N flattened patches, each of size P² x C, where (P, P) is the resolution of each patch and N = HW / P² is the resulting number of patches.
In other words, each patch is flattened into a vector of size P² x C.
Each patch is then mapped into a latent space of D dimensions using a trainable linear projection. This results in the Patch embeddings.
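As a concrete example, a 224 x 224 x 3 image split into 16 x 16 patches yields N = (224 x 224) / 16² = 196 patches, each flattened into a vector of 16² x 3 = 768 values. In the paper's notation, a trainable projection matrix E maps every flattened patch x_p^i into the D-dimensional latent space, producing the sequence of Patch embeddings:

$$[\,x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E\,], \qquad E \in \mathbb{R}^{(P^2 \cdot C) \times D}$$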
2. Adding Positional Embeddings
Since the Transformer architecture is not sequential, learnable 1-D Positional embeddings are added to the Patch embeddings.
3. Adding The x(class) Token
The next step borrows inspiration from the BERT architecture, where a special [CLS] token is added at the beginning of each input sequence.
This [CLS] token represents the entire input sequence, and its final hidden state representation C (shown in the image below) is used as an input for further classification tasks.
Similar to the above, an x(class) token is added to the sequence of Patch embeddings.
This token, after being processed by the L layers of the Transformer block, is denoted as z(L) and captures the global representation of the entire image.
The previous equation is thus modified to the following:
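$$z_0 = [\,x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E\,] + E_{\text{pos}}, \qquad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\; E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}$$

Here E_pos denotes the learnable 1-D Positional embeddings from the previous step; the extra row accounts for the x(class) token.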
This is the resulting input for the Transformer’s Encoder block.
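Putting steps 1–3 together, here is a minimal PyTorch sketch of how this input sequence can be built. The module and attribute names are my own, and the default sizes assume a 224 x 224 image, 16 x 16 patches, and D = 768; none of this is taken from the paper's official code.

```python
import torch
import torch.nn as nn

class ViTEmbeddings(nn.Module):
    """Builds the encoder input: patch embeddings + x(class) token + positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A Conv2d with kernel_size = stride = patch_size is equivalent to slicing
        # non-overlapping patches and applying a shared linear projection to each one.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable x(class) token and learnable 1-D positional embeddings
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                                    # x: (B, C, H, W)
        patches = self.proj(x).flatten(2).transpose(1, 2)    # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)      # (B, 1, D)
        z0 = torch.cat([cls, patches], dim=1)                # (B, N + 1, D)
        return z0 + self.pos_embed                           # add positional embeddings


z0 = ViTEmbeddings()(torch.randn(2, 3, 224, 224))
print(z0.shape)  # torch.Size([2, 197, 768])
```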
4. Feeding Embeddings To Transformer’s Encoder Block
Similar to the BERT architecture, only the Transformer’s Encoder block is used to process the embeddings.
This block consists of alternating Multi-Head Attention and MLP layers (each MLP is made up of two feed-forward layers with a GELU non-linearity in between).
Layer normalization (Norm) is applied before every block, and Residual connections (represented by the + sign in the image) are added after every block.
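Here is a minimal PyTorch sketch of one such Encoder layer, assuming the pre-norm arrangement described above; the hyperparameters (12 heads, an MLP that expands the dimension 4x) are common ViT-Base choices, not requirements.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder layer: pre-norm Multi-Head Attention and MLP, each wrapped in a residual connection."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        hidden_dim = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),                        # GELU non-linearity between the two feed-forward layers
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, z):                     # z: (B, N + 1, D)
        h = self.norm1(z)                     # Norm before the attention block
        z = z + self.attn(h, h, h, need_weights=False)[0]   # residual (+) after it
        z = z + self.mlp(self.norm2(z))       # Norm before the MLP block, residual after
        return z
```

A full ViT encoder simply stacks L of these blocks one after another.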
5. Adding A Classification Head
A Classification head processes the output z(L) from the Encoder.
This output is the compact representation of the entire image, obtained through the x(class) token.
During ViT’s pre-training, this classification head is an MLP with one hidden layer. However, during fine-tuning, it is simplified to a single linear layer.
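A sketch of both variants, assuming the class-token representation has D = 768 dimensions and there are 1000 target classes (the hidden size and the tanh non-linearity here are illustrative choices, not specified in the text above):

```python
import torch.nn as nn

embed_dim, num_classes = 768, 1000   # illustrative sizes

# Pre-training: an MLP with one hidden layer
pretraining_head = nn.Sequential(
    nn.Linear(embed_dim, embed_dim),
    nn.Tanh(),
    nn.Linear(embed_dim, num_classes),
)

# Fine-tuning: a single linear layer
finetuning_head = nn.Linear(embed_dim, num_classes)
```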
The following equations summarise how embeddings are transformed in the process.
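As given in the original paper, where MSA is Multi-Head Self-Attention, LN is layer normalization, and z_L^0 is the final state of the x(class) token (denoted z(L) earlier in this post):

$$z_0 = [\,x_{\text{class}};\; x_p^1 E;\; \dots;\; x_p^N E\,] + E_{\text{pos}}$$
$$z'_\ell = \text{MSA}(\text{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad \ell = 1, \dots, L$$
$$z_\ell = \text{MLP}(\text{LN}(z'_\ell)) + z'_\ell, \qquad \ell = 1, \dots, L$$
$$y = \text{LN}(z_L^{0})$$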
Does Vision Transformer Perform Really Well?
Convolutional Neural Networks (CNNs) come with built-in biases that help them perform better with images.
These are:
Locality: The assumption that nearby image pixels are related.
2-D neighbourhood structure: The assumption that the 2D spatial arrangement of image pixels (height and width) matters to an image’s meaning.
Translation Equivariance: The assumption that if an object in an image shifts to another location, the features representing that object should shift similarly. In other words, the object’s identity won’t change just because it has moved to another location in the image.
Vision Transformers (ViTs) lack these built-in inductive biases.
ViTs use the Self-attention mechanism, which treats the image patches as a set and makes no assumptions about their spatial arrangement.
Thus, they have to learn how an image is structured from scratch.
Despite this, they absolutely smash the previous state-of-the-art models.
How do they achieve this?
ViTs learn the spatial structure of an image by learning the Positional embeddings of different patches during training.
Their Multi-head attention mechanism allows some heads to attend to large sections of the image even in the early layers, while others focus on smaller, more local details.
Overall, this attention mechanism captures and prioritizes the semantically important parts of an image.
In the original research paper, ViT’s performance is compared against two leading CNN-based models:
Big Transfer (BiT): Based on ResNet, using transfer learning with large-scale supervised pre-training
Noisy Student: Based on EfficientNet, trained using semi-supervised learning
The results show that ViTs not only exceed the performance of these baselines but also do so at a lower computational cost, evident from the fewer TPUv3-core-days required for pre-training.
Vision Transformers (ViTs) also generalize well, classifying images across the natural, specialized, and structured task groups of the Visual Task Adaptation Benchmark (VTAB).
Coding Up A Vision Transformer From Scratch
It’s time to implement what we have learned above.