It’s built on the principles of the Transformer.
Just as the Transformer treats a sentence as a sequence of words, we treat an image as a sequence of patches. And just as attention scores between multiple words help predict the next word, attention scores between the patches of an image can be used the same way, for example to predict the next frames in a video.
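To make the patch-as-token idea concrete, here is a minimal patch-embedding sketch (a rough PyTorch illustration; the `PatchEmbedding` class and its hyperparameters are illustrative assumptions, not code from the ViT paper):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each patch
    to an embedding vector, turning the image into a token sequence."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A stride=patch_size convolution is equivalent to flattening each
        # patch and applying one shared linear projection to all of them.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768): one token per patch

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```

From here, the 196 patch tokens play the same role that word tokens play in a text Transformer.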

  • It avoids convolutional assumptions and instead learns all spatial structure directly from the data (see the sketch after this list)
  • It scales well with data and compute (especially useful for large datasets)
  • It needs a lot of data to match CNN performance: without convolutional priors, ViTs are data-hungry and tend to overfit on small datasets
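To make the “no convolutional priors” point concrete, below is a minimal ViT-style classifier sketch (a rough PyTorch illustration; `TinyViT` and all hyperparameters are illustrative, not a specific paper’s configuration). Aside from the patchify projection, every spatial relationship is learned through self-attention and the position embeddings:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: patch tokens + a [CLS] token +
    learned position embeddings + a standard Transformer encoder."""
    def __init__(self, img_size=224, patch_size=16, embed_dim=192,
                 depth=4, num_heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, embed_dim, patch_size, patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)           # (B, 1, D)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)   # self-attention over all patches
        return self.head(tokens[:, 0])  # classify from the [CLS] token
```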

Solutions to the above problems include:

  • Pre-Train on Large Datasets (JFT, ImageNet-21k)
  • Data Augmentation (Mixup, CutMix, RandAugment); see the Mixup sketch after this list
  • Regularization (DropPath, Label Smoothing)
  • Knowledge Distillation (DeiT: Data-Efficient Image Transformer)
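As a concrete example of one of these techniques, here is a minimal Mixup sketch (a rough PyTorch illustration; the function name, `alpha=0.2`, and the assumption that targets arrive as one-hot/soft label vectors are illustrative choices):

```python
import torch

def mixup(images, targets, alpha=0.2):
    """Mixup: blend random pairs of images and their labels, so the model
    trains on soft, interpolated targets instead of hard one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_targets = lam * targets + (1 - lam) * targets[perm]  # targets: (B, num_classes)
    return mixed_images, mixed_targets
```

DeiT-style training recipes typically combine Mixup with CutMix and a RandAugment policy.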

Applications:

  • Semantic Segmentation (SegFormer, SETR)
  • Object Detection (ViTDet, DETR)
  • Video Classification (TimeSformer)
  • Multi-Modal Models (CLIP, Flamingo, GPT-4V)