It’s built on the principles of the Transformer.
Just as the Transformer treats a sentence as a sequence of words, we treat an image as a sequence of patches. And just as attention scores between multiple words help predict the next word, attention scores between the patches of an image can be used the same way, for example to predict the next frames in a video.
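To make the patch-as-token idea concrete, here is a minimal patch-embedding sketch (a rough PyTorch illustration; the `PatchEmbedding` class and its hyperparameters are illustrative assumptions, not code from the ViT paper):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each patch
    to an embedding vector, turning the image into a token sequence."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A stride=patch_size convolution is equivalent to flattening each
        # patch and applying one shared linear projection to all of them.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768): one token per patch

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```

From here, the 196 patch tokens play the same role that word tokens play in a text Transformer.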

  • It avoids convolutional assumptions and instead learns all spatial structure directly from the data (see the sketch after this list)
  • It scales well with data and compute (especially useful for large datasets)
  • It needs a lot of data to match CNN performance: without convolutional priors, ViTs are data-hungry and tend to overfit on small datasets
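To make the “no convolutional priors” point concrete, below is a minimal ViT-style classifier sketch (a rough PyTorch illustration; `TinyViT` and all hyperparameters are illustrative, not a specific paper’s configuration). Aside from the patchify projection, every spatial relationship is learned through self-attention and the position embeddings:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: patch tokens + a [CLS] token +
    learned position embeddings + a standard Transformer encoder."""
    def __init__(self, img_size=224, patch_size=16, embed_dim=192,
                 depth=4, num_heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, embed_dim, patch_size, patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)           # (B, 1, D)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)   # self-attention over all patches
        return self.head(tokens[:, 0])  # classify from the [CLS] token
```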

Solutions to the above problems include:

  • Pre-Train on Large Datasets (JFT, ImageNet-21k)
  • Data Augmentation (Mixup, CutMix, RandAugment); see the Mixup sketch after this list
  • Regularization (DropPath, Label Smoothing)
  • Knowledge Distillation (DeiT: Data-Efficient Image Transformer)
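As a concrete example of one of these techniques, here is a minimal Mixup sketch (a rough PyTorch illustration; the function name, `alpha=0.2`, and the assumption that targets arrive as one-hot/soft label vectors are illustrative choices):

```python
import torch

def mixup(images, targets, alpha=0.2):
    """Mixup: blend random pairs of images and their labels, so the model
    trains on soft, interpolated targets instead of hard one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_targets = lam * targets + (1 - lam) * targets[perm]  # targets: (B, num_classes)
    return mixed_images, mixed_targets
```

DeiT-style training recipes typically combine Mixup with CutMix and a RandAugment policy.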

Applications:

  • Semantic Segmentation (SegFormer, SETR)
  • Object Detection (ViTDet, DETR)
  • Video Classification (TimeSformer)
  • Multi-Modal Models (CLIP, Flamingo, GPT-4V)