ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

October 2020

tl;dr: Break images into 16x16 image patches and treat them as visual tokens to leverage the scalability of transformers.
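
The patch tokenization in the tl;dr can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the helper name `patchify`, the channels-last layout, and the 224x224 RGB input are assumptions made for the example.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration. Returns an array of shape
    (num_patches, patch_size * patch_size * C), one row per visual token,
    ordered row by row across the patch grid.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (H, W, C) -> (gh, p, gw, p, C): split each spatial axis into grid x patch
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    # Bring the two grid axes to the front: (gh, gw, p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten the grid into a sequence and each patch into a vector
    return patches.reshape(gh * gw, patch_size * patch_size * c)

# Assumed input size: 224x224 RGB, which gives a 14x14 grid of patches
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)  # (196, 768)
```

Each row of `tokens` plays the role of one "word" in the sequence; in a full model these flattened patches would still need to be mapped to the transformer's embedding dimension before being fed in.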

Overall impression

This paper, together with the earlier DETR work from FAIR, ushers in a new era of transformer applications in CV.

Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and thus do not generalize well when trained on insufficient amounts of data. However, when trained on large amounts of data, large-scale training trumps inductive bias.

Key ideas

Technical details