Learning-Deep-Learning

DETR: End-to-End Object Detection with Transformers

June 2020

tl;dr: Transformer used for object detection as direct set prediction.

Overall impression

Formulates object detection as a direct set prediction problem, removing the need for engineering-heavy components such as anchor boxes and NMS.

The attention mechanism in transformers is similar to Non-local Networks. Attention has perfect memory and the same “distance” between any two points in the image, regardless of how far apart they are spatially.
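A minimal sketch of scaled dot-product self-attention illustrates the constant-“distance” property: every position attends to every other position in a single step, unlike the growing receptive fields of stacked convolutions. (The dimensions and weight matrices below are arbitrary toy values, not DETR's actual configuration.)

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over a set of N feature vectors.

    The (N, N) weight matrix connects every position to every other in one
    step, so any two points in the image are one "hop" apart.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (N, N) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # (N, d) attended features

# toy example: 5 "pixels" with 8-dim features
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
```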

Previous methods, whether anchor-based or anchor-free, implicitly impose an ordering between GT and predictions. The Hungarian loss used in DETR eliminates that altogether by computing the loss between two unordered sets via bipartite matching.
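A sketch of the idea: find the one-to-one assignment between predictions and GT boxes that minimizes the total matching cost, then compute the loss under that assignment. DETR uses the Hungarian algorithm (e.g. scipy's `linear_sum_assignment`); the brute-force O(N!) version below is stdlib-only and fine for a toy example. The cost values are made up.

```python
from itertools import permutations

def hungarian_match(cost):
    """Brute-force minimum-cost bipartite matching (illustration only).

    cost[i][j] is the matching cost between prediction i and GT box j
    (in DETR: classification probability plus box L1/GIoU terms).
    Returns the permutation mapping prediction index -> GT index.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost

# toy cost matrix: 3 predictions vs 3 GT boxes
cost = [[0.9, 0.1, 0.5],
        [0.4, 0.8, 0.2],
        [0.3, 0.6, 0.7]]
assignment, total = hungarian_match(cost)
```

Because the loss is computed only under the optimal matching, neither set needs a canonical ordering, which is what lets DETR drop anchors and NMS.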

This paper is extended by Deformable DETR, which speeds up the training of transformers by more than 10x.

Key ideas

Technical details

Notes