TSP: Rethinking Transformer-based Set Prediction for Object Detection

November 2020

tl;dr: Train DETR faster and better with sparse attention and guided Hungarian matching.

Overall impression

The paper digs into the reasons why DETR is so hard to train, primarily the issues with the Hungarian loss and the Transformer cross-attention mechanism. It proposes two improved versions of DETR built on existing detectors: TSP-FCOS (improved FCOS) and TSP-RCNN (improved Faster RCNN).

This work basically borrows the best practices of modern object detectors (FPN, FCOS and Faster-RCNN), replaces the dense prior heads with a DETR encoder, and uses a set prediction loss.

There are several papers on improving the training speed of DETR.

A standard 36-epoch (3x) schedule can yield a SOTA object detector.

Object detection is essentially a set prediction problem, as the ordering of the predicted objects does not matter. Most modern object detectors use a detect-and-merge strategy and make predictions on a set of dense priors. The dense priors produce near-duplicate detections, which makes NMS necessary. The detection model is trained agnostic to the merging step, so the optimization is not end-to-end and is arguably sub-optimal.
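To make the merge step concrete, here is a minimal sketch of greedy NMS in plain Python. It is an illustration only, not code from the paper: the `(x1, y1, x2, y2)` box format, the function names `iou` and `nms`, and the 0.5 threshold are all assumptions chosen for the example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box
    only if it does not overlap an already-kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes on one object, plus one separate object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the surviving boxes
```

Note that the threshold is a hand-tuned postprocessing parameter that the training loss never sees, which is the sense in which the pipeline is not end-to-end.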

DETR removes handcrafted parts such as dense prior design, many-to-one label assignment and NMS postprocessing.

DETR removes the need for NMS, as the self-attention component can learn to remove duplicate detections. The Hungarian loss encourages one prediction per object through bipartite matching.
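The one-to-one matching behind the Hungarian loss can be sketched as below. This is a simplified illustration under stated assumptions: the matching cost here is only the L1 distance between boxes (DETR's actual cost also involves class scores), and the minimum-cost assignment is found by brute force over permutations rather than by the Hungarian algorithm, which is only practical for a handful of boxes. In real code one would typically use a solver such as `scipy.optimize.linear_sum_assignment`. The names `l1_cost` and `match` are hypothetical.

```python
from itertools import permutations


def l1_cost(pred, target):
    """L1 distance between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match(preds, targets):
    """Return (assignment, cost), where assignment[t] is the index of the
    prediction matched to target t and no prediction is used twice.
    Brute force: tries every one-to-one assignment (assumes
    len(preds) >= len(targets))."""
    best_cost, best_assign = float("inf"), None
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(l1_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assign = cost, perm
    return list(best_assign), best_cost


# Three predictions (two are near-duplicates), two ground-truth objects.
preds = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
targets = [(0, 0, 10, 10), (50, 50, 60, 60)]

assignment, total_cost = match(preds, targets)
# Predictions left unmatched are supervised as "no object".
unmatched = [i for i in range(len(preds)) if i not in assignment]
```

Because each ground-truth object is matched to exactly one prediction, the near-duplicate prediction ends up unmatched and is pushed toward "no object" during training, which is what lets the model suppress duplicates without NMS.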

Key ideas

Technical details