OneNet: End-to-End One-Stage Object Detection by Classification Cost

December 2020

tl;dr: An easy-to-deploy, end-to-end object detector. Classification cost is the key to removing NMS.

Overall impression

This paper is from the authors (孙培泽 et al.) of Sparse R-CNN. It has a clear focus on deployment: a single-stage detector with association by a minimum-cost function (instead of Hungarian matching). This is perhaps one of the best-written papers from the boom of end-to-end object detectors.

Existing approaches (anchor-based or anchor-free) assign labels by location cost only, either IoU-based box assignment or point-distance-based point assignment. Without a classification cost, localization cost alone produces redundant high-confidence boxes at inference, making NMS a necessary post-processing step.
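A tiny numeric sketch of this point (the cost values are hypothetical, chosen for illustration): two candidate samples near the same GT box have nearly tied location costs, so a location-only assignment treats both as good positives and NMS must suppress one of them at inference; adding the classification term yields a single clear minimum.

```python
import torch

# Hypothetical costs for 3 candidate samples vs. 1 GT box.
cost_loc = torch.tensor([[0.10], [0.11], [0.90]])  # samples 0 and 1 nearly tied
cost_cls = torch.tensor([[0.50], [0.05], [0.80]])  # classification cost breaks the tie

# Location-only cost: samples 0 and 1 are near-duplicates -> redundant positives.
# Combined cost: a unique minimum, so a single positive sample per GT.
C = cost_cls + cost_loc
_, src_ind = C.min(dim=0)
print(src_ind.item())  # sample 1 is the unique positive
```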

Introducing classification to handle sudden changes in the value domain is common practice, since in practice sudden changes reflect multimodality. For example, to handle the over-smoothing issue in depth prediction, Depth Coefficient and SMWA introduce classification over multiple modes. A regression model responds smoothly to smooth changes in the input; therefore, if the matching cost contains only a localization term, we get duplicate bboxes around GT boxes.

DeFCN and OneNet: both add a classification term to the label-assignment cost to remove NMS.

Key ideas

Technical details

DETR can be viewed as the first end-to-end object detection method; DETR utilizes a sparse set of object queries to interact with the global image feature. Benefiting from the global attention mechanism and the bipartite matching between predictions and ground-truth objects, DETR can discard the NMS procedure while achieving remarkable performance. Deformable DETR restricts each object query to a small set of crucial sampling points around the reference points, instead of all points in the feature map. Sparse R-CNN starts from a fixed, sparse set of learned object proposals and iteratively performs classification and localization with the object recognition head.

# C is the matching cost matrix, shape (nr_sample, nr_gt):
# one row per candidate sample, one column per ground-truth box.
C = cost_class + cost_l1 + cost_giou

# Minimum-cost assignment: for each GT (column), the sample (row) with the
# lowest cost becomes the single positive sample; no Hungarian matching needed.
_, src_ind = torch.min(C, dim=0)
tgt_ind = torch.arange(nr_gt)
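A self-contained, runnable version of the snippet above. Random tensors stand in for the real classification, L1, and GIoU cost terms (shapes and names are illustrative; in the paper each term is also weighted):

```python
import torch

torch.manual_seed(0)
nr_sample, nr_gt = 100, 5  # e.g. feature-map locations vs. GT boxes

# Stand-ins for the real cost terms, each of shape (nr_sample, nr_gt).
cost_class = torch.rand(nr_sample, nr_gt)
cost_l1 = torch.rand(nr_sample, nr_gt)
cost_giou = torch.rand(nr_sample, nr_gt)

# Combined matching cost.
C = cost_class + cost_l1 + cost_giou

# Minimum-cost assignment: one positive sample index per GT.
_, src_ind = torch.min(C, dim=0)
tgt_ind = torch.arange(nr_gt)
print(src_ind.shape, tgt_ind.shape)
```

Note that, unlike Hungarian matching, this per-column argmin is not one-to-one over samples: in principle two GTs could select the same sample, though with a dense set of candidates this is rare.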