CenterTrack: Tracking Objects as Points

July 2020

tl;dr: Use CenterNet to additionally predict the offset of each object center between neighboring frames; a greedy nearest-neighbor matching on the offset-compensated centers then works well for association.
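
A minimal sketch of this greedy, offset-based association, assuming per-detection center points, confidences, and predicted backward offsets (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def greedy_associate(curr_centers, offsets, scores, prev_centers, prev_ids, radius):
    """Greedy nearest-neighbor matching of current detections to previous tracks.

    offsets: network-predicted displacement from each current center back to
    the previous frame. All names here are illustrative.
    """
    ids = -np.ones(len(curr_centers), dtype=int)    # -1 = unmatched -> new track
    if len(prev_centers) == 0:
        return ids
    used = np.zeros(len(prev_centers), dtype=bool)
    # Higher-confidence detections pick their match first.
    for i in np.argsort(-np.asarray(scores)):
        projected = curr_centers[i] + offsets[i]    # expected position last frame
        dist = np.linalg.norm(prev_centers - projected, axis=1)
        dist[used] = np.inf                         # each track matched at most once
        j = int(np.argmin(dist))
        if dist[j] < radius:                        # gate by distance
            ids[i] = prev_ids[j]
            used[j] = True
    return ids
```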

Overall impression

CenterNet achieves the mono3D SOTA on nuScenes as of 07/04/2020. It reaches 34 mAP, on par with (in fact slightly above) PointPillars, the LiDAR-based approach from one year earlier at 31 mAP. (The LiDAR SOTA has since been refreshed by CenterPoint to 60 mAP.)

The complete cycle

In the early days of computer vision, tracking was phrased as following interest points through space and time. It was then overtaken by "tracking-by-detection" (or tracking following detection), which follows detected bounding boxes through time. The tracking stage usually relies on a slow and complicated association strategy (Hungarian matching), based either on IoU of 2D bboxes, on learned embedding vectors of each object, or on a motion model with an EKF. The simple displacement prediction here is akin to sparse optical flow: KLT (Kanade-Lucas-Tomasi) consists of GFT (good features to track) + LK (Lucas-Kanade, by Takeo Kanade @ CMU). CenterNet plays the role of GFT, and the offset vector plays the role of LK.

Tracktor is the first simultaneous detection and tracking paper, predicting bbox detection and bbox offset for tracking at the same time. CenterTrack simplifies the process further by reducing the bounding box to a point with attributes.
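
To make "a point with attributes" concrete, a hypothetical container for one tracked object might look like this (field names are mine, not the paper's):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TrackedPoint:
    """A detection reduced to a center point plus regressed attributes."""
    center: Tuple[float, float]   # (x, y) peak location on the heatmap
    size: Tuple[float, float]     # (w, h) regressed at the center
    score: float                  # peak heatmap value, used as confidence
    offset: Tuple[float, float]   # displacement back to the previous frame
    track_id: int = -1            # filled in by association; -1 = new track
```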

Previous joint detection and tracking works are all based on the two-stage Faster RCNN framework, where the tracked bboxes are used as region proposals. This has two issues: 1) it assumes that the bounding boxes of the same object in consecutive frames have a large overlap, which does not hold in low-frame-rate regimes (such as nuScenes); 2) it is hard to run such a two-stage MOT pipeline in real time.

CenterTrack feeds the detection results from the previous frame as additional input to boost detection performance in the current frame. This tracking-conditioned detection is akin to an autoregressive model; the same trick is also used in ROLO. It provides a temporally coherent set of detected objects (only their locations are fed back, not their sizes).
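
The prior detections are rendered as a single class-agnostic heatmap that is stacked with the current and previous frames as network input. A minimal sketch, with illustrative names and a fixed sigma (the paper sizes the Gaussian per object):

```python
import numpy as np

def render_prior_heatmap(prev_centers, height, width, sigma=2.0):
    """Render last frame's detected centers as one class-agnostic Gaussian heatmap."""
    heatmap = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for cx, cy in prev_centers:
        gauss = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, gauss)        # keep the per-pixel peak
    return heatmap

# The network input stacks the current frame, the previous frame, and the
# prior heatmap along the channel axis: 3 + 3 + 1 = 7 input channels.
# net_input = np.concatenate([curr_img, prev_img, heatmap[None]], axis=0)
```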

Key ideas

Technical details

Notes