SS3D: Monocular 3D Object Detection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss

October 2019

tl;dr: CenterNet-like structure that directly regresses 26 attributes per object to fit a 3D bbox.

Overall impression

The paper uses a CenterNet-style architecture to regress bounding boxes. The support region is analogous to a Gaussian kernel at the center of the object. The donut region surrounding the kernel is a “don’t care” region.
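To make the support / “don’t care” split concrete, here is a minimal sketch of a per-pixel label map. It is an illustration and not the paper's implementation: the elliptical shape, the function name `support_region_labels`, and the two scale parameters are assumptions chosen for this example.

```python
import numpy as np

def support_region_labels(height, width, box, support_scale=0.3, ignore_scale=0.6):
    """Label each pixel: 1 = support region, -1 = don't care, 0 = background.

    box is (x1, y1, x2, y2) in pixels. The scales are hypothetical fractions
    of the box half-size, not values taken from the paper.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = (x2 - x1) / 2.0, (y2 - y1) / 2.0

    ys, xs = np.mgrid[0:height, 0:width]
    # Normalized elliptical distance from the box center (1.0 = box edge on the axes).
    dist = np.sqrt(((xs - cx) / half_w) ** 2 + ((ys - cy) / half_h) ** 2)

    labels = np.zeros((height, width), dtype=np.int8)
    labels[dist <= ignore_scale] = -1   # donut: excluded from the loss
    labels[dist <= support_scale] = 1   # inner core: regression targets live here
    return labels

labels = support_region_labels(100, 100, (20, 20, 80, 80))
print(labels[50, 50], labels[50, 65], labels[0, 0])
```

Pixels labeled -1 would simply be masked out of the loss, so the network is penalized neither for firing nor for staying silent there.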

The algorithm requires 3D GT to begin with, as well as accurate intrinsics. (KITTI 3D bbox GT is given in camera coordinates, so extrinsics do not matter.)
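Why only the intrinsics matter can be seen from a small sketch: a box already expressed in camera coordinates projects to the image with the 3x3 matrix K alone. The code assumes the KITTI label convention (dimensions as height, width, length; location at the center of the bottom face; rotation about the camera Y axis). The function names and the values in K are made up for illustration, and this is not code from the paper.

```python
import numpy as np

def box3d_corners(dims, location, rotation_y):
    """Return the 8 corners (8x3) of a 3D box in camera coordinates.

    Assumes KITTI conventions: dims = (h, w, l), location = bottom-face
    center, camera frame with x right, y down, z forward.
    """
    h, w, l = dims
    x = np.array([l, l, -l, -l, l, l, -l, -l]) / 2.0
    y = np.array([0, 0, 0, 0, -h, -h, -h, -h], dtype=float)  # y points down
    z = np.array([w, -w, -w, w, w, -w, -w, w]) / 2.0
    c, s = np.cos(rotation_y), np.sin(rotation_y)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # rotation about Y
    return (rot @ np.vstack([x, y, z])).T + np.asarray(location, dtype=float)

def project(points_cam, K):
    """Pinhole projection of Nx3 camera-frame points to Nx2 pixels."""
    uvw = (K @ points_cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]

# Hypothetical intrinsics, chosen only for this example.
K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])

corners = box3d_corners(dims=(1.5, 1.6, 4.0), location=(0.0, 1.5, 20.0), rotation_y=0.0)
pixels = project(corners, K)
```

No rotation or translation between sensors appears anywhere, which is the sense in which extrinsics drop out; an error in K, on the other hand, shifts every projected corner.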

SS3D directly predicts 2D and 3D bboxes, similar to M3D-RPN and D4LCN.

This paper also demonstrates that the distance to cars can be regressed directly from 2D images. See the YouTube videos. The result looks quite similar to NVIDIA’s DRIVE demo.

Key ideas

Technical details