M3D-RPN: Monocular 3D Region Proposal Network for Object Detection

October 2019

tl;dr: Regress 2D and 3D bbox parameters simultaneously by precomputing 3D mean stats for each 2D anchor.
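A minimal sketch of how such per-anchor 3D statistics could be precomputed. It assumes each 2D anchor shape is matched to ground-truth boxes by shape-only 2D IoU with a 0.5 threshold; the function names, the matching rule, and the threshold are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def iou_2d(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def anchor_3d_priors(anchors_wh, gt_boxes_2d, gt_params_3d, iou_thresh=0.5):
    """Average the 3D parameters of the GT boxes that match each 2D anchor shape.

    anchors_wh:   (A, 2) anchor widths and heights in pixels
    gt_boxes_2d:  (N, 4) GT boxes as [x1, y1, x2, y2]
    gt_params_3d: (N, P) GT 3D parameters, e.g. [z, w, h, l, yaw] (assumed layout)
    Returns an (A, P) array; rows stay NaN for anchors with no match.
    """
    anchors_wh = np.asarray(anchors_wh, dtype=float)
    gt_boxes_2d = np.asarray(gt_boxes_2d, dtype=float)
    gt_params_3d = np.asarray(gt_params_3d, dtype=float)

    # Compare shapes only: center every GT box at the origin.
    gt_wh = gt_boxes_2d[:, 2:4] - gt_boxes_2d[:, 0:2]
    gt_centered = np.concatenate([-gt_wh / 2, gt_wh / 2], axis=1)

    priors = np.full((len(anchors_wh), gt_params_3d.shape[1]), np.nan)
    for i, (w, h) in enumerate(anchors_wh):
        anchor = np.array([-w / 2, -h / 2, w / 2, h / 2])
        match = iou_2d(anchor, gt_centered) >= iou_thresh
        if match.any():
            # Plain arithmetic mean; a circular mean may suit yaw better.
            priors[i] = gt_params_3d[match].mean(axis=0)
    return priors
```

The network then only has to regress offsets from these priors, which is the same trick 2D anchors play for box width and height.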

Overall impression

M3D-RPN directly regresses 2D and 3D bboxes (11 + num_class numbers), similar to SS3D, which directly regresses 26 numbers, and D4LCN, which directly regresses 39 numbers.
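As a quick sanity check on the 11 + num_class count, here is a small sketch of the output size. The split of the 11 box numbers into 4 (2D box), 3 (projected 3D center and depth), and 4 (3D size and yaw) is my reading of the parameterization and should be treated as an assumption, as should the function name.

```python
def rpn_output_channels(num_anchors, num_classes):
    """Number of output channels for a head predicting (11 + num_classes) per anchor."""
    box_2d = 4      # assumed: x, y, w, h of the 2D box
    center_3d = 3   # assumed: projected center x, y and depth z
    shape_3d = 4    # assumed: 3D w, h, l and yaw
    per_anchor = num_classes + box_2d + center_3d + shape_3d
    return num_anchors * per_anchor
```

For example, `rpn_output_channels(36, 4)` gives 540 channels for a hypothetical setup with 36 anchors and 4 classes.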

The algorithm requires 3D GT in the first place, and it requires accurate intrinsics. For datasets without intrinsics, it may be necessary to predict the intrinsics as a weakly supervised problem.
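To see why the intrinsics matter, consider a sketch that assumes the 3D center is predicted as a projected pixel location plus a depth. Recovering the metric 3D position then needs the camera matrix K. This uses a plain pinhole model with no skew or distortion; the K values in the usage note are made up for illustration.

```python
import numpy as np

def backproject(u, v, z, K):
    """Lift a pixel (u, v) with depth z to camera coordinates using a pinhole K."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def project(point, K):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    uvw = K @ point
    return uvw[:2] / uvw[2]
```

With a hypothetical `K = [[700, 0, 600], [0, 700, 180], [0, 0, 1]]`, the pixel (950, 250) at depth 20 m back-projects to (10, 2, 20). An error in the focal length scales the lateral position by the same factor, which is why inaccurate intrinsics translate directly into 3D localization error.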

The paper focuses on a single-stage 2D and 3D regression task, and the raw outputs are less accurate, particularly for yaw. This is evidenced by the post-processing yaw adjustment stage, which leads to a 5% accuracy increase. The adjustment takes an additional 18 ms, which is too slow for real-time applications.
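One plausible form of such a yaw adjustment is a hill-climbing search that perturbs yaw until the projected 3D box best matches the predicted 2D box. The sketch below is written under that assumption; the corner convention, the L1 loss, the step sizes, and all function names are illustrative and not the paper's exact procedure.

```python
import numpy as np

def box_corners_3d(center, dims, yaw):
    """Eight corners of a 3D box (w, h, l) rotated by yaw about the camera y axis."""
    w, h, l = dims
    x = np.array([l, l, -l, -l, l, l, -l, -l]) / 2
    y = np.array([h, h, h, h, -h, -h, -h, -h]) / 2
    z = np.array([w, -w, -w, w, w, -w, -w, w]) / 2
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return (R @ np.vstack([x, y, z])).T + center

def projected_rect(center, dims, yaw, K):
    """Axis-aligned 2D bounds [u_min, v_min, u_max, v_max] of the projected 3D box."""
    pts = box_corners_3d(center, dims, yaw) @ K.T
    uv = pts[:, :2] / pts[:, 2:3]
    return np.concatenate([uv.min(axis=0), uv.max(axis=0)])

def rect_loss(box_2d, center, dims, yaw, K):
    """L1 distance between the projected 3D box bounds and the 2D box."""
    return np.abs(projected_rect(center, dims, yaw, K) - box_2d).sum()

def refine_yaw(box_2d, center, dims, yaw, K,
               step=0.3 * np.pi, decay=0.5, min_step=0.01, max_iters=1000):
    """Hill-climb over yaw: accept a +/- step if it lowers the loss, else shrink the step."""
    best = rect_loss(box_2d, center, dims, yaw, K)
    for _ in range(max_iters):
        if step < min_step:
            break
        candidates = [yaw + step, yaw - step]
        losses = [rect_loss(box_2d, center, dims, c, K) for c in candidates]
        i = int(np.argmin(losses))
        if losses[i] < best:
            yaw, best = candidates[i], losses[i]
        else:
            step *= decay
    return yaw
```

Each iteration projects the box twice, so the cost grows with the number of detections and search steps, which is consistent with a post-processing stage adding noticeable latency.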

The paper is correct that many previous SOTA algorithms use pretrained components, which sometimes introduce constant noise in training.

It can do mono3D for cyclists and pedestrians.

The depth-aware convolution is extended further in Learning Depth-Guided Convolutions.

This work forms the baseline for kinematic mono3D which performs monocular video 3D object detection.

This work also directly inspired D4LCN, which brings the idea of depth aware convolution one step further, and uses the 2D/3D anchor idea.

Key ideas

Technical details