MLF: Multi-Level Fusion based 3D Object Detection from Monocular Images

Aug 2019

tl;dr: Estimate a depth map from a monocular RGB image and concatenate it with the RGB channels to form an RGBD input for monocular 3D object detection.

Overall impression

This paper inspired a more influential paper, pseudo-lidar. In particular, Figure 3 essentially contains the idea of projecting the depth map to a point cloud, but it is only used to visualize the detection results. From Fig. 3 it is quite natural to think of running object detection on this pseudo point cloud. Unfortunately, the paper just concatenates the depth channel to RGB, which yields only suboptimal performance (similar to ROI 10D).
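The projection shown in Figure 3 is the standard pinhole back-projection that pseudo-lidar later built its detector on. A minimal sketch, assuming a dense depth map and hypothetical camera intrinsics (the function name and parameter values below are illustrative, not from the paper):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (H, W) into an (N, 3) point cloud
    using the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with invalid (non-positive) depth

# toy example: a flat 2x2 depth map at 5 m, made-up intrinsics
pts = depth_to_point_cloud(np.full((2, 2), 5.0), fx=700.0, fy=700.0, cx=0.5, cy=0.5)
```

Feeding such points into a lidar-style detector, rather than concatenating depth as a fourth image channel, is exactly the step pseudo-lidar took.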

The idea of regressing the 3D location from both local features and global features is sound, but the formulation (adding the predictions from the two branches) is doubtful. Why not concatenate the features instead?
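The contrast between the two formulations can be sketched with toy linear heads (feature sizes, weights, and names below are illustrative, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
feat_local = rng.standard_normal(128)   # hypothetical per-ROI feature
feat_global = rng.standard_normal(128)  # hypothetical whole-image feature

# Formulation as described in the paper: each branch regresses the 3D
# location independently, and the two predictions are summed.
W_local = rng.standard_normal((3, 128))
W_global = rng.standard_normal((3, 128))
loc_by_addition = W_local @ feat_local + W_global @ feat_global

# Alternative questioned here: concatenate the features first,
# then regress the location with a single head.
W_concat = rng.standard_normal((3, 256))
loc_by_concat = W_concat @ np.concatenate([feat_local, feat_global])
```

With purely linear heads the two are equivalent, but once nonlinear layers follow the fusion point, concatenation lets the network model interactions between local and global cues instead of committing each branch to a full location estimate on its own.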

This paper also separates depth estimation from size estimation, a highly correlated pair. Directly regressing the two together can be ill-posed, as pointed out by Deep3Dbox.

Overall, the idea of the paper is good, but the implementation is not optimal.

Key ideas

Technical details