ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape

August 2019

tl;dr: Concat depth map and coord map to RGB features, plus 2DOD and car shape reconstruction (6D latent space), for mono 3DOD.

Overall impression

Like MLF, this paper only concatenates depth to the RGB features, which makes the performance sub-optimal.

Surprisingly, the shape can be approximated pretty well even with a 1D latent code (a scaling factor). The paper chooses 6D to capture more detail. However, this work still concatenates depth to RGB features instead of lifting RGB into a point cloud, which is clearly inferior to other SOTA methods such as pseudo-lidar and pseudo-lidar++.
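To make the contrast between the two input representations concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function names, the normalization of the coordinate maps to [-1, 1], and the pinhole intrinsics `fx, fy, cx, cy` are all assumptions made for illustration. The first function stacks depth and pixel-coordinate channels onto a feature map (the concat approach), and the second back-projects the same depth map into a camera-frame point cloud (the pseudo-lidar style lifting).

```python
import numpy as np


def concat_depth_and_coords(features, depth):
    """Channel-wise concat. features: (C, H, W), depth: (H, W) -> (C + 3, H, W)."""
    _, h, w = features.shape
    # Pixel-coordinate maps normalized to [-1, 1] (an assumed convention).
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, h), np.linspace(-1.0, 1.0, w), indexing="ij"
    )
    return np.concatenate([features, depth[None], xs[None], ys[None]], axis=0)


def backproject_to_points(depth, fx, fy, cx, cy):
    """Pinhole back-projection. depth: (H, W) metric depth -> (H * W, 3) points."""
    h, w = depth.shape
    # vs holds row indices, us holds column indices, both of shape (H, W).
    vs, us = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (us - cx) * depth / fx
    y = (vs - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)


features = np.zeros((8, 4, 6))   # hypothetical RGB feature map
depth = np.full((4, 6), 2.0)     # hypothetical depth map, 2 m everywhere
fused = concat_depth_and_coords(features, depth)
points = backproject_to_points(depth, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
```

In the concat form, depth stays on the image grid as one more channel, so a 2D convolution treats neighboring pixels as neighbors even when their depths differ greatly. The back-projected points carry explicit metric 3D coordinates, which is the representation pseudo-lidar methods operate on.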

The idea of compressing shapes is also found in Mask Encoding Instance Segmentation.

The paper argues that depth has to be reasoned about globally.

Key ideas

Technical details