OFT: Orthographic Feature Transform for Monocular 3D Object Detection

July 2019

tl;dr: Learn a projection of camera image to BEV for 3D object detection.

Overall impression

This paper is brilliant! It combines several key innovations from the past year: camera-to-BEV projection (similar to pseudo-lidar) and anchor-free object detection (similar to CenterNet).

However, performing the reprojection without depth estimation perhaps limits the performance of the model, which falls significantly below that of MLF and pseudo-lidar. For simple depth estimation and 3D reasoning using a 2D bbox and camera geometry, refer to Deep3dBox and MonoPSR.

The OFT transformation network inspired later work such as PyrOccNet for monocular BEV semantic map prediction and MonoScene for semantic scene completion (although not cited explicitly).

The network does not require explicit info about the intrinsics, but instead learns a constant mapping. That is why extensive augmentation was required during training. –> why not inject the intrinsics implicitly?
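The core of the learned image-to-BEV mapping can be illustrated with a minimal sketch. The snippet below is an assumption-laden simplification, not the paper's implementation: it projects 3D voxel centers into the image with a hypothetical intrinsics matrix `K` and gathers nearest-pixel features, whereas the actual OFT pools features over each voxel's projected extent using integral images, then collapses the vertical axis to form BEV features.

```python
import numpy as np

def oft_gather(feat, K, voxel_centers):
    """Simplified orthographic feature transform (sketch).

    feat:          (H, W, C) image feature map
    K:             (3, 3) camera intrinsics (assumed known here;
                   the paper lets the network learn this mapping)
    voxel_centers: (N, 3) voxel centers in camera coordinates, Z > 0

    Returns (N, C) features gathered at the projected pixel of each
    voxel center (nearest neighbor; the paper uses average pooling
    over the voxel's projected bounding box via integral images).
    """
    H, W, _ = feat.shape
    uvw = (K @ voxel_centers.T).T          # homogeneous pixel coords
    uv = uvw[:, :2] / uvw[:, 2:3]          # perspective divide
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    return feat[v, u]                      # (N, C)

# Toy usage: a vertical column of voxels at a fixed (x, z) ground
# location; summing its features collapses height into one BEV cell.
feat = np.random.rand(48, 64, 8)
K = np.array([[60.0, 0.0, 32.0],
              [0.0, 60.0, 24.0],
              [0.0, 0.0, 1.0]])
ys = np.linspace(-1.0, 1.0, 5)             # heights in the column
column = np.stack([np.full(5, 2.0), ys, np.full(5, 10.0)], axis=1)
bev_cell = oft_gather(feat, K, column).sum(axis=0)  # (8,) BEV feature
```

Because the mapping from pixels to BEV cells is fixed for a given camera, the network can memorize it, which is consistent with the paper needing heavy augmentation when intrinsics are not provided.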

Key ideas

Technical details