BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection

July 2022

tl;dr: LSS (Lift, Splat, Shoot) with explicit depth supervision and Efficient Voxel Pooling.

Overall impression

In LSS, depth estimation is learned implicitly, without camera information. Its accuracy is surprisingly poor: the predicted-vs-ground-truth depth curve shows very weak correlation. Replacing the predicted depth with ground-truth depth yields a large improvement, which indicates that the quality of the intermediate depth is the key to improving multi-view 3D object detection.
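
To make the role of the intermediate depth concrete, here is a minimal per-pixel sketch of the LSS "lift" step with an explicit depth loss added on top. It is an illustration under stated assumptions, not the paper's implementation: the function names, the uniform depth bins, and the plain cross-entropy loss against a one-hot ground-truth bin are all choices made for this note.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of depth-bin logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lift_pixel(depth_logits, context):
    """Lift one pixel: outer product of D depth probabilities and C context
    channels, giving D x C frustum features (hypothetical helper)."""
    probs = softmax(depth_logits)
    frustum = [[p * c for c in context] for p in probs]
    return probs, frustum

def depth_to_bin(depth, d_min, d_max, num_bins):
    """Map a metric depth to a uniform bin index (assumed discretization).
    Returns None when the pixel has no usable ground truth."""
    if not (d_min <= depth < d_max):
        return None
    idx = int((depth - d_min) / (d_max - d_min) * num_bins)
    return min(idx, num_bins - 1)

def depth_loss(probs_per_pixel, gt_bins, eps=1e-8):
    """Mean cross-entropy over pixels that have a ground-truth bin
    (assumed loss; pixels with gt bin None are skipped)."""
    terms = [-math.log(p[b] + eps)
             for p, b in zip(probs_per_pixel, gt_bins) if b is not None]
    return sum(terms) / len(terms) if terms else 0.0

# Usage: two pixels, 4 depth bins covering [2 m, 6 m), 3 context channels.
probs_a, frustum_a = lift_pixel([0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
probs_b, frustum_b = lift_pixel([5.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
gt_bins = [depth_to_bin(3.5, 2.0, 6.0, 4), depth_to_bin(0.0, 2.0, 6.0, 4)]
loss = depth_loss([probs_a, probs_b], gt_bins)
```

Without the loss term, the depth distribution only receives gradient through the detection objective, which is the implicit setup described above; the extra term pulls the distribution toward the measured depth directly.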

The idea is relatively simple, but the engineering is executed beautifully and backed by in-depth analysis.

Key ideas

Technical details