Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D

September 2020

tl;dr: Predict depth distribution of each pixel for differentiable rendering of a BEV map.

Overall impression

The paper builds on quite a few previous works, such as OFT, PyrOccNet, MonoLayout and pseudo-lidar.

It proposes probabilistic 3D lifting by predicting a depth distribution for each pixel in the RGB image. In a way, it unifies the one-hot lifting of pseudo-lidar and the uniform lifting of OFT. This is a trick commonly used in differentiable rendering. In fact, Pseudo-Lidar v3 also uses this soft rasterization trick to make depth lifting and projection differentiable.
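To make the "unified lifting" remark concrete, here is a minimal sketch of lifting a single pixel, assuming the lifted feature at each depth bin is the pixel's context vector scaled by the probability of that bin (an outer product). The function names, the four depth bins and the toy feature values are illustrative assumptions, not the paper's code. A one-hot distribution gives pseudo-lidar-style lifting, a uniform one gives OFT-style lifting, and a softmax sits in between.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of depth logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(depth_probs, context):
    """Outer product of a depth distribution and a context vector.

    Returns one feature vector per depth bin, each being the context
    vector scaled by the probability mass assigned to that bin.
    """
    return [[p * c for c in context] for p in depth_probs]


# Toy inputs (hypothetical values): a 3-channel context feature, 4 depth bins.
context = [1.0, -2.0, 0.5]
num_bins = 4

soft = softmax([0.1, 2.0, 0.3, -1.0])    # predicted depth distribution
one_hot = [0.0, 1.0, 0.0, 0.0]           # pseudo-lidar style: a single depth
uniform = [1.0 / num_bins] * num_bins    # OFT style: smear along the ray

soft_frustum = lift_pixel(soft, context)
one_hot_frustum = lift_pixel(one_hot, context)
uniform_frustum = lift_pixel(uniform, context)
```

Because every operation here is a product or a softmax, gradients can flow back to the depth logits, which is the point of the soft formulation.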

The semantic BEV map prediction needs to fuse predictions from all cameras into a single cohesive representation of the scene. This is full representation learning of the entire 360° scene local to the ego vehicle, conditioned exclusively on camera input. The ultimate goal of BEV map prediction is to learn a dense representation for motion planning.

Fishing Net uses BEV grid resolutions of 10 cm and 20 cm per pixel; Lift Splat Shoot uses 50 cm per pixel. Both are coarser than the typical 4 cm or 5 cm per pixel resolution used for mapping purposes, as in DAGMapper.
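As a rough illustration of what these resolutions imply, the sketch below counts grid cells for a square BEV region. The 100 m side length is an assumption chosen for illustration only; the resolutions are the ones quoted above. Lengths are kept in integer centimetres to avoid floating-point rounding.

```python
# Side length of the square BEV region, in centimetres.
# Assumption for illustration: 100 m x 100 m around the ego vehicle.
EXTENT_CM = 100 * 100

# Resolutions quoted in the notes, in centimetres per pixel.
resolutions_cm = [50, 20, 10, 5, 4]


def cells_per_side(extent_cm, res_cm):
    """Number of grid cells along one side of the BEV map."""
    return extent_cm // res_cm


grid = {res: cells_per_side(EXTENT_CM, res) for res in resolutions_cm}
total_cells = {res: n * n for res, n in grid.items()}

for res in resolutions_cm:
    print(f"{res:>2} cm/pixel: {grid[res]} x {grid[res]} = {total_cells[res]} cells")
```

The cell count grows quadratically as the resolution gets finer, which is one reason perception-oriented BEV grids tend to be much coarser than mapping-grade ones.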

Key ideas

Technical details