
TPVFormer: Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

February 2023

tl;dr: Academic alternative to Tesla’s Occupancy Network, by lifting BEVFormer to 3D.

Overall impression

The paper claims that the model is trained with sparse supervision but can predict more consistent and comprehensive volume occupancy for all voxels at inference time. The prediction is indeed denser than the sparse annotation, but still not truly dense compared to SurroundOcc.
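
To make the sparse-training / dense-inference setup concrete, here is a rough sketch (not the authors' code) assuming a hypothetical model(points) that returns per-point semantic logits by sampling the TPV planes: training computes a loss only at lidar-labeled points, while inference queries every voxel center.

```python
import torch
import torch.nn.functional as F

# Hypothetical per-point semantic head on top of TPV features (illustrative
# only): model(points) -> (B, N, num_classes) logits for query points.

def train_step(model, optimizer, lidar_points, lidar_labels):
    """Sparse supervision: the loss only covers locations where a
    single-frame lidar point provides a semantic label."""
    logits = model(lidar_points)                        # (B, N, num_classes)
    loss = F.cross_entropy(logits.flatten(0, 1), lidar_labels.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict_dense_occupancy(model, pc_range, voxel_size):
    """Dense inference: query the semantic class of every voxel center,
    even though training never saw labels for most of them."""
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    xs = torch.arange(x_min, x_max, voxel_size) + voxel_size / 2
    ys = torch.arange(y_min, y_max, voxel_size) + voxel_size / 2
    zs = torch.arange(z_min, z_max, voxel_size) + voxel_size / 2
    grid = torch.stack(torch.meshgrid(xs, ys, zs, indexing="ij"), dim=-1)
    points = grid.reshape(1, -1, 3)                     # (1, X*Y*Z, 3)
    logits = model(points)                              # (1, X*Y*Z, num_classes)
    return logits.argmax(-1).reshape(grid.shape[:3])    # (X, Y, Z) class ids
```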

TPV extends the idea of BEV to three orthogonal planes, and thus models 3D without suppressing any axis while avoiding the cubic complexity of a full voxel grid.
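
As an unofficial sketch of how the TPV representation is queried: each 3D point is projected onto the three orthogonal planes, a feature is bilinearly sampled from each plane, and the three features are fused into one per-point feature. The plane shapes, argument names, and the sum fusion below are my assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def tpv_point_features(tpv_hw, tpv_zh, tpv_wz, points, pc_range):
    """Aggregate a feature for each 3D point from three orthogonal TPV planes.

    tpv_hw: (B, C, H, W)  top (BEV) plane
    tpv_zh: (B, C, Z, H)  side plane
    tpv_wz: (B, C, W, Z)  front plane
    points: (B, N, 3)     query points (x, y, z) in meters
    pc_range: (x_min, y_min, z_min, x_max, y_max, z_max)
    """
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    # Normalize each coordinate to [-1, 1] for grid_sample.
    x = 2 * (points[..., 0] - x_min) / (x_max - x_min) - 1
    y = 2 * (points[..., 1] - y_min) / (y_max - y_min) - 1
    z = 2 * (points[..., 2] - z_min) / (z_max - z_min) - 1

    def sample(plane, u, v):
        # grid_sample expects a grid of shape (B, N, 1, 2) in (x, y) order.
        grid = torch.stack([u, v], dim=-1).unsqueeze(2)
        feat = F.grid_sample(plane, grid, align_corners=False)  # (B, C, N, 1)
        return feat.squeeze(-1).transpose(1, 2)                 # (B, N, C)

    # Each point is projected onto the three planes (each projection drops
    # one axis); the three sampled features are summed.
    f_hw = sample(tpv_hw, x, y)   # drops z
    f_zh = sample(tpv_zh, y, z)   # drops x
    f_wz = sample(tpv_wz, z, x)   # drops y
    return f_hw + f_zh + f_wz     # (B, N, C) per-point feature
```

Because the planes can be sampled at arbitrary continuous coordinates, the same representation serves both sparse lidar-point queries during training and dense voxel-center queries at inference.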

The architecture is innovative, but the performance suffers from sparse annotation. SurroundOcc showed that with dense annotation, the performance can be boosted 3x.

The paper tries to differentiate semantic occupancy prediction (SOP) from semantic scene completion (SSC). According to this paper, SOP uses sparse semantic supervision from a single-frame lidar point cloud, while SSC is supervised with dense voxel labels. However, later works showed that denser annotation can be obtained (via Poisson reconstruction in SurroundOcc or Augmenting And Purifying (AAP) in OpenOccupancy), making the two tasks quite similar.

Key ideas

Technical details

Notes