VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion

March 2023

tl;dr: class-agnostic query proposal, followed by class-specific semantic segmentation.

Overall impression

Key intuition: visual features in 2D images correspond only to visible scene structures, not to occluded or empty space. VoxFormer therefore uses bottom-up depth estimation as the scaffold for 3D scene understanding.

SSC (semantic scene completion) has to address two issues simultaneously: scene reconstruction for visible areas and scene hallucination for occluded regions.

Real-world sensing (cf perception) in 3D is inherently sparse and incomplete. For holistic semantic understanding, it is not enough to parse only the sparse measurements while ignoring the unobserved scene structures.

The paper first performs depth estimation with monodepth methods, lifts the depth map to a pseudo-lidar point cloud, then voxelizes the points into initial query proposals. These sparse queries, coupled with learned masks, are then densified into a full prediction via self-attention.
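The depth-to-query step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names, the pinhole intrinsics `K`, and the choice to define the voxel grid directly in the camera frame are all assumptions made for the example.

```python
import numpy as np

def depth_to_points(depth, K):
    """Back-project an (H, W) metric depth map into (N, 3) camera-frame points.

    Assumes a pinhole camera with intrinsics K (hypothetical values below).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row indices
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop pixels without valid depth

def voxelize(points, origin, voxel_size, grid_shape):
    """Return a boolean grid that is True where at least one point falls."""
    idx = np.floor((points - origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    occ = np.zeros(grid_shape, dtype=bool)
    occ[tuple(idx[inside].T)] = True
    return occ

def query_proposals(occ):
    """Indices of occupied voxels; these play the role of the sparse queries."""
    return np.argwhere(occ)

# Toy usage: a 2x2 depth map, every pixel 2 m away (illustrative numbers only).
K = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])
depth = np.full((2, 2), 2.0)
points = depth_to_points(depth, K)
occ = voxelize(points, origin=np.array([-2.0, -2.0, 0.0]),
               voxel_size=1.0, grid_shape=(4, 4, 4))
queries = query_proposals(occ)
```

Only the voxels hit by a back-projected point become queries, so the number of queries scales with the visible surface rather than with the full grid volume.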

Why occupancy? Predicting occupancy per cell, instead of assigning a fixed-size bounding box to each object, helps identify irregularly shaped objects, such as one with an overhanging obstacle.
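A toy example of the overhang case, as a rough sketch: the grid below is a made-up 2D side view (rows are height, top row first) of an object whose load overhangs its body. The helper `bounding_box_mask` is hypothetical and simply fills the tight axis-aligned box around the occupied cells.

```python
import numpy as np

# Hypothetical side view: 3 cells tall, 5 cells long.
occ = np.zeros((3, 5), dtype=bool)
occ[0, :] = True     # overhanging load spans the full length
occ[1:, :2] = True   # body only covers the first two columns

def bounding_box_mask(occ):
    """Fill the tight axis-aligned box enclosing all occupied cells."""
    idx = np.argwhere(occ)
    lo, hi = idx.min(axis=0), idx.max(axis=0) + 1
    box = np.zeros_like(occ)
    box[tuple(slice(a, b) for a, b in zip(lo, hi))] = True
    return box

box = bounding_box_mask(occ)
# Cells the box claims as occupied although they are actually free
# (the space underneath the overhang).
falsely_occupied = int((box & ~occ).sum())
```

The per-cell grid keeps the free space under the overhang free, while the box representation marks it as part of the object.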

Key ideas

Technical details