Learning-Deep-Learning

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

March 2022

tl;dr: Spatiotemporal transformer for BEV perception of both dynamic and static elements.

Overall impression

BEVFormer exploits both spatial and temporal information through predefined grid-shaped BEV queries that interact with spatial and temporal features. It consists of a spatial cross-attention module and a temporal self-attention module (the latter is, strictly speaking, also a form of cross-attention, since it attends to the previous frame's BEV features).
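The spatial cross-attention can be pictured as follows: each grid-shaped BEV query corresponds to a pillar of 3D reference points, which are projected into the camera views to decide where image features are sampled. A minimal sketch of that geometry, with all shapes, ranges, and the camera projection matrix chosen purely for illustration (not the paper's actual configuration):

```python
import numpy as np

# Illustrative sizes (assumed): a 4x4 BEV grid, 3 height anchors per pillar,
# 8-dim query features, BEV covering [-10, 10] m in x and y.
H, W, Z, C = 4, 4, 3, 8
bev_range = 10.0

# Grid-shaped BEV queries: one learnable query vector per BEV cell.
bev_queries = np.random.randn(H * W, C)

# Lift each BEV cell to a pillar of Z 3D reference points.
xs = np.linspace(-bev_range, bev_range, W)
ys = np.linspace(-bev_range, bev_range, H)
zs = np.linspace(-1.0, 3.0, Z)  # assumed height range in meters
grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1)  # (W, H, Z, 3)
ref_points = grid.reshape(-1, 3)  # (H*W*Z, 3)

# Project the reference points into one camera view with an assumed
# 3x4 projection matrix; in BEVFormer each point is projected into
# every camera whose frustum contains it.
P = np.array([[500.0, 0.0, 320.0, 0.0],
              [0.0, 500.0, 240.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
pts_h = np.concatenate([ref_points, np.ones((len(ref_points), 1))], axis=1)
uvw = pts_h @ P.T
valid = uvw[:, 2] > 1e-5  # keep only points in front of the camera
uv = uvw[valid, :2] / uvw[valid, 2:3]  # pixel locations to sample features at

print(bev_queries.shape, ref_points.shape, uv.shape)
```

In the actual model the sampled image features are aggregated with deformable attention around these projected locations, so each BEV query only attends to the camera regions its pillar hits.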

The paper argues that previous methods based on depth prediction (Lift Splat Shoot, CaDDN, Pseudo-LiDAR) are subject to compounding errors, and thus favors a direct method.

It significantly outperforms the previous SOTA DETR3D by about 9 points in NDS on the nuScenes dataset.

The introduction of the temporal module helps in three ways: 1) more accurate velocity estimation, 2) more stable location and orientation estimates, and 3) higher recall on heavily occluded objects.

Key ideas

Technical details

Notes