BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

June 2022

tl;dr: One-stage, multi-task BEV perception and prediction.

Overall impression

This paper combines recent progress in static BEV perception (such as HDMapNet), dynamic BEV perception (BEVDet, BEVFormer) and motion prediction (FIERY). The motion prediction part largely inherits the spirit of FIERY.

This paper claims to be the first to perform joint perception and motion prediction, but that distinction should actually go to FIERY. What BEVerse adds on top of FIERY is static perception. The joint perception and prediction idea has also been explored in the lidar perception field, for example in FAF.

The paper’s major contribution seems to be the iterative flow for efficient future prediction. Yet recurrent roll-out of future predictions is unfriendly to real-time performance in production. A transformer-based method that predicts multimodal future waypoints all at once may be the way to go.
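To make the latency concern concrete, here is a toy sketch (not BEVerse's iterative flow, and not a transformer) contrasting a recurrent roll-out with a one-shot prediction. The names `rollout_recurrent`, `predict_one_shot`, `step_fn` and `weights` are hypothetical, and the linear dynamics are an assumption chosen only so that both paths give the same answer.

```python
import numpy as np

def rollout_recurrent(state, step_fn, horizon):
    """Predict `horizon` future states one after another."""
    futures = []
    for _ in range(horizon):
        # Step t+1 cannot start before step t has finished.
        state = step_fn(state)
        futures.append(state)
    return np.stack(futures)

def predict_one_shot(state, weights):
    """Predict every future state in a single batched operation.

    weights has shape (horizon, dim, dim): one matrix per future step.
    """
    return np.einsum("tij,j->ti", weights, state)

# Toy linear dynamics (an assumption for illustration only).
dim, horizon = 4, 3
A = 0.5 * np.eye(dim)
state0 = np.ones(dim)

recurrent = rollout_recurrent(state0, lambda s: A @ s, horizon)
# Pre-composed per-step matrices A, A^2, A^3 for the one-shot path.
weights = np.stack([np.linalg.matrix_power(A, k) for k in range(1, horizon + 1)])
one_shot = predict_one_shot(state0, weights)
```

Both paths return an array of shape `(horizon, dim)`. The difference is that the recurrent loop has a sequential dependency of length `horizon`, while the one-shot version exposes all future steps to a single parallel call.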

Although BEVerse achieves the highest NDS, this mainly comes from improved velocity estimation (lower mAVE). Its mAP is actually worse than that of most BEV detection work (BEVDet, PETR). BEVDet4D achieves better object detection performance in both mAP and NDS.
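For reference, the nuScenes Detection Score weights mAP by 5 and adds one term per true-positive error (mATE, mASE, mAOE, mAVE, mAAE), each clipped at 1, so a lower mAVE raises NDS even when mAP does not move. A minimal sketch of that formula follows; the function name `nds` and all numbers are made up for illustration and are not results from the paper.

```python
def nds(map_score, tp_errors):
    """nuScenes Detection Score from mAP and the five mean TP errors.

    tp_errors: dict with keys mATE, mASE, mAOE, mAVE, mAAE.
    """
    expected = {"mATE", "mASE", "mAOE", "mAVE", "mAAE"}
    if set(tp_errors) != expected:
        raise ValueError(f"expected keys {sorted(expected)}")
    # Each error is clipped to 1 and converted to a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * map_score + sum(tp_scores)) / 10.0

# Hypothetical numbers, NOT taken from any paper.
baseline = {"mATE": 0.7, "mASE": 0.3, "mAOE": 0.5, "mAVE": 1.0, "mAAE": 0.2}
better_velocity = dict(baseline, mAVE=0.3)

nds_baseline = nds(0.30, baseline)
nds_better_velocity = nds(0.30, better_velocity)
gain = nds_better_velocity - nds_baseline
```

With mAP held fixed, every 0.1 reduction in mAVE (below the cap of 1.0) adds 0.01 to NDS, which is how a detector with weaker mAP can still come out ahead on NDS.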

This paper reminds me of BEVDet: it shows great engineering skill and gets strong results from existing technical components, but the writing of the manuscript leaves much to be desired.

Key ideas

Technical details