CVT: Cross-view Transformers for real-time Map-view Semantic Segmentation

July 2022

tl;dr: Dense prediction version of PETR.

Overall impression

The paper proposes 3D positional embeddings that help the cross-attention module learn the correspondence between the image view and the BEV view. The idea closely resembles the PETR series. Unlike PETR, CVT does not use sparse object queries for end-to-end object detection. Instead, CVT uses cross attention and 3D PE to transform perspective features into dense BEV features, and attaches a conventional segmentation head to the BEV features, in line with the map-view semantic segmentation task in the title.
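The view transformation described above can be sketched as plain scaled dot-product cross attention: each BEV cell is a query, each image location contributes a key (feature plus positional embedding) and a value (feature). This is a minimal single-head sketch for illustration only, not the paper's implementation. The function names, the additive key construction, and the toy inputs are assumptions, and learned projections are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(bev_queries, image_feats, image_pos):
    """Aggregate image features into BEV cells (hypothetical helper).

    bev_queries: one D-dim query per BEV cell
    image_feats: one D-dim feature per image location (all cameras flattened)
    image_pos:   one D-dim positional embedding per image location
    """
    dim = len(bev_queries[0])
    scale = 1.0 / math.sqrt(dim)
    # Assumption: keys are feature + positional embedding; values are raw features.
    keys = [[f + p for f, p in zip(feat, pos)]
            for feat, pos in zip(image_feats, image_pos)]
    bev_feats = []
    for q in bev_queries:
        weights = softmax([scale * dot(q, k) for k in keys])
        bev_feats.append([
            sum(w * feat[i] for w, feat in zip(weights, image_feats))
            for i in range(dim)
        ])
    return bev_feats


# Toy example: two BEV cells, two image locations, D = 2.
bev = cross_attention(
    bev_queries=[[10.0, 0.0], [0.0, 10.0]],
    image_feats=[[1.0, 0.0], [0.0, 1.0]],
    image_pos=[[0.0, 0.0], [0.0, 0.0]],
)
```

In this toy setup each BEV cell attends almost entirely to the image location whose key aligns with its query. A dense prediction head would then run on the resulting BEV feature grid.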

Each camera uses positional embeddings that depend on its intrinsic and extrinsic calibration.
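One way to make an embedding depend on calibration is to start from the viewing direction of each pixel: unproject the pixel with the inverse intrinsics, then rotate the ray into a shared ego frame with the extrinsic rotation. The sketch below assumes a skew-free pinhole camera and a camera-to-ego rotation matrix. The function name, parameter names, and example calibration values are hypothetical, and the learned layer that would turn this direction into an embedding is left out.

```python
import math


def pixel_direction(u, v, fx, fy, cx, cy, cam_to_ego_rot):
    """Unit viewing direction of pixel (u, v) in the ego frame (hypothetical helper)."""
    # K^-1 [u, v, 1]^T for a pinhole camera without skew.
    ray_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate the ray from the camera frame into the ego frame.
    ray_ego = [
        sum(cam_to_ego_rot[r][c] * ray_cam[c] for c in range(3))
        for r in range(3)
    ]
    norm = math.sqrt(sum(x * x for x in ray_ego))
    return [x / norm for x in ray_ego]


# Assumed front camera: camera axes (x right, y down, z forward)
# mapped to ego axes (x forward, y left, z up).
FRONT_CAM_ROT = [
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
    [0.0, -1.0, 0.0],
]

# Pixel at the principal point looks straight ahead in the ego frame.
center = pixel_direction(640, 360, 1000.0, 1000.0, 640.0, 360.0, FRONT_CAM_ROT)
# Pixel to the right of the principal point looks ahead and to the right.
right = pixel_direction(1640, 360, 1000.0, 1000.0, 640.0, 360.0, FRONT_CAM_ROT)
```

Because the direction is expressed in a common frame, pixels from different cameras that look toward the same region of the scene get similar inputs to the embedding, which is what lets attention relate image locations to map locations without an explicit depth estimate.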

The performance is roughly on par with FIERY static, but the model is much simpler, easier to train, and easier to deploy. This suggests that CVT combines features across views in a more efficient manner.

Key ideas

Technical details