PYVA: Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation

September 2021

tl;dr: Transformers to lift image to BEV.

Overall impression

This paper uses a cross-attention transformer structure (although the authors do not describe it in those terms) to lift image features to bird's-eye view (BEV), then performs road layout and vehicle segmentation on the BEV features.

It is difficult for a CNN to fit a view projection model because convolutional layers have locally confined receptive fields. Transformers are better suited to the job because attention is global: every output location can attend to every input location.
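To make the global-attention argument concrete, here is a minimal sketch of generic scaled dot-product cross-attention, in which each BEV query attends over all image positions. It is an illustration only, not the paper's exact module: the function name `cross_attention`, the toy feature values, and the use of plain Python lists instead of a deep learning framework are all assumptions made for this note.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(bev_queries, img_keys, img_values):
    """Generic cross-attention (hypothetical helper, not the paper's code).

    bev_queries: one query vector per BEV cell.
    img_keys / img_values: one key and one value vector per image position.
    Returns the attended BEV features and the attention weights.
    """
    scale = 1.0 / math.sqrt(len(img_keys[0]))
    bev_feats, attn = [], []
    for q in bev_queries:
        # Score this BEV cell against every image position, not just a
        # local neighbourhood as a convolution would.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in img_keys]
        w = softmax(scores)
        attn.append(w)
        # Weighted sum of image values gives the feature for this BEV cell.
        n_channels = len(img_values[0])
        bev_feats.append(
            [sum(wj * v[c] for wj, v in zip(w, img_values)) for c in range(n_channels)]
        )
    return bev_feats, attn


# Toy example: 2 BEV cells attending over 3 image positions.
bev_queries = [[1.0, 0.0], [0.0, 1.0]]
img_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
img_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bev_feats, attn = cross_attention(bev_queries, img_keys, img_values)
```

Each row of `attn` sums to 1 and spans all image positions, so a BEV cell can draw on image evidence from anywhere in the frame. A single convolutional layer cannot do this.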

Road layout provides crucial context for inferring the position and orientation of vehicles. The paper introduces a context-aware discriminator loss to refine the results.

Key ideas

Technical details