MonoLayout: Amodal scene layout from a single image

June 2020

tl;dr: Predict BEV semantic maps from monocular images.

Overall impression

This is very similar to PyrOccNet.

MonoLayout generates its own ground truth by aggregating results across the frames of a video (so-called temporal sensor fusion). HD map GT is used only for evaluation.
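To make the aggregation idea concrete, here is a minimal sketch of my own, not the authors' pipeline. It assumes each frame provides 2D road points in its own ego frame plus a known planar ego pose `(x, y, yaw)`; the function names `to_reference_frame` and `aggregate_bev`, the grid size, and the cell size are all hypothetical choices for illustration.

```python
import math


def to_reference_frame(points, pose, ref_pose):
    """Map 2D points from one ego frame into the reference ego frame.

    Poses are (x, y, yaw) in a shared world frame (an assumption of this sketch).
    """
    x, y, yaw = pose
    rx, ry, ryaw = ref_pose
    out = []
    for px, py in points:
        # Ego frame -> world frame.
        wx = x + math.cos(yaw) * px - math.sin(yaw) * py
        wy = y + math.sin(yaw) * px + math.cos(yaw) * py
        # World frame -> reference ego frame (inverse rigid transform).
        dx, dy = wx - rx, wy - ry
        out.append((math.cos(ryaw) * dx + math.sin(ryaw) * dy,
                    -math.sin(ryaw) * dx + math.cos(ryaw) * dy))
    return out


def aggregate_bev(frames, ref_pose, size=8, cell=1.0):
    """Rasterize points from all frames into one size x size BEV grid.

    Rows index forward distance, columns index lateral offset
    (centered on the reference vehicle). A cell is 1 if any frame hit it.
    """
    grid = [[0] * size for _ in range(size)]
    half = size * cell / 2
    for points, pose in frames:
        for fx, fy in to_reference_frame(points, pose, ref_pose):
            row = math.floor(fx / cell)
            col = math.floor((fy + half) / cell)
            if 0 <= row < size and 0 <= col < size:
                grid[row][col] = 1
    return grid


# Two frames, each seeing one road point 2.5 m ahead; the second frame
# was captured 3 m further along, so its point lands further out in the grid.
frames = [
    ([(2.5, 0.5)], (0.0, 0.0, 0.0)),
    ([(2.5, 0.5)], (3.0, 0.0, 0.0)),
]
grid = aggregate_bev(frames, ref_pose=(0.0, 0.0, 0.0))
```

The point of the sketch is that a later frame contributes cells the reference frame did not observe on its own, so the accumulated grid is denser than any single-frame projection.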

The authors also list tricks that did not work. I think this should become standard practice in future papers!

PYVA takes the discriminator-based adversarial training one step further, exploiting the useful prior that relates vehicle layout to road layout.

Key ideas

Technical details