Learning-Deep-Learning

MonoLayout: Amodal scene layout from a single image

June 2020

tl;dr: Predict BEV semantic maps from monocular images.

Overall impression

This is very similar to PyrOccNet.

MonoLayout generates its own training ground truth by aggregating observations across video frames (so-called temporal sensor fusion). HD map GT is used only for evaluation.

The authors also list tricks that did not work. I think this should become standard practice in future papers.

PYVA takes the discriminator-based adversarial training one step further to exploit the prior relating vehicle layout and road layout.

Key ideas

View transformation: VAE-like, the latent feature is called “shared context”
Dynamic layout and static layout are decoupled.
- Dynamic layout: this is more related to mono 3D MOD.
  - Instance label
- Static layout is more related to what Tesla is doing.
- The network predicts the static and dynamic layout of a region whether or not it is covered by the camera (hence "amodal"). This is quite different from PyrOccNet, where occluded points are masked.
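To make the amodal vs. masked distinction concrete, here is a minimal sketch of a per-cell loss with an optional visibility mask. It assumes the masking is applied in the loss and uses plain binary cross-entropy; neither detail is taken from the papers, and `layout_loss` and `visible` are hypothetical names.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one cell: p is the predicted probability, y is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def layout_loss(pred, target, visible=None):
    """Mean BCE over BEV cells, given as flat lists of equal length.

    visible=None  -> amodal supervision: every cell contributes.
    visible=[...] -> masked supervision: only cells flagged True contribute.
    """
    if visible is None:
        visible = [True] * len(pred)
    terms = [bce(p, y) for p, y, v in zip(pred, target, visible) if v]
    return sum(terms) / len(terms) if terms else 0.0


pred = [0.9, 0.1, 0.5]            # third cell: the network is unsure
target = [1, 0, 1]
visible = [True, True, False]     # third cell is occluded

amodal_loss = layout_loss(pred, target)           # occluded cell is penalized
masked_loss = layout_loss(pred, target, visible)  # occluded cell is ignored
```

With the mask, an uncertain prediction in the occluded cell costs nothing; without it, the network is pushed to hallucinate a plausible layout there.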
Architecture
- One encoder, two decoders (dynamic + static)
  - The learned representation must implicitly disentangle the static parts and dynamic objects.
- Patch-based discriminators
  - Plausible road geometries are extracted from an unpaired database of OpenStreetMap.
Generating training data via temporal sensor fusion
- Use monodepth2 or lidar to lift RGB to point cloud.
- With odometry info, aggregate and register the scene observations over time to generate a denser, less noisy point cloud.
- When using monodepth2, discard anything farther than 5 m from the ego car, as those depths could be noisy.
- Aggregate 40-50 frames.
- Use GT or predicted semantic labels and aggregate into occupancy grid by majority voting.
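The aggregation steps above can be sketched roughly as follows. This is a simplified 2D illustration, not the authors' pipeline: the pose format `(tx, tz, yaw)`, the grid layout (ego at bottom-centre, x to the right, z forward), and all function names are assumptions. Only the 40 m extent and 128-cell grid come from these notes.

```python
import math
from collections import Counter, defaultdict

GRID_SIZE = 128                  # cells per side (from the notes)
EXTENT_M = 40.0                  # metres per side (from the notes)
CELL_M = EXTENT_M / GRID_SIZE    # metres per cell


def to_reference_frame(point, pose):
    """Move a ground-plane point (x, z) from a frame's ego coordinates
    into the reference frame. pose = (tx, tz, yaw) is hypothetical 2D odometry."""
    x, z = point
    tx, tz, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x - s * z + tx, s * x + c * z + tz)


def to_cell(x, z):
    """Map a metric point to (row, col), or None if it falls outside the grid."""
    col = math.floor((x + EXTENT_M / 2) / CELL_M)
    row = math.floor(z / CELL_M)
    if 0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE:
        return row, col
    return None


def aggregate(frames, max_range_m=None):
    """frames: iterable of (pose, [(x, z, label), ...]).
    Returns {cell: majority label} over all registered points."""
    votes = defaultdict(Counter)
    for pose, points in frames:
        for x, z, label in points:
            # Optionally drop far points (e.g. unreliable monocular depth)
            # before registering them.
            if max_range_m is not None and math.hypot(x, z) > max_range_m:
                continue
            cell = to_cell(*to_reference_frame((x, z), pose))
            if cell is not None:
                votes[cell][label] += 1
    return {cell: counts.most_common(1)[0][0] for cell, counts in votes.items()}


frames = [
    ((0.0, 0.0, 0.0), [(0.1, 1.0, "road")]),
    ((0.0, 1.0, 0.0), [(0.1, 0.0, "road")]),      # same spot, seen 1 m later
    ((0.0, 0.0, 0.0), [(0.1, 1.0, "sidewalk")]),  # a noisy label
]
layout = aggregate(frames)
```

Here two of the three observations of the same cell say "road", so the majority vote suppresses the noisy "sidewalk" label.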
Compared with pseudo-lidar, MonoLayout achieves equal or better results and runs much faster.
This work can easily be extended into a behavior predictor.

Technical details

The BEV layout covers 40 x 40 m on a 128 x 128 grid (about 0.31 m per cell).
Real-time: 30 Hz on a GTX 1080 Ti.
Argoverse contains high-res semantic occupancy grid in BEV.
Things the authors tried but did not work
- Using a single decoder to decode both dynamic and static layout.
Drawbacks: shadows can mislead the network into predicting protrusions along the shadow direction.
