Learning-Deep-Learning

BEV-feat-stitching: Understanding Bird’s-Eye View Semantic HD-Maps Using an Onboard Monocular Camera

January 2021

tl;dr: predict BEV semantic maps from a single monocular video.

Overall impression

Previous SOTA methods PyrOccNet and Lift, Splat, Shoot study how to combine synchronized images from multiple cameras into a coherent 360-degree BEV map. BEV-feat-stitching instead tries to stitch a monocular video into a coherent BEV map. This process also requires knowing the camera pose sequence.

The mapping of the intermediate feature map resembles that of feature-metric mono depth and feature-metric distance in 3DSSD.

To be honest, the results do not look as clean as those of PyrOccNet. Future work may combine the two trends: temporal stitching of monocular video as in BEV-feat-stitching, and multi-camera fusion as in PyrOccNet.

This paper has a follow-up work, STSU, on structured BEV perception.

Key ideas

Takes in mono video as input
BEV temporal aggregation module
- Project the features to BEV space
- BEV aggregation (BEV feature stitching) with camera pose.
  - Aggregation is done in a unified BEV grid (extended BEV)
Intermediate feature supervision in camera space with reprojected BEV GT
- Objects are supervised on a single frame
- Static classes are supervised over multiple frames
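
The BEV aggregation step above can be sketched roughly as follows. This is a minimal NumPy sketch, not the paper's implementation: it assumes each frame already has a BEV feature map in its own ego frame, that poses are planar `(x, y, yaw)` in a shared world frame, and that overlapping cells are simply averaged. The function name `stitch_bev`, the grid layout (camera at the bottom-centre, rows pointing forward), and the nearest-cell scatter are all assumptions made for illustration.

```python
import numpy as np

def stitch_bev(features, poses, res=0.25, ext_size=400):
    """Average per-frame BEV features into one extended BEV grid.

    features: list of (C, H, W) arrays, one per frame, each in its own ego frame.
              Assumed layout: camera at the bottom-centre, row 0 is farthest ahead.
    poses:    list of (x, y, yaw) ego poses in a shared world frame (metres, radians).
    res:      metres per cell; ext_size: side length of the extended grid in cells.
    """
    C, H, W = features[0].shape
    acc = np.zeros((C, ext_size, ext_size))   # running feature sum
    cnt = np.zeros((ext_size, ext_size))      # number of frames hitting each cell

    # Cell centres of the per-frame grid, in ego coordinates (metres).
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    fwd = (H - 1 - rows + 0.5) * res
    left = (W / 2 - cols - 0.5) * res

    for feat, (x, y, yaw) in zip(features, poses):
        # Rigid 2D transform from ego frame to world frame.
        wx = x + np.cos(yaw) * fwd - np.sin(yaw) * left
        wy = y + np.sin(yaw) * fwd + np.cos(yaw) * left
        # World coordinates to extended-grid indices (grid centred on the world origin).
        gi = np.floor(wx / res).astype(int) + ext_size // 2
        gj = np.floor(wy / res).astype(int) + ext_size // 2
        ok = (gi >= 0) & (gi < ext_size) & (gj >= 0) & (gj < ext_size)
        # Unbuffered scatter-add so repeated target cells accumulate correctly.
        np.add.at(acc, (slice(None), gi[ok], gj[ok]), feat[:, ok])
        np.add.at(cnt, (gi[ok], gj[ok]), 1)

    fused = acc / np.maximum(cnt, 1)  # mean over frames; untouched cells stay 0
    return fused, cnt
```

A learned aggregation or bilinear warping could replace the plain average here; the average just keeps the sketch short.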

Technical details

BEV grid: 200 x 200 pixels at 0.25 m/pixel, covering 50 m x 50 m
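
As a quick sanity check on the grid numbers above, here is a small sketch that maps an ego-frame point in metres to a pixel index. The grid size and resolution come from the notes; the layout (ego at the bottom-centre, forward pointing up) and the helper name `ego_to_pixel` are assumptions for illustration only.

```python
import math

GRID_PX = 200                 # cells per side (from the notes)
RES_M = 0.25                  # metres per cell (from the notes)
EXTENT_M = GRID_PX * RES_M    # 200 * 0.25 = 50.0 metres per side

def ego_to_pixel(forward_m, left_m):
    """Map an ego-frame point (metres) to (row, col), or None if off the grid.

    Assumed layout: ego at the bottom-centre of the grid, forward = up.
    """
    row = GRID_PX - 1 - math.floor(forward_m / RES_M)
    col = math.floor(GRID_PX / 2 - left_m / RES_M)
    if 0 <= row < GRID_PX and 0 <= col < GRID_PX:
        return row, col
    return None
```

Under these assumptions the grid covers 0 to 50 m ahead and 25 m to each side.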
The addition of dynamic classes helps with the static classes.

Notes

The evaluation is still in mIoU, treating the problem as semantic segmentation. However, we should perhaps introduce the idea of instance segmentation for better prediction and planning.
Stitching may be noisy due to errors in extrinsics and pose estimation, and deep learning helps smooth this out.
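
For reference, the mIoU metric mentioned above can be sketched as below. This is a minimal version that assumes each BEV cell carries exactly one integer class label; benchmarks that use per-class binary masks would compute the IoU per mask instead. The function name `mean_iou` is hypothetical.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes for integer label maps."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps: skip rather than count as 0 or 1
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")
```

Because IoU is computed per class over all cells, two adjacent cars merged into one blob score the same as two separate ones, which is why an instance-level metric could be more informative for planning.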