How Do Neural Networks See Depth in Single Images?

December 2019

tl;dr: Probes monocular depth estimators as black boxes to see how they react to changes in different geometric cues.

Overall impression

The paper performs the missing “ablation study” for monocular depth estimators. It finds that all of the depth estimators examined use the vertical position of an object in the image as a depth cue.

Video-based methods such as SLAM or SfM tend to treat depth estimation as a purely geometric problem, ignoring the contents of the image.

The depth estimation networks learn to find where an object touches the ground and then fill in the depth within the object contour. They use a dark, thick edge (such as the shadow below a car) to detect the region where the object touches the ground.
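
To make the vertical-position cue concrete, here is a minimal sketch of the textbook flat-ground geometry. It assumes a pinhole camera with a horizontal optical axis mounted at a known height above a flat road. Those are assumptions of this note, not something the networks are shown to compute explicitly. Under them, a ground-contact point at image row y has depth Z = f * h / (y - y_horizon). The function name, parameter names, and numbers below are hypothetical and chosen only for illustration.

```python
def depth_from_ground_contact(y_contact, y_horizon, focal_px, cam_height_m):
    """Depth in meters of a ground-contact point from its image row.

    Assumes a pinhole camera with a horizontal optical axis above a flat
    ground plane (illustrative assumptions, not the paper's method).
    Image rows increase downward, so points on the ground lie below the
    horizon row, i.e. y_contact > y_horizon.
    """
    dy = y_contact - y_horizon
    if dy <= 0:
        raise ValueError("ground-contact row must lie below the horizon row")
    return focal_px * cam_height_m / dy


if __name__ == "__main__":
    # Hypothetical camera: 720 px focal length, 1.65 m above the road,
    # horizon at image row 180.
    for row in (240, 300, 360):
        z = depth_from_ground_contact(row, 180, 720.0, 1.65)
        print(f"contact row {row}: depth {z:.2f} m")
```

Moving the contact point lower in the image (a larger row index) gives a smaller depth. This is the kind of relation a network can exploit once it has located where the object meets the ground.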

Key ideas

Technical details