DeepV2D: Video to Depth with Differentiable Structure from Motion

July 2020

tl;dr: Video to depth with iterative motion and depth estimation.

Overall impression

The structure of the system seems a bit convoluted, as training and inference are quite different.

Self-supervised methods such as SfM-learner and Monodepth use geometric principles for training alone, and do not use multiple frames to predict depth at inference. Similar works include DeMoN, DeepTAM, and MVSNet.

This work also proposes to use geometric constraints, but instead of minimizing photometric error (as in LS-Net) or feature-metric error (as in BA-Net), the Flow-SE3 module minimizes the geometric reprojection error (difference in pixel location), which leads to a better-behaved optimization problem. –> the idea of residual flow is explored in GeoNet, but it still uses photometric error.
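To make the distinction concrete, here is a minimal sketch (not the paper's implementation; function names, intrinsics, and pose parameters are illustrative assumptions) of a geometric reprojection error: back-project a pixel with its estimated depth, transform it by the relative camera pose, project into the second frame, and measure the pixel-space distance to the correspondence predicted by a flow module.

```python
import numpy as np

def reproject(depth, K, R, t, u, v):
    """Project pixel (u, v) with estimated depth from frame 1 into frame 2.

    K is the 3x3 camera intrinsics; (R, t) is the relative pose
    taking frame-1 camera coordinates to frame-2 camera coordinates.
    """
    # Back-project the pixel to a 3D point in frame-1 camera coordinates
    p = np.linalg.inv(K) @ np.array([u, v, 1.0]) * depth
    # Transform into frame-2 coordinates and project with the intrinsics
    q = K @ (R @ p + t)
    # Perspective division back to pixel coordinates
    return q[:2] / q[2]

def geometric_reprojection_error(depth, K, R, t, u, v, u2, v2):
    """Pixel-location distance between the reprojection of (u, v)
    and the correspondence (u2, v2) predicted by a flow module."""
    proj = reproject(depth, K, R, t, u, v)
    return np.linalg.norm(proj - np.array([u2, v2]))
```

Unlike a photometric error, this residual is measured directly in pixel coordinates, so its gradients do not depend on local image texture.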

DeepV2D is similar to BA-Net.

It seems to be more practical than Consistent Video Depth Estimation as it converges quickly during inference (5-6 iterations).

Key ideas

Technical details