Feature-metric Loss for Self-supervised Learning of Depth and Egomotion

August 2020

tl;dr: Feature-metric loss to avoid local minima in monodepth.

Overall impression

Local minima arise in monocular depth estimation because correct depth and pose are sufficient but not necessary for a small photometric error. This issue has been tackled either by replacing photometric errors with feature-metric errors, or by using extra cues to guide optimization out of local minima (Depth Hints and MonoResMatch).
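To see why a small photometric error does not pin down depth, here is a toy 1-D sketch (not from the paper). It assumes a rectified pair with integer disparities, and the helper `window_error` and the synthetic image row are made up for illustration. In the flat half of the row every candidate disparity gives zero error, while in the textured half only the true disparity does.

```python
import numpy as np

def window_error(target, source, disparity, lo, hi):
    """Mean absolute photometric error over target pixels [lo, hi)."""
    idx = np.arange(lo, hi)
    # Compare each target pixel with the source pixel `disparity` columns to its left.
    return float(np.mean(np.abs(target[idx] - source[idx - disparity])))

rng = np.random.default_rng(0)
# Synthetic source row: flat (textureless) left half, random texture on the right half.
source = np.concatenate([np.full(32, 0.5), rng.random(32)])
true_disparity = 2
# The target row is the source row shifted by the true disparity.
target = np.roll(source, true_disparity)

candidates = range(5)
flat_errors = [window_error(target, source, d, 8, 24) for d in candidates]
textured_errors = [window_error(target, source, d, 40, 56) for d in candidates]

print("flat window:    ", flat_errors)      # zero for every candidate disparity
print("textured window:", textured_errors)  # zero only at the true disparity
```

The flat window gives the optimizer no signal at all, which is the situation the feature-metric loss is meant to improve.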

In comparison, Depth Hints still uses a photometric loss, whereas feature-metric monodepth largely avoids the interference of local minima.

The feature-metric loss was perhaps first discussed in BA-Net and Deep Feature Reconstruction. It has the advantage of being less sensitive to photometric calibration (camera exposure, white balance) and provides dense supervision.

The idea of using a feature-metric distance instead of a Euclidean distance also appears in 3DSSD.

However, how to learn this feature map is the key. The paper uses an autoencoder to do this, with two extra loss terms that encourage large but smooth feature gradients, for faster and more general optimization.
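One plausible reading of "large but smooth gradient" is sketched below. It assumes the first term rewards large first-order gradients of the feature map and the second penalizes second-order gradients. The function names, the single-channel feature map, and the absence of any weighting are assumptions for illustration; the exact form used in the paper may differ.

```python
import numpy as np

def gradient_magnitude_loss(feat):
    """Negative mean absolute first-order gradient: lower when features vary more."""
    gx = np.abs(np.diff(feat, n=1, axis=1))
    gy = np.abs(np.diff(feat, n=1, axis=0))
    return float(-(gx.mean() + gy.mean()))

def gradient_smoothness_loss(feat):
    """Mean absolute second-order gradient: lower when the gradient itself is smooth."""
    gxx = np.abs(np.diff(feat, n=2, axis=1))
    gyy = np.abs(np.diff(feat, n=2, axis=0))
    return float(gxx.mean() + gyy.mean())

# Hypothetical single-channel feature maps of shape (H, W).
flat = np.zeros((8, 8))                        # no gradient anywhere
ramp = np.tile(0.5 * np.arange(8.0), (8, 1))   # constant horizontal slope of 0.5

print(gradient_magnitude_loss(flat), gradient_smoothness_loss(flat))
print(gradient_magnitude_loss(ramp), gradient_smoothness_loss(ramp))
```

Under these assumptions the ramp is preferred over the flat map: it has a non-zero gradient everywhere, so a feature-metric error always has a slope to follow, and its second-order gradient is zero.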

A small photometric loss does not guarantee accurate depth and pose, especially for pixels in textureless regions. The depth smoothness loss forces depth to propagate from discriminative regions to textureless regions. However, such propagation has limited range and tends to cause over-smoothed results.
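For reference, a minimal sketch of the edge-aware depth smoothness term commonly used in Monodepth-style pipelines. The notes do not say which variant this paper uses, so the grayscale image, the unnormalized depth, and the function name `edge_aware_smoothness` are assumptions.

```python
import numpy as np

def edge_aware_smoothness(depth, image):
    """Penalize depth gradients, down-weighted where the image itself has edges."""
    d_dx = np.abs(np.diff(depth, axis=1))
    d_dy = np.abs(np.diff(depth, axis=0))
    i_dx = np.abs(np.diff(image, axis=1))
    i_dy = np.abs(np.diff(image, axis=0))
    return float(np.mean(d_dx * np.exp(-i_dx)) + np.mean(d_dy * np.exp(-i_dy)))

# Depth map with a vertical step edge between columns 3 and 4.
depth_step = np.zeros((8, 8))
depth_step[:, 4:] = 1.0

image_flat = np.zeros((8, 8))    # textureless image
image_edge = np.zeros((8, 8))    # image edge aligned with the depth edge
image_edge[:, 4:] = 1.0

loss_const = edge_aware_smoothness(np.ones((8, 8)), image_flat)
loss_flat = edge_aware_smoothness(depth_step, image_flat)
loss_edge = edge_aware_smoothness(depth_step, image_edge)
print(loss_const, loss_flat, loss_edge)
```

The term only acts on differences between neighboring pixels, which is consistent with the limited propagation range noted above.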

A set of assumptions (for SfM-Learner): the corresponding 3D point is static with Lambertian reflectance and not occluded in both views.

Key ideas

Technical details