Depth Hints: Self-Supervised Monocular Depth Hints

July 2020

tl;dr: Use depth pseudo-labels to guide self-supervised depth prediction out of local minima.

Overall impression

This paper digs into self-supervised learning and provides tons of insights, in a fashion similar to What Monodepth See.

It first shows that the photometric loss (DSSIM + L1) used in monodepth can struggle to find the global minimum and get trapped in local minima during training. It then provides a way to use depth pseudo-labels effectively, as a soft form of supervision: depth hints are used only when needed, to guide the network out of local minima. In a way, this is similar to the idea in Monodepth2 of taking the minimum of the reprojection loss across multiple frames.
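The "use the hint only when needed" rule can be sketched in NumPy. This is a minimal illustration, not the authors' code: it assumes the per-pixel reprojection losses for the predicted depth and for the hint depth are already computed, and the function name `depth_hint_loss` and all toy values are hypothetical. The supervised term is taken to be log-L1, matching the `logl1` curve plotted under Technical details.

```python
import numpy as np

def depth_hint_loss(reproj_pred, reproj_hint, depth_pred, depth_hint):
    """Photometric loss everywhere, plus a log-L1 supervised term only
    at pixels where the hint reprojects better than the prediction."""
    # True where the hint explains the image better than the prediction.
    use_hint = reproj_hint < reproj_pred
    # Supervised term pulling the prediction toward the hint (log-L1).
    log_l1 = np.log1p(np.abs(depth_pred - depth_hint))
    per_pixel = reproj_pred + use_hint * log_l1
    return per_pixel.mean(), use_hint

# Toy 2x2 example with made-up values (assumption, for illustration only).
reproj_pred = np.array([[0.10, 0.40], [0.30, 0.05]])
reproj_hint = np.array([[0.20, 0.10], [0.30, 0.50]])
depth_pred = np.array([[5.0, 9.0], [3.0, 2.0]])
depth_hint = np.array([[5.5, 6.0], [3.0, 2.5]])

loss, mask = depth_hint_loss(reproj_pred, reproj_hint, depth_pred, depth_hint)
print(mask)  # only the top-right pixel receives the supervised term
print(loss)
```

Because the mask is recomputed from the current prediction, the hint stops contributing at a pixel as soon as the network's own depth reprojects at least as well, which is what makes the supervision soft.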

This paper proposes a way to consume possibly noisy depth labels together with a self-supervised pipeline. This works better than using the supervised signal alone, or simply summing the two losses.

Another way to avoid local minima is to use a feature-metric loss instead of the photometric loss, as in Feature metric monodepth, BA-Net and Deep Feature Reconstruction. In comparison, Depth Hints still uses the photometric loss, whereas Feature metric monodepth largely avoids the interference of local minima.

Both Depth Hints and MonoResMatch propose to use cheap stereo GT to build up a monodepth dataset. Depth Hints uses multiple parameter setups to obtain an averaged proxy label and applies a soft (hint) supervision scheme. MonoResMatch uses a left-right consistency check to filter out spurious predictions and applies a traditional hard supervision scheme.

Key ideas

Technical details


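import numpy as np
import matplotlib.pyplot as plt

# Compare the log-L1 and berHu losses over residuals in [-max_n, max_n).
max_n = 2
xs = np.arange(-max_n, max_n, max_n/100)
delta = 0.2 * max_n  # berHu switches from L1 to scaled L2 at |x| = delta
logl1 = np.log(1 + np.abs(xs))
berhu = np.abs(xs) * (np.abs(xs) < delta) + (xs ** 2 + delta ** 2) / (2 * delta) * (np.abs(xs) >= delta)
plt.plot(xs, logl1, label='logl1')
plt.plot(xs, berhu, label='berhu')
plt.legend()
plt.show()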
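
For reference, the two curves in the snippet correspond to the definitions below, with residual x and threshold δ. They are written out from the code rather than copied from the paper, so treat the notation as an assumption.

```latex
\ell_{\log L_1}(x) = \log\bigl(1 + |x|\bigr), \qquad
\ell_{\mathrm{berHu}}(x) =
\begin{cases}
|x| & \text{if } |x| < \delta, \\[4pt]
\dfrac{x^2 + \delta^2}{2\delta} & \text{otherwise.}
\end{cases}
```

Both branches of berHu equal δ at |x| = δ, so the loss is continuous at the switch point.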

We provide two further intuitions with respect to the difference between L2 and berHu loss. In both datasets that we experimented with, we observe a heavy-tailed distribution of depth values, also reported in [27], for which Zwald and Lambert-Lacroix [40] show that the berHu loss function is more appropriate. This could also explain why [5, 6] experience better convergence when predicting the log of the depth values, effectively moving a log-normal distribution back to Gaussian. Secondly, we see the greater benefit of berHu in the small residuals during training, as there the L1 derivative is greater than L2's. This manifests in the error measures rel. and δ1 (Sec. 4), which are more sensitive to small errors.
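
The point about small residuals can be checked numerically. The sketch below is illustrative only: it assumes L2 means the plain squared error x**2 (with x**2 / 2 the gradient would be |x| instead of 2|x|), reuses delta = 0.4 from the plotting snippet, and the helper names `berhu_grad` and `l2_grad` are hypothetical.

```python
import numpy as np

def berhu_grad(x, delta):
    """Derivative of berHu: sign(x) in the L1 region, x / delta outside it."""
    return np.where(np.abs(x) < delta, np.sign(x), x / delta)

def l2_grad(x):
    """Derivative of the plain squared error x**2 (assumed L2 form)."""
    return 2 * x

delta = 0.4  # same value as 0.2 * max_n in the plotting snippet
small = np.array([0.01, 0.05, 0.1, 0.3])  # residuals inside the L1 region

print(np.abs(berhu_grad(small, delta)))  # constant magnitude in the L1 region
print(np.abs(l2_grad(small)))            # shrinks along with the residual
```

For these residuals the berHu gradient keeps a constant magnitude while the L2 gradient vanishes as the residual shrinks, so small errors keep receiving a meaningful training signal.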