
TrianFlow: Towards Better Generalization: Joint Depth-Pose Learning without PoseNet

May 2020

tl;dr: Use optical flow as dense 2D-2D correspondences to solve for local pose, then align the depth prediction to it.

Overall impression

The name seems to come from “triangulate-flow”.

PoseNet lacks generalization ability: it performs badly on long sequences where the relative pose varies widely across the sequence and when the video is sped up, and it hardly beats an image-retrieval baseline.

The idea of using optical flow to calculate relative pose is very similar to DF-VO. The main differences:

DF-VO pre-trains the depth and flow networks separately with a PoseNet-like architecture, while TrianFlow gets rid of PoseNet altogether and uses triangulation to perform self-supervision.
DF-VO is based on SC-SfM-learner to ensure consistency and aligns pose to depth. TrianFlow aligns depth to pose. (Why?)

The knowledge of correspondence (matching) does not have to be learned by PoseNet and thus improves network generalization ability.

Key ideas

FlowNet is based on PWC-Net
Scale is explicitly disentangled at both training and inference.

Training:

optical flow to get dense matching
forward-backward consistency to generate score map Ms
Sample points that survive the occlusion mask Mo and are in the top 20% by forward-backward score.
8-point algorithm in RANSAC + cheirality check to solve for the F matrix and R, t.
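
The first three steps can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the backward flow is looked up with nearest-neighbour rounding instead of the bilinear warping a real implementation would likely use, the score is taken to be the forward-backward error (lower is better), and the occlusion mask is assumed to be 1 for visible pixels.

```python
import numpy as np

def fb_consistency_error(flow_fw, flow_bw):
    """Per-pixel forward-backward flow error, shape (H, W).

    flow_fw, flow_bw: (H, W, 2) arrays of (dx, dy) in pixels.
    Assumption: nearest-neighbour lookup, for brevity.
    """
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel of frame 1 lands in frame 2 (clipped to the image).
    x2 = np.clip(np.rint(xs + flow_fw[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip(np.rint(ys + flow_fw[..., 1]).astype(int), 0, h - 1)
    # A consistent backward flow cancels the forward flow at that location.
    residual = flow_fw + flow_bw[y2, x2]
    return np.linalg.norm(residual, axis=-1)

def sample_matches(flow_fw, flow_bw, occlusion_mask, keep_ratio=0.2):
    """Keep visible pixels whose error is within the best `keep_ratio` quantile.

    Returns matched (N, 2) pixel coordinates (x, y) in frame 1 and frame 2.
    Ties at the threshold are all kept, so N can exceed keep_ratio * #visible.
    """
    err = fb_consistency_error(flow_fw, flow_bw)
    visible = occlusion_mask.astype(bool)
    thresh = np.quantile(err[visible], keep_ratio)
    keep = visible & (err <= thresh)
    ys, xs = np.nonzero(keep)
    pts1 = np.stack([xs, ys], axis=1).astype(np.float64)
    pts2 = pts1 + flow_fw[ys, xs]
    return pts1, pts2
```

The returned point pairs are what the 8-point + RANSAC step would consume.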

Based on R, t and the correspondences, get triangulated point depth with mid-point triangulation to obtain an up-to-scale 3D structure. Points around epipoles (vanishing points) are removed for triangulation.
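
A minimal sketch of mid-point triangulation, assuming the convention X2 = R X1 + t and image points already in normalized coordinates (K^-1 applied). The function name and conventions are illustrative assumptions, not taken from the paper's code. For each match, it finds the closest points on the two viewing rays and returns their midpoint.

```python
import numpy as np

def midpoint_triangulate(x1n, x2n, R, t):
    """Triangulate matches by the mid-point method.

    x1n, x2n: (N, 2) normalized image coordinates in frame 1 and frame 2.
    R, t: relative pose with X2 = R @ X1 + t (assumed convention).
    Returns (N, 3) points in the camera-1 frame; depth is column 2.
    """
    n = len(x1n)
    d1 = np.column_stack([x1n, np.ones(n)])        # rays from camera 1
    d2 = np.column_stack([x2n, np.ones(n)]) @ R    # camera-2 rays in frame 1 (R^T d)
    c2 = -R.T @ t                                  # camera-2 centre in frame 1
    # Minimize |s*d1 - (c2 + u*d2)|^2 over s, u via 2x2 normal equations.
    a = np.einsum('ni,ni->n', d1, d1)
    b = np.einsum('ni,ni->n', d1, d2)
    c = np.einsum('ni,ni->n', d2, d2)
    e = d1 @ c2
    f = d2 @ c2
    det = a * c - b * b    # approaches 0 for near-parallel rays (e.g. near epipoles)
    s = (c * e - b * f) / det
    u = (b * e - a * f) / det
    p1 = s[:, None] * d1
    p2 = c2 + u[:, None] * d2
    return 0.5 * (p1 + p2)
```

Because the denominator vanishes for near-parallel rays, this also illustrates why points around the epipoles are filtered out first. Scaling t by a constant scales every triangulated point by the same constant, which is the up-to-scale ambiguity mentioned above.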

Dense predicted depth is aligned to sparse triangulated depth. The 3D structure’s scale is determined by the relative pose scale. The triangulated depth is used as a pseudo-depth signal to supervise depth prediction.
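
One way to picture the alignment and the resulting supervision is the sketch below. It is written under assumptions: the scale is estimated as the median ratio between triangulated and predicted depth at the sampled pixels (the paper's exact estimator may differ), and the loss is an L1 difference divided by the triangulated depth. The function name is hypothetical.

```python
import numpy as np

def align_and_score(pred_depth, ys, xs, tri_depth):
    """Align a dense depth map to sparse triangulated depth and score it.

    pred_depth: (H, W) predicted depth, positive.
    ys, xs: (N,) integer pixel indices of the triangulated points.
    tri_depth: (N,) triangulated (up-to-scale) depth at those pixels.
    Returns (aligned dense depth, scale, normalized L1 loss).
    """
    pred_sparse = pred_depth[ys, xs]
    # Assumption: median ratio as a robust single-scale estimate.
    scale = np.median(tri_depth / pred_sparse)
    aligned = scale * pred_depth
    # Dividing by the triangulated depth removes dependence on its global scale.
    loss = np.mean(np.abs(aligned[ys, xs] - tri_depth) / tri_depth)
    return aligned, scale, loss
```

Multiplying `tri_depth` by any positive constant multiplies `scale` by the same constant and leaves `loss` unchanged, so the supervision only constrains relative depth.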

Inference (same as DF-VO)
- Calculate fundamental matrix from optical flow
- When optical flow is too small, use PnP to solve for relative pose.
TrianFlow can generalize to unseen ego motion.
- For sequences sped up 3x, ORB-SLAM2 frequently fails and reinitializes under the fast motion.
The results are better than most other end-to-end methods, but not as good as DF-VO.
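
The fundamental-matrix step can be illustrated with a plain normalized 8-point solver in NumPy. This is a simplified sketch: it omits the RANSAC loop, the cheirality check and the PnP fallback described in these notes, and the function names are hypothetical. In practice a library routine would typically be used instead.

```python
import numpy as np

def _normalize(pts):
    """Hartley normalization: zero mean, mean distance sqrt(2) from the origin."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.linalg.norm(pts - mean, axis=1).mean()
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return pts_h @ T.T, T

def eight_point(pts1, pts2):
    """Estimate F with x2^T F x1 = 0 from N >= 8 matches of shape (N, 2)."""
    x1, T1 = _normalize(pts1)
    x2, T2 = _normalize(pts2)
    # Each row is the flattened outer product x2 x1^T, matching F.reshape(9).
    A = np.einsum('ni,nj->nij', x2, x1).reshape(-1, 9)
    _, _, vt = np.linalg.svd(A)
    F = vt[-1].reshape(3, 3)          # null vector of A
    # Enforce rank 2 by zeroing the smallest singular value.
    u, s, vt = np.linalg.svd(F)
    s[2] = 0.0
    F = u @ np.diag(s) @ vt
    # Undo the normalization and fix the arbitrary scale.
    F = T2.T @ F @ T1
    return F / np.linalg.norm(F)
```

With optical flow as input, `pts1` would be the sampled pixel coordinates and `pts2 = pts1 + flow` at those pixels.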

Technical details

Occlusion map, Mo
Flow consistency score map, Ms
The recovered pose from optical flow is obtained using cv2.recoverPose and has unit length t.
Inlier score map, Mr: computed as the distance map from each pixel to its corresponding epipolar line. Implementation of inlier mask in code
Angle mask: filter out points close to epipoles. Implementation of angle mask
During training, the sparse triangulated depth is up to scale, and the depth difference is normalized by the sparse depth value, so the depth loss is scale invariant.
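
The point-to-epipolar-line distance behind the inlier score map Mr can be sketched as follows. The function name is hypothetical, and how the distance is turned into a score or mask (for example, by thresholding) is left open because the notes do not specify it.

```python
import numpy as np

def epipolar_distance(F, pts1, pts2):
    """Distance in pixels from each point in frame 2 to its epipolar line.

    F: (3, 3) fundamental matrix with x2^T F x1 = 0.
    pts1, pts2: (N, 2) matched pixel coordinates (x, y).
    """
    n = len(pts1)
    x1 = np.column_stack([pts1, np.ones(n)])
    x2 = np.column_stack([pts2, np.ones(n)])
    lines = x1 @ F.T                      # row i is the line l = F x1_i = (a, b, c)
    num = np.abs(np.sum(lines * x2, axis=1))
    den = np.linalg.norm(lines[:, :2], axis=1)
    return num / den
```

For a dense map, `pts1` would be the full pixel grid and `pts2` the grid displaced by the optical flow, with the result reshaped to (H, W).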

Notes

Q: The scale normalization is there to ensure a consistent scale between depth and flow, but what ensures scale consistency across frames? –> This seems to be learned implicitly by the depth network. The depth network now only has to focus on learning relative depth, and the scale consistency seems to come from the continuity assumption of the network: a continuous change in the image leads to a continuous change in the depth prediction. But adding the scale consistency loss proposed in SC-SfM-learner does not seem to hurt?
The paper
During inference, the code actually assumes depth predictions have consistent scale and thus aligns pose to depth.

The central idea of existing self-supervised depth-pose learning methods is to learn two separated networks on the estimation of monocular depth and relative pose by enforcing geometric constraints on image pairs.