To Learn or Not to Learn: Visual Localization from Essential Matrices

May 2020

tl;dr: 3D structure > SIFT + 5pt solver > Neural network based.

Overall impression

The paper builds on Understanding APR and demonstrates that the good old SIFT + 5 pt solver is still the state of the art for relative pose estimation without 3D structure. Methods that use 3D structure can achieve better results, but they require scene-specific 3D modeling and do not generalize to new scenes.

Relative localization has three steps: extracting features, matching them across the two images, and computing the essential matrix (or R and t).
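To make the last step concrete, here is a minimal sketch of what the essential matrix encodes. It assumes the common convention X2 = R X1 + t, under which E = [t]x R and every correct match in normalized image coordinates satisfies x2^T E x1 = 0. The pose, the 3D points, and all helper names are made up for illustration and are not from the paper.

```python
import math

def skew(t):
    """Cross-product matrix [t]x, so that skew(t) @ v == t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def essential_from_pose(R, t):
    """E = [t]x R, assuming points transform as X2 = R X1 + t."""
    return matmul(skew(t), R)

def epipolar_residual(E, x1, x2):
    """x2^T E x1; zero (up to rounding) for a correct match."""
    Ex1 = matvec(E, x1)
    return sum(x2[i] * Ex1[i] for i in range(3))

def project(X):
    """Normalized homogeneous image coordinates of a 3D point."""
    return [X[0] / X[2], X[1] / X[2], 1.0]

# Hypothetical relative pose: 10 degree rotation about the y-axis plus a translation.
angle = math.radians(10.0)
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = [1.0, 0.0, 0.2]

X1 = [0.5, -0.3, 4.0]                            # point in the camera-1 frame
X2 = [a + b for a, b in zip(matvec(R, X1), t)]   # same point in the camera-2 frame
x1, x2 = project(X1), project(X2)

E = essential_from_pose(R, t)
residual = epipolar_residual(E, x1, x2)           # ~0 for the true match

# A wrong match (a different 3D point seen in image 1) violates the constraint.
bad_residual = epipolar_residual(E, project([0.5, 0.3, 4.0]), x2)
```

A 5 pt solver goes the other way: given five or more such matches, it recovers E (and from it R and t up to scale), which is why the quality of the matches matters so much.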

The bottleneck of DL-based approaches is the matching and pose regression part. DL regression cannot generalize to new scenes because a regression network cannot properly learn implicit matching.

However, even if we replace the pose regression with a 5 pt solver, the learned pipeline still cannot beat SIFT + 5 pt solver. This is mainly because current CNN features are coarsely localized on the image: the features from the later layers are mapped not to a single pixel but to an image patch. Self-supervised keypoint learning methods still cannot beat SIFT consistently either. I wrote a blog about self-supervised keypoint learning here. As pointed out in an open review for KP2D, “the problem is old yet not fully solved yet, because handcrafted SIFT is still winning the benchmarks.”
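The coarse-localization point can be illustrated with a little arithmetic. The sketch below computes the receptive field and the cumulative stride of a feature-map cell for a stack of conv/pool layers. The layer list is a hypothetical VGG-style configuration chosen for illustration, not the network used in the paper.

```python
def receptive_field(layers):
    """Return (receptive_field, cumulative_stride) in input pixels.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) * current stride
        jump *= stride              # strides multiply through the stack
    return rf, jump

# Hypothetical VGG-style stack: 3x3 convs (stride 1) and 2x2 max-pools (stride 2).
layers = [
    (3, 1), (3, 1), (2, 2),
    (3, 1), (3, 1), (2, 2),
    (3, 1), (3, 1), (3, 1), (2, 2),
]

rf, jump = receptive_field(layers)
print(rf, jump)  # 44 8
```

Under these assumptions, one cell after the third pooling layer summarizes a 44 x 44 pixel patch, and neighboring cells are 8 pixels apart in the input. A match between two such cells therefore pins the correspondence down far less precisely than a sub-pixel SIFT keypoint, and that error feeds directly into the 5 pt solver.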

Key ideas

Technical details