Deep MANTA: A Coarse-to-fine Many-Task Network for joint 2D and 3D vehicle analysis from monocular image

October 2019

tl;dr: Predict 2D keypoints and use 2D/3D matching (EPnP) to get the position and orientation of the 3D bbox.

Overall impression

This is one of the first papers on mono3DOD. It detects 2D keypoints, and solves for the best position through 2D/3D matching.
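
To make the 2D/3D matching step concrete, here is a minimal sketch in plain Python. It is not the paper's pipeline: the paper solves the pose with EPnP, while this sketch swaps in a brute-force search over yaw with the translation assumed known, just to show what "best pose" means in terms of reprojection error. The box template, the camera convention (x right, y down, z forward), the intrinsics, and all function names are illustrative assumptions.

```python
import math

# Hypothetical vehicle template: 8 corners of a box in the object frame (metres).
# Half-width 0.9 (x), half-height 0.75 (y), half-length 2.0 (z) are made-up values.
TEMPLATE = [(sx * 0.9, sy * 0.75, sz * 2.0)
            for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]


def project(points_3d, yaw, translation, focal, cx, cy):
    """Project object-frame points into the image with a pinhole camera."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    pixels = []
    for x, y, z in points_3d:
        # Rotate about the vertical (y) axis, then translate into the camera frame.
        xc = c * x + s * z + tx
        yc = y + ty
        zc = -s * x + c * z + tz
        pixels.append((focal * xc / zc + cx, focal * yc / zc + cy))
    return pixels


def reprojection_error(points_3d, keypoints_2d, yaw, translation, focal, cx, cy):
    """Mean pixel distance between projected 3D points and detected 2D keypoints."""
    projected = project(points_3d, yaw, translation, focal, cx, cy)
    total = sum(math.hypot(u - u0, v - v0)
                for (u, v), (u0, v0) in zip(projected, keypoints_2d))
    return total / len(keypoints_2d)


def best_yaw(points_3d, keypoints_2d, translation, focal, cx, cy, steps=360):
    """Return the candidate yaw (1-degree grid by default) with the lowest error."""
    candidates = [-math.pi + 2.0 * math.pi * i / steps for i in range(steps)]
    return min(candidates,
               key=lambda yaw: reprojection_error(
                   points_3d, keypoints_2d, yaw, translation, focal, cx, cy))


# Usage: synthesize "detected" keypoints from a known pose, then recover the yaw.
true_yaw = math.radians(30)
pose_t = (1.0, 1.5, 20.0)
keypoints = project(TEMPLATE, true_yaw, pose_t, 700.0, 640.0, 360.0)
estimated_yaw = best_yaw(TEMPLATE, keypoints, pose_t, 700.0, 640.0, 360.0)
```

A real solver such as EPnP estimates the full rotation and translation from the correspondences directly instead of scanning candidates, and real keypoints are noisy, so the minimum error would not be zero.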

This study reiterates the idea that 3D vehicle information can be recovered from monocular images because vehicles are rigid bodies with well-known geometries.

I feel that the 2D/3D matching is time-consuming and discards a lot of useful information that could be regressed directly from the image.

Key ideas

Technical details