MonoDIS: Disentangling Monocular 3D Object Detection

August 2019

tl;dr: End-to-end training of 2D and 3D heads on top of RetinaNet for monocular 3D object detection.

Overall impression

The paper clearly specifies the input and output of the network, and the information required at training and at inference. It does not require depth information, only the RGB image and the 3D bbox info.

The paper proposes a disentangling transformation that splits an entangled loss (e.g., one that depends on the size and the location of the bbox at the same time) into separate terms, one per parameter group. Each term takes one group of parameters from the prediction and uses the GT for all the other groups, so it only penalizes the error of that one group. Note that some losses are already disentangled, such as those originally proposed by YOLO or Faster RCNN. The transformation only applies to losses that go through a complicated transformation of the parameters, such as the 3D bbox corner loss and the sIoU loss proposed in this paper.
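A minimal sketch of the disentangling idea applied to a 3D corner loss, in NumPy. The parameter groups (`center`, `dims`, `yaw`), the box parameterization, and the function names are all assumptions made for illustration; the paper's own grouping and loss details may differ.

```python
import numpy as np

def box_corners(center, dims, yaw):
    """Return the 8 corners (8x3) of a 3D box (hypothetical parameterization)."""
    w, h, l = dims
    # All sign combinations give the 8 corners in the object frame.
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
        dtype=float,
    )
    local = signs * np.array([l, h, w]) / 2.0
    # Rotation about the vertical (y) axis, then translation to the center.
    c, s = np.cos(yaw), np.sin(yaw)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return local @ rot_y.T + np.asarray(center, dtype=float)

def corner_loss(pred, gt):
    """Entangled loss: mean L1 distance between predicted and GT corners."""
    return np.abs(box_corners(**pred) - box_corners(**gt)).mean()

def disentangled_corner_loss(pred, gt):
    """Sum of corner losses, each with only one group taken from the prediction."""
    total = 0.0
    for group in pred:
        mixed = dict(gt)             # start from the GT parameters
        mixed[group] = pred[group]   # swap in one predicted group
        total += corner_loss(mixed, gt)
    return total
```

In a training framework with autograd the same loop would be written on tensors, so that each term back-propagates only into the group taken from the prediction.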

The paper is further enhanced by MoVi-3D to reach SOTA.

Decoupled Structured Polygon formulates monocular 3D object detection as two decoupled tasks: 2D projection prediction and 1D depth prediction.

Key ideas

Technical details