Learning-Deep-Learning

UR3D: Distance-Normalized Unified Representation for Monocular 3D Object Detection

August 2020

tl;dr: Use a distance-related feature transformation as a prior to ease learning.

Overall impression

The paper used depth normalization for monocular 3D object detection. Similar ideas have been used in Monoloco and BS3D.

The paper's good idea is to learn scale-invariant features and scale them linearly according to the FPN level each regression head is attached to. This idea seems similar to MoVi-3D, and the two should be explored together.

However, the prediction tasks have known relationships to scale, so we do NOT need to learn them explicitly. For example, bbox size is linear in scale and depth is inversely proportional to scale, both changing by a factor of 2 per FPN level. The paper also seems to confuse the notions of depth (z) and distance (l2norm((x, y, z))).
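As a rough illustration of the two points above, the sketch below assumes a simple pinhole camera. The function names, focal length, object height, and strides are hypothetical and not taken from the paper. It shows that depth and distance differ for an off-axis point, and that an object covering the same number of feature-map cells appears twice as large in pixels, and sits at half the depth, each time the FPN stride doubles.

```python
import math

def depth_and_distance(x, y, z):
    """Depth is the z coordinate; distance is the L2 norm of (x, y, z)."""
    return z, math.sqrt(x * x + y * y + z * z)

def depth_for_footprint(focal_px, height_m, cells, stride):
    """Depth at which an object of height_m spans `cells` cells at `stride`.

    Assumed pinhole model: pixel_height = focal_px * height_m / depth.
    """
    pixel_height = cells * stride
    return focal_px * height_m / pixel_height

# An off-axis point: depth and distance disagree.
depth, distance = depth_and_distance(3.0, 0.0, 4.0)

# The same 4-cell footprint on three FPN levels (illustrative strides).
strides = [8, 16, 32]
pixel_heights = [4 * s for s in strides]
depths = [depth_for_footprint(720.0, 1.5, 4, s) for s in strides]

print(depth, distance)
print(pixel_heights)
print(depths)
```

Because both ratios are fixed by geometry, a per-level constant factor of 2 could be hard-coded rather than regressed.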

The results are not SOTA compared to pseudo-lidar or AM3D, and they also lag behind the contemporary work PatchNet.

UR3D is largely based on the architecture of FCOS. Similarly, SMOKE is based on CenterNet.

Key ideas

Prediction tasks can be grouped into three categories.
- scale invariant tasks:
  - object class
  - physical size
  - orientation
- scale linear tasks. Each level of FPN regresses one scaling constant.
  - bbox
  - keypoint location
- scale nonlinear tasks. Each level of FPN regresses one scaling constant.
  - depth prediction
Uses DORN to generate a depth prediction patch that guides the learning of depth, so the network only needs to learn the residual values.
Distance-guided NMS: Since distance (or depth) prediction is the key to accurate 3D object detection, NMS is guided by the confidence of the distance prediction, not by the cls score alone.
- Use depth conf * cls conf as the sorting criterion, and also as the weight when averaging the depth values.
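The distance-guided NMS described above might look roughly like the following NumPy sketch. It assumes axis-aligned 2D boxes in `(x1, y1, x2, y2)` form with positive area, positive scores, and that the weighted depth average is taken over the kept box and the boxes it suppresses. The function names, the IoU threshold, and the clustering rule are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def iou_xyxy(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def distance_guided_nms(boxes, cls_conf, depth_conf, depth, iou_thr=0.5):
    """Return a list of (kept_index, fused_depth) pairs.

    Boxes are ranked by cls_conf * depth_conf; the same product weights
    the depth average over each kept box and the boxes it suppresses.
    """
    score = cls_conf * depth_conf
    order = np.argsort(-score)
    kept = []
    while order.size > 0:
        i = order[0]
        ious = iou_xyxy(boxes[i], boxes[order])
        cluster = order[ious >= iou_thr]  # includes i itself (IoU = 1)
        weights = score[cluster]
        fused = float(np.sum(weights * depth[cluster]) / np.sum(weights))
        kept.append((int(i), fused))
        order = order[ious < iou_thr]
    return kept

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
cls_conf = np.array([0.9, 0.8, 0.7])
depth_conf = np.array([0.5, 0.9, 0.8])
depth = np.array([10.0, 12.0, 30.0])
result = distance_guided_nms(boxes, cls_conf, depth_conf, depth)
print(result)
```

In this toy input, box 0 has the highest cls score but box 1 has the highest combined score, so box 1 is the one kept from the overlapping pair.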
Fully convolutional cascaded point regression
- 1st stage: regress the location of the center point.
- 2nd stage: use a deformable-convolution framework to pool all related points and predict the residual location offset.
Post-processing optimizes the 3D bbox against the 9 predicted keypoints and the regressed physical size. This step uses a projection loss plus a residual loss on the predicted size.
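A minimal sketch of what such a post-processing objective could look like is given below. It assumes the 9 keypoints are the 8 box corners plus the geometric center, a pinhole intrinsic matrix `K`, a yaw-only rotation about the camera y axis, and squared-error terms. The function names, the `(length, height, width)` ordering, and `size_weight` are hypothetical, and only the objective is shown, not the optimizer.

```python
import numpy as np

def box_keypoints_3d(center, size, yaw):
    """8 corners + center of a 3D box in camera coordinates.

    Assumptions: size is (length, height, width) along the box's local
    x, y, z axes, and yaw rotates about the camera y axis.
    """
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                      for sy in (-1, 1) for sz in (-1, 1)], float)
    local = np.vstack([signs * np.asarray(size, float) / 2.0, np.zeros(3)])
    c, s = np.cos(yaw), np.sin(yaw)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return local @ rot_y.T + np.asarray(center, float)

def project(points, K):
    """Pinhole projection of Nx3 camera-frame points to Nx2 pixels."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def refinement_objective(center, size, yaw, K, keypoints_2d, size_pred,
                         size_weight=1.0):
    """Projection term plus residual term on the regressed size."""
    proj = project(box_keypoints_3d(center, size, yaw), K)
    reproj = np.sum((proj - keypoints_2d) ** 2)
    size_res = np.sum((np.asarray(size, float) - np.asarray(size_pred, float)) ** 2)
    return float(reproj + size_weight * size_res)

# Illustrative values only.
K = np.array([[700.0, 0.0, 600.0], [0.0, 700.0, 180.0], [0.0, 0.0, 1.0]])
true_center, true_size, true_yaw = (1.0, 1.0, 20.0), (4.0, 1.5, 1.6), 0.3
kps = project(box_keypoints_3d(true_center, true_size, true_yaw), K)

loss_at_truth = refinement_objective(true_center, true_size, true_yaw, K, kps, true_size)
loss_shifted = refinement_objective((1.0, 1.0, 21.0), true_size, true_yaw, K, kps, true_size)
print(loss_at_truth, loss_shifted)
```

The objective is zero when the box parameters reproduce the keypoints and the regressed size, and grows as the box is moved away, which is what a generic optimizer would exploit.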

Technical details

Losses
- Wing loss for distance, size and orientation estimation.
- Smooth L1 loss for keypoint regression.
- IoU loss for bbox regression
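
For reference, the first two losses listed above can be sketched in NumPy as follows. The wing loss follows its standard piecewise definition (logarithmic near zero, linear beyond width `w`). The default values of `w`, `eps`, and `beta` are illustrative assumptions, not the settings used in UR3D.

```python
import numpy as np

def wing_loss(pred, target, w=10.0, eps=2.0):
    """Elementwise wing loss: w * ln(1 + |x| / eps) if |x| < w, else |x| - C."""
    x = np.abs(np.asarray(pred, float) - np.asarray(target, float))
    # C makes the two pieces meet continuously at |x| = w.
    c = w - w * np.log(1.0 + w / eps)
    return np.where(x < w, w * np.log(1.0 + x / eps), x - c)

def smooth_l1_loss(pred, target, beta=1.0):
    """Elementwise smooth L1: quadratic below beta, linear above."""
    x = np.abs(np.asarray(pred, float) - np.asarray(target, float))
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

errors = np.array([0.0, 0.5, 3.0, 10.0])
print(wing_loss(errors, 0.0))
print(smooth_l1_loss(errors, 0.0))
```

Compared with smooth L1, the wing loss has a steeper gradient for small errors, which is why it is often preferred for fine-grained regression targets.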

Notes

Questions and notes on how to improve/revise the current work