Learning-Deep-Learning

FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection

August 2021

tl;dr: FCOS baseline of mono3D.

Most single-stage mono3D methods since SMOKE use CenterNet as the baseline. This paper switches the baseline to FCOS and achieves good results.

Objects are distributed to different feature levels according to their 2D scales (taken from the reprojected 3D bbox, so no 2D annotation is required).
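A minimal sketch of the idea, not the authors' implementation: project the eight corners of a 3D box into the image, take the enclosing 2D box, and pick a feature level from its scale. The camera convention (x right, y down, z forward, yaw about the y axis), the helper names, and the per-box level rule are assumptions made for illustration. FCOS itself assigns levels per point by regression distance; the ranges below are the defaults from the original FCOS paper and may differ from what FCOS3D uses.

```python
import numpy as np

# Scale ranges per feature level (original FCOS defaults; an assumption for FCOS3D).
ASSUMED_RANGES = [(0, 64), (64, 128), (128, 256), (256, 512), (512, np.inf)]


def box3d_corners(center, size, yaw):
    """Return the 8 corners (8, 3) of a 3D box in camera coordinates.

    Assumed convention: size = (extent along x, y, z) before rotation,
    yaw is a rotation about the camera y axis.
    """
    l, h, w = size
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * l / 2
    y = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * h / 2
    z = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * w / 2
    corners = np.stack([x, y, z], axis=1)
    c, s = np.cos(yaw), np.sin(yaw)
    rot_y = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return corners @ rot_y.T + np.asarray(center, dtype=float)


def project_to_2d_box(corners, K):
    """Project corners with intrinsics K; return the enclosing (x1, y1, x2, y2).

    Assumes every corner lies in front of the camera (z > 0).
    """
    uvw = corners @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]
    return np.concatenate([uv.min(axis=0), uv.max(axis=0)])


def assign_level(box2d, ranges=ASSUMED_RANGES):
    """Pick the level whose (lo, hi] range contains the box's longer side."""
    scale = max(box2d[2] - box2d[0], box2d[3] - box2d[1])
    for level, (lo, hi) in enumerate(ranges):
        if lo < scale <= hi:
            return level
    return len(ranges) - 1


K = np.array([[1000.0, 0.0, 800.0], [0.0, 1000.0, 450.0], [0.0, 0.0, 1.0]])
box2d = project_to_2d_box(box3d_corners((0.0, 0.0, 20.0), (4.0, 2.0, 2.0), 0.0), K)
level = assign_level(box2d)
```

The point of the sketch is that the 2D box is derived purely from the 3D annotation and the camera intrinsics, which is why no separate 2D labels are needed.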

The core challenge of mono3D is how to assign 3D targets to the 2D domain using the 2D-3D correspondence, and then predict them.

Ambiguity issue: a point may fall inside multiple ground truth bboxes in the same feature level. FCOS chooses the bbox with the smaller area as the target box, whereas FCOS3D chooses the box with the closer distance (preferring objects in the foreground).
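The two tie-breaking rules can be compared side by side in a short sketch. This is illustrative only: the function name and array layout are hypothetical, and the per-point distances are taken as a given input because the note does not pin down how the distance is measured.

```python
import numpy as np


def resolve_ambiguity(inside, areas, dists):
    """Choose one GT box per point under two criteria.

    inside: (P, G) bool, True if point p lies inside GT box g (same level).
    areas:  (G,)   2D area of each GT box.
    dists:  (P, G) distance associated with each (point, box) pair.
    Returns (area_choice, dist_choice); each is (P,) with -1 for background.
    """
    inside = np.asarray(inside, dtype=bool)
    # Boxes that do not contain the point get an infinite cost.
    area_cost = np.where(inside, np.asarray(areas, dtype=float)[None, :], np.inf)
    dist_cost = np.where(inside, np.asarray(dists, dtype=float), np.inf)
    background = ~inside.any(axis=1)
    area_choice = np.where(background, -1, area_cost.argmin(axis=1))  # FCOS-style
    dist_choice = np.where(background, -1, dist_cost.argmin(axis=1))  # distance-based
    return area_choice, dist_choice


inside = [[True, True], [False, True], [False, False]]
areas = [100.0, 400.0]
dists = [[30.0, 10.0], [5.0, 8.0], [1.0, 1.0]]
area_choice, dist_choice = resolve_ambiguity(inside, areas, dists)
```

For the first point, which lies in both boxes, the area rule picks the small box 0 while the distance rule picks box 1, so the two criteria can disagree exactly in the overlapping case the note describes.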
nuScenes has 1000 scenes.
nuScenes does not use IoU for matching; it uses the 2D center distance d on the ground plane, which decouples detection from object size and orientation. mAP is calculated by averaging AP over the distance thresholds D = {0.5, 1, 2, 4} m.
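A rough sketch of center-distance matching, written from the description above rather than from the nuScenes devkit: predictions are visited in descending score order and each claims the nearest unmatched ground truth if it lies within the threshold. The function names and the greedy strategy are assumptions for illustration; the official evaluation also handles classes, precision-recall interpolation, and other details not shown here.

```python
import numpy as np

THRESHOLDS_M = (0.5, 1.0, 2.0, 4.0)  # distance thresholds from the note, in meters


def match_by_center_distance(pred_xy, pred_scores, gt_xy, threshold):
    """Return a boolean true-positive flag per prediction (original order)."""
    pred_xy = np.asarray(pred_xy, dtype=float).reshape(-1, 2)
    gt_xy = np.asarray(gt_xy, dtype=float).reshape(-1, 2)
    tp = np.zeros(len(pred_xy), dtype=bool)
    taken = np.zeros(len(gt_xy), dtype=bool)
    if len(gt_xy) == 0:
        return tp
    # Highest-scoring predictions get first pick of the ground truths.
    for i in np.argsort(-np.asarray(pred_scores, dtype=float)):
        d = np.linalg.norm(gt_xy - pred_xy[i], axis=1)
        d[taken] = np.inf  # each ground truth can be matched only once
        j = int(d.argmin())
        if d[j] < threshold:
            tp[i] = True
            taken[j] = True
    return tp


def tp_counts(pred_xy, pred_scores, gt_xy):
    """Number of true positives at each distance threshold."""
    return {
        t: int(match_by_center_distance(pred_xy, pred_scores, gt_xy, t).sum())
        for t in THRESHOLDS_M
    }


gt = [[0.0, 0.0], [10.0, 0.0]]
preds = [[0.3, 0.0], [10.0, 1.5], [50.0, 50.0]]
scores = [0.9, 0.8, 0.7]
counts = tp_counts(preds, scores, gt)
```

The second prediction is 1.5 m off, so it only counts as a true positive at the 2 m and 4 m thresholds. An AP is computed from such flags at each threshold, and the final mAP averages those APs.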
TTA for CenterNet or FCOS: averaging the score maps from the detection heads gives better results than merging the bboxes at the end.
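As a small illustration of score-map averaging, the sketch below fuses the head output for an image with the output for its horizontally flipped copy. Horizontal flipping is an assumed augmentation (the note does not say which ones are used), and the function name is hypothetical. Only the classification score map is handled; regression maps would need their own flip-aware treatment.

```python
import numpy as np


def average_flip_scores(scores, scores_from_flipped):
    """Fuse two (C, H, W) score maps.

    scores:              head output for the original image.
    scores_from_flipped: head output for the horizontally flipped image.
    The second map is flipped back along the width axis before averaging,
    so both maps are aligned pixel to pixel.
    """
    return 0.5 * (scores + scores_from_flipped[..., ::-1])


# Toy example: a single peak seen in both passes.
scores = np.zeros((1, 2, 3))
scores[0, 0, 2] = 1.0
flipped_pass = np.zeros((1, 2, 3))
flipped_pass[0, 0, 0] = 0.5  # same object, seen at the mirrored column
fused = average_flip_scores(scores, flipped_pass)
```

Boxes are then decoded once from the fused map, instead of decoding each pass separately and merging the resulting boxes afterwards.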

Github page
How did the network learn velocity from only one frame? This is not reliable at all. LiDAR-based methods have much lower velocity error.
The current model also struggles with large objects and occluded objects. The authors note that the former may be due to an insufficiently large receptive field.