DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

November 2021

tl;dr: BEV object detection with DETR structure.

Overall impression

Inspired by DETR, the paper uses sparse queries in BEV space for BEV object detection. It makes predictions directly in BEV space, so it does not rely on dense depth prediction and avoids the reconstruction errors that come with it. It is in a way similar to STSU.

In contrast, Mono3D methods have to rely on per-image NMS to remove redundant bboxes within each view, and on global NMS to remove duplicates in the overlap regions between views.

The work is further improved with 3D positional embedding by PETR and PETRv2.

The extension of DETR3D to the temporal domain is relatively straightforward: take the 3D reference point, transform it to a past timestamp using ego motion, and then project it onto the images from that past timestamp.
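
The temporal step above can be sketched as follows. This is a minimal NumPy illustration and not the code of DETR3D or any follow-up work. The function name `project_to_past_image`, the 4x4 pose convention (ego-to-global), the pinhole camera model, and the assumption that the object is static between the two timestamps are all choices made for this sketch.

```python
import numpy as np


def project_to_past_image(ref_points, T_global_from_ego_now, T_global_from_ego_past,
                          T_cam_from_ego, K, image_size):
    """Project 3D reference points onto a camera image from a past timestamp.

    ref_points:              (N, 3) points in the current ego frame.
    T_global_from_ego_now:   (4, 4) current ego pose (ego -> global).
    T_global_from_ego_past:  (4, 4) past ego pose (ego -> global).
    T_cam_from_ego:          (4, 4) camera extrinsics (ego -> camera, z forward).
    K:                       (3, 3) pinhole intrinsics.
    image_size:              (width, height) in pixels.

    Returns (uv, valid): (N, 2) pixel coordinates and an (N,) boolean mask.
    Assumption: only ego motion is compensated; object motion is ignored.
    """
    n = ref_points.shape[0]
    pts_h = np.concatenate([ref_points, np.ones((n, 1))], axis=1)  # (N, 4) homogeneous

    # Current ego frame -> global frame -> past ego frame.
    T_past_from_now = np.linalg.inv(T_global_from_ego_past) @ T_global_from_ego_now

    # Past ego frame -> past camera frame.
    pts_cam = (T_cam_from_ego @ T_past_from_now @ pts_h.T).T[:, :3]

    depth = pts_cam[:, 2]
    in_front = depth > 1e-5  # points behind the camera cannot be sampled

    # Perspective projection; use a dummy depth of 1 for invalid points
    # to avoid dividing by zero or by a negative depth.
    uvw = (K @ pts_cam.T).T
    uv = uvw[:, :2] / np.where(in_front, depth, 1.0)[:, None]

    width, height = image_size
    valid = (in_front
             & (uv[:, 0] >= 0) & (uv[:, 0] < width)
             & (uv[:, 1] >= 0) & (uv[:, 1] < height))
    return uv, valid
```

The `valid` mask marks the points where features from the past image can be sampled; points that fall outside the image or behind the camera contribute nothing for that view.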

Key ideas

Technical details