ViP3D: End-to-end Visual Trajectory Prediction via 3D Agent Queries

September 2022

tl;dr: Introduce MOT to DETR3D.

Overall impression

In a typical autonomous driving pipeline, perception and prediction are separate modules that communicate through hand-picked features such as agent boxes and past trajectories. This information loss caps the performance upper bound of the prediction module. Moreover, errors in the perception module propagate and accumulate, and are almost impossible to correct downstream in prediction. Finally, prediction is typically trained on perception ground truth, leaving a large gap between training and inference performance.

ViP3D mainly addresses what information to pass between perception and prediction. Instead of explicitly hand-crafted features for each track, ViP3D uses information-rich track queries as the interface between perception and prediction, achieving end-to-end differentiability.
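
A minimal sketch of the idea, with hypothetical shapes and a toy cross-attention standing in for DETR3D-style feature sampling (not ViP3D's actual implementation): a fixed set of agent queries persists across frames, absorbing image features each frame, and the prediction head decodes trajectories directly from those queries rather than from hand-picked boxes.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8       # query feature dimension (hypothetical)
N = 3       # number of agent queries (hypothetical)
T_FUT = 4   # future steps to predict (hypothetical)

# Agent queries carry per-track state across frames; they are the only
# interface between the perception and prediction stages.
queries = rng.standard_normal((N, D))

def cross_attend(queries, image_feats):
    """Toy cross-attention: each query gathers image features
    (a stand-in for DETR3D's 3D-to-2D feature sampling)."""
    attn = np.exp(queries @ image_feats.T)
    attn /= attn.sum(axis=1, keepdims=True)
    return queries + attn @ image_feats

def predict_head(queries, W):
    """Decode each track query directly into a future (x, y) trajectory,
    with no intermediate boxes or heatmaps in between."""
    return (queries @ W).reshape(-1, T_FUT, 2)

W = rng.standard_normal((D, T_FUT * 2))
for _ in range(3):  # three camera frames: persistent queries act as tracks
    image_feats = rng.standard_normal((16, D))
    queries = cross_attend(queries, image_feats)

trajs = predict_head(queries, W)
print(trajs.shape)  # (3, 4, 2): one future trajectory per agent query
```

Because every step here is a differentiable tensor operation, a prediction loss can in principle backpropagate all the way into the perception features, which is the point of using queries rather than hand-crafted track descriptors as the interface.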

Previous efforts in joint perception and prediction include FIERY and BEVerse, but ViP3D is the first paper to seriously bring SOTA prediction approaches into the BEV perception task. ViP3D also explicitly models instance-wise agent detection, tracking and prediction in a fully differentiable fashion. FIERY and BEVerse both use heatmap-based methods, which inevitably rely on heuristics and manual postprocessing, leaving them not end-to-end differentiable.

Query-based methods seem to be the future of end-to-end autonomous driving, and the idea is borrowed in UniAD. The query-centric information processing seems to be inspired by Perceiver, which was first proposed to break the scaling of computation with input size and save computational cost.

Key ideas

Technical details