Learning-Deep-Learning

Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net

July 2019

tl;dr: A single network to do detection, tracking and prediction.

Overall impression

The oral presentation is quite impressive. Modern approaches to autonomous driving have four steps: detection, tracking, motion forecasting and planning.

The assumption of the paper is that tracking and prediction can help object detection, reducing both false positives and false negatives.

The resulting detector is more robust to occlusion and to sparse data at range. It also runs in real time at 33 FPS.

IntentNet is heavily inspired by Fast and Furious (also by Uber ATG). Both combine perception, tracking and prediction by generating bboxes with waypoints. In comparison, IntentNet extends the horizon from 1s to 3s, predicts discrete high-level behaviors, and uses map information.

Tracking is done as a postprocessing step in FaF. PnPNet later incorporates tracking into the perception and prediction (PnP) loop.

Key ideas

Two fusion strategies:
- Early fusion: fuse the time dimension from the beginning. This is essentially a temporal averaging of all frames.
- Late fusion: use 3x3x3 convolutions in two layers to reduce the temporal dimension from 5 to 1. This strategy yields better results than early fusion. Note that no LSTM is used in the process.
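
The temporal arithmetic behind late fusion can be checked with the standard convolution output-size formula. This is a minimal sketch, assuming the 3x3x3 convolutions use no padding and stride 1 along the time axis (these notes do not state that, but it is the setting under which two layers take 5 frames down to 1). Treating early fusion as a single temporal kernel spanning all 5 frames is likewise an assumption, and the helper name `temporal_size_after_conv` is hypothetical.

```python
def temporal_size_after_conv(t_in, kernel=3, padding=0, stride=1):
    """Output length along the time axis after one convolution layer."""
    return (t_in + 2 * padding - kernel) // stride + 1


# Late fusion: two 3x3x3 layers, assumed unpadded along time.
t = 5
sizes = [t]
for _ in range(2):
    t = temporal_size_after_conv(t)
    sizes.append(t)

# Early fusion, viewed here as one temporal kernel covering all 5 frames (assumption).
early = temporal_size_after_conv(5, kernel=5)

print(sizes, early)
```

With these assumptions the time axis shrinks gradually in late fusion, whereas early fusion collapses it in the first layer.
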
Tracklets are decoded from the predictions by averaging.
- Each timestamp has the current detection and n-1 past predictions.
- If detection and motion prediction are perfect, we can have perfect decoding of tracklets. When a past prediction and the current detection overlap, they are considered to be the same object and their bboxes are averaged.
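
The overlap-and-average step can be sketched as follows. This is an illustration only, not the paper's implementation: it assumes axis-aligned boxes `(x_min, y_min, x_max, y_max)` (the real boxes carry a heading), a hypothetical `iou_threshold`, and it simply drops past predictions that match no current detection. The function names are made up for this example.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def merge_boxes(detections, past_predictions, iou_threshold=0.1):
    """Average each current detection with the past predictions that overlap it."""
    merged = []
    for det in detections:
        group = [det] + [p for p in past_predictions if iou(det, p) > iou_threshold]
        # Coordinate-wise mean over the detection and its matched predictions.
        merged.append(tuple(sum(c) / len(group) for c in zip(*group)))
    return merged


detections = [(0.0, 0.0, 2.0, 4.0)]
past_predictions = [(0.2, 0.0, 2.2, 4.0), (10.0, 10.0, 12.0, 14.0)]
merged = merge_boxes(detections, past_predictions)
print(merged)
```

Here only the first past prediction overlaps the detection, so the two are averaged and the distant one is ignored.
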
Temporal information is added by taking all 3D points from the past n frames. These frames need to be ego-motion compensated.
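
Ego-motion compensation amounts to re-expressing past points in the current vehicle frame. The sketch below is a 2D simplification under stated assumptions: ego poses are given as `(x, y, yaw)` in a shared world frame, with x forward and y to the left. The function name `to_current_frame` and the pose format are hypothetical, not taken from the paper.

```python
import math


def to_current_frame(point, past_pose, current_pose):
    """Map a 2D point from the past vehicle frame into the current vehicle frame.

    Poses are (x, y, yaw) of the ego vehicle in a shared world frame.
    """
    px, py = point
    x0, y0, yaw0 = past_pose
    x1, y1, yaw1 = current_pose
    # Past vehicle frame -> world frame.
    wx = x0 + math.cos(yaw0) * px - math.sin(yaw0) * py
    wy = y0 + math.sin(yaw0) * px + math.cos(yaw0) * py
    # World frame -> current vehicle frame (inverse rigid transform).
    dx, dy = wx - x1, wy - y1
    cx = math.cos(yaw1) * dx + math.sin(yaw1) * dy
    cy = -math.sin(yaw1) * dx + math.cos(yaw1) * dy
    return cx, cy


# A static object 10 m ahead of the past pose; the ego has since driven 2 m forward.
static_point = to_current_frame((10.0, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
print(static_point)
```

After compensation a static object keeps a consistent world position across frames, so any remaining displacement reflects the object's own motion.
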
The loss covers both the classification of each location as vehicle or not and the bounding box regression, over all timestamps.
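
A per-location version of such a loss might look like the sketch below. The specific choices are assumptions, not taken from these notes: binary cross-entropy for the vehicle/background term, smooth L1 for the box term, regression applied only at positive locations, and equal weighting of the two terms. All function names are hypothetical.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Cross-entropy between a predicted probability p and a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def smooth_l1(pred, target):
    """Smooth L1 (Huber-style) penalty on a single regression value."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5


def location_loss(cls_probs, labels, box_preds, box_targets):
    """Loss for one spatial location, summed over all timestamps.

    cls_probs[t], labels[t]: vehicle probability and 0/1 label at timestamp t.
    box_preds[t], box_targets[t]: regression vectors at timestamp t.
    """
    total = 0.0
    for p, y, pred, target in zip(cls_probs, labels, box_preds, box_targets):
        total += binary_cross_entropy(p, y)
        if y == 1:  # assumption: regress boxes only where a vehicle is present
            total += sum(smooth_l1(a, b) for a, b in zip(pred, target))
    return total


loss = location_loss(
    cls_probs=[0.5, 0.5],
    labels=[1, 0],
    box_preds=[[0.5, 2.0], [0.0, 0.0]],
    box_targets=[[0.0, 0.0], [0.0, 0.0]],
)
print(loss)
```
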

Technical details

The BEV representation is metric, so prior knowledge about car sizes can be exploited. In total 6 anchors are used at 5 m and 8 m scales. Using anchor boxes reduces the variance of the regression targets, making the network easier to train.
6 regression targets: location offset x, y, log-normalized sizes w, h and heading parameters sin and cos.
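
The six targets can be illustrated with an encode/decode pair. This is a sketch under assumptions: offsets are normalized by the anchor's width and height, and sizes are encoded as the log of the ratio to the anchor size. The exact normalization used in the paper is not given in these notes, and `encode_box` / `decode_box` are hypothetical names.

```python
import math


def encode_box(gt, anchor):
    """Encode a ground-truth box relative to an anchor as 6 regression targets.

    gt is (x, y, w, h, heading); anchor is (x, y, w, h). Meters and radians.
    """
    gx, gy, gw, gh, heading = gt
    ax, ay, aw, ah = anchor
    return (
        (gx - ax) / aw,      # location offset x, normalized by anchor width
        (gy - ay) / ah,      # location offset y, normalized by anchor height
        math.log(gw / aw),   # log-normalized width
        math.log(gh / ah),   # log-normalized height
        math.sin(heading),   # heading as sin/cos avoids the wrap-around at +/- pi
        math.cos(heading),
    )


def decode_box(targets, anchor):
    """Invert encode_box, recovering (x, y, w, h, heading)."""
    dx, dy, dw, dh, s, c = targets
    ax, ay, aw, ah = anchor
    return (
        ax + dx * aw,
        ay + dy * ah,
        aw * math.exp(dw),
        ah * math.exp(dh),
        math.atan2(s, c),
    )


anchor = (0.0, 0.0, 2.5, 5.0)
gt = (1.0, 2.0, 2.0, 4.5, 0.3)
targets = encode_box(gt, anchor)
recovered = decode_box(targets, anchor)
print(targets)
print(recovered)
```

Because the anchor is already close to a typical car size, the log-size targets stay near zero, which is the variance reduction mentioned above.
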
Only vehicles with more than 3 points in the 3D bbox are evaluated; the rest are counted as “don’t care”.
Tracking KPIs such as MOTA, MOTP, MT (mostly tracked) and ML (mostly lost) are reported.
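
For reference, two of these metrics can be written down in a few lines. The sketch uses the commonly cited CLEAR MOT form of MOTA and the usual 80% / 20% coverage thresholds for MT / ML. Whether the paper's evaluation follows exactly these conventions is not stated in these notes, and the function names are hypothetical.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + ID switches) / number of ground-truth objects."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def track_status(tracked_frames, total_frames):
    """Classify a ground-truth trajectory by the fraction of its life it was tracked."""
    ratio = tracked_frames / total_frames
    if ratio >= 0.8:
        return "MT"  # mostly tracked
    if ratio < 0.2:
        return "ML"  # mostly lost
    return "PT"      # partially tracked


score = mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100)
statuses = [track_status(9, 10), track_status(5, 10), track_status(1, 10)]
print(score, statuses)
```
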

Notes

How long in seconds can this predict into the future? 10 frames does not mean anything without the frame rate. –> According to IntentNet this is only 1 second.
Check the KITTI KPI evaluation criterion for don’t care regions.