Learning-Deep-Learning

Recurrent SSD: Recurrent Multi-frame Single Shot Detector for Video Object Detection

January 2020

tl;dr: Using history to boost object detection on KITTI.

Overall impression

There is a series of papers on recurrent single-stage methods for object detection. The main idea is to add an RNN layer directly on top of the feature map of the entire image.

Another way to look at feature aggregation over time is as data fusion. Instead of fusing information from different sensors, it fuses information from different timestamps. The fusion technique can be element-wise (addition or max), concatenation, or a recurrent layer.
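The fusion options listed above can be compared on toy data. This is a minimal sketch, not the paper's implementation: each frame is reduced to a short feature vector, the function names are made up for illustration, and the "recurrent" variant uses a fixed blending gate where a real recurrent layer would learn its gates.

```python
def fuse_add(frames):
    """Element-wise sum of per-frame feature vectors."""
    return [sum(vals) for vals in zip(*frames)]

def fuse_max(frames):
    """Element-wise max of per-frame feature vectors."""
    return [max(vals) for vals in zip(*frames)]

def fuse_concat(frames):
    """Concatenate per-frame features; output size grows with frame count."""
    return [v for frame in frames for v in frame]

def fuse_recurrent(frames, gate=0.5):
    """Toy recurrent fusion: blend the running state with each new frame.

    Assumption: `gate` is a fixed constant here, standing in for a learned gate.
    """
    state = [0.0] * len(frames[0])
    for frame in frames:
        state = [(1 - gate) * s + gate * x for s, x in zip(state, frame)]
    return state

# Three frames, two features each (oldest first).
frames = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
added = fuse_add(frames)
maxed = fuse_max(frames)
concatenated = fuse_concat(frames)
recurrent = fuse_recurrent(frames)
```

Addition, max, and the recurrent update keep the feature size fixed regardless of the number of frames, while concatenation scales with it. Only the recurrent update is order-dependent, weighting recent frames more heavily.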

This is perhaps the cleanest solution to the video object detection problem, and much cleaner than ROLO.

K=4 frames

Key ideas

Augment the SSD meta-architecture with a conv-recurrent layer (conv-GRU). This keeps SSD fully convolutional, and therefore fast.
There are two ways to integrate information from multiple frames. They are orthogonal to each other and can be used together.
- Feature level: accumulate feature maps across time, as in Towards High Performance Video Object Detection.
- Box level: tracking by detection.
It does not require extra labeled training data, as only the image at the final timestamp needs labeled bounding boxes.
The aggregated feature map in the recurrent layer can be used for visualization. It can recover heavily occluded objects (similar to ROLO).
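To make the conv-GRU idea concrete, here is a minimal sketch of one recurrent update on a single-channel feature map, using the standard GRU gate equations with 3x3 convolutions in place of matrix products. It is an illustration under assumptions: the kernel names (`xz`, `hz`, ...), the single channel, the absence of biases, and the identity kernels in the usage lines are all made up, and the paper's actual layer sizes are not reproduced.

```python
import math

def conv3x3(fmap, kernel):
    """Zero-padded 3x3 convolution (cross-correlation) on a 2-D list."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + 1][dj + 1] * fmap[y][x]
            out[i][j] = acc
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def combine(a, b, fn):
    """Apply fn element-wise to two maps of the same shape."""
    return [[fn(p, q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def conv_gru_step(x, h, k):
    """One conv-GRU update; k maps hypothetical gate names to 3x3 kernels."""
    # Update gate z and reset gate r, each computed from input and previous state.
    z = combine(conv3x3(x, k["xz"]), conv3x3(h, k["hz"]), lambda a, b: sigmoid(a + b))
    r = combine(conv3x3(x, k["xr"]), conv3x3(h, k["hr"]), lambda a, b: sigmoid(a + b))
    # Candidate state uses the reset-gated previous state.
    rh = combine(r, h, lambda a, b: a * b)
    cand = combine(conv3x3(x, k["xh"]), conv3x3(rh, k["hh"]), lambda a, b: math.tanh(a + b))
    # New state: (1 - z) * h + z * candidate, element-wise.
    keep = combine(z, h, lambda a, b: (1 - a) * b)
    new = combine(z, cand, lambda a, b: a * b)
    return combine(keep, new, lambda a, b: a + b)

# Usage: aggregate K=4 toy frames into one state map of the same spatial size.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
kernels = {name: identity for name in ("xz", "hz", "xr", "hr", "xh", "hh")}
frames = [[[0.0, 1.0], [1.0, 0.0]]] * 4
state = [[0.0, 0.0], [0.0, 0.0]]
for frame in frames:
    state = conv_gru_step(frame, state, kernels)
```

Because every operation is a convolution or an element-wise function, the state keeps the spatial layout of the input feature map, which is what lets the detection head run on it unchanged and what makes the aggregated map directly viewable.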

Technical details

Late concatenation (right after the backbone and before the detection head) with an additional conv layer performs almost as well, but it takes 4 images as input and slows the system down considerably. This can be improved with a circular buffer, so that features from earlier frames are reused rather than recomputed, but that adds complexity to the system.
Recurrent SSD improves over single-frame SSD by more than 2.7 mAP on KITTI.
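The circular-buffer idea for late concatenation can be sketched as follows. This is an assumption about how such a buffer might be wired up, not something the paper specifies: `FeatureBuffer` and `toy_backbone` are hypothetical names, and flattening lists stands in for channel-wise concatenation of feature maps.

```python
from collections import deque

class FeatureBuffer:
    """Keeps backbone features for the last k frames (hypothetical helper)."""

    def __init__(self, k, backbone):
        self.backbone = backbone
        # deque with maxlen drops the oldest entry automatically when full.
        self.buffer = deque(maxlen=k)

    def push(self, image):
        # The backbone runs once per new frame; older features are reused.
        self.buffer.append(self.backbone(image))
        # Stand-in for channel-wise concatenation of the buffered features.
        return [v for feat in self.buffer for v in feat]

# Toy backbone that records how often it is called.
calls = []
def toy_backbone(image):
    calls.append(image)
    return [image * 10]

buf = FeatureBuffer(k=4, backbone=toy_backbone)
outputs = [buf.push(i) for i in range(6)]
```

With the buffer, each incoming frame costs one backbone pass instead of four. Until four frames have arrived the concatenated feature is shorter than its steady-state size, so a real system would need to pad or repeat the first frame, which is part of the extra complexity.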

Notes

Association LSTM is inspired by Siamese networks for re-identification and reworks the Hungarian algorithm so that it can be trained jointly with the neural network.
How can an LSTM be introduced efficiently with a stride greater than 1? By keeping two states?
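For reference, the classical (non-learned) box-level association that the first note builds on can be sketched as a minimum-cost matching between boxes in consecutive frames. This is not Association LSTM: the cost here is simply 1 - IoU, and an exhaustive search over permutations is swapped in for the Hungarian algorithm, which finds the same optimum but scales to many boxes. The function names and the equal-box-count assumption are for illustration only.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(prev_boxes, curr_boxes):
    """Return match[i] = index in curr_boxes assigned to prev_boxes[i].

    Brute-force stand-in for the Hungarian algorithm; assumes both frames
    contain the same number of boxes and that the count is small.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(prev_boxes))):
        cost = sum(1.0 - iou(prev_boxes[i], curr_boxes[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = list(perm), cost
    return best

# Two objects that each moved by one pixel, detected in swapped order.
prev_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
curr_boxes = [(21, 21, 31, 31), (1, 1, 11, 11)]
match = associate(prev_boxes, curr_boxes)
```

The hard argmin over permutations is not differentiable, which is why a jointly trained approach has to rework this step rather than use it as is.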