Learning-Deep-Learning

Visibility Guided NMS: Efficient Boosting of Amodal Object Detection in Crowded Traffic Scenes

June 2020

tl;dr: Use modal bbox to guide NMS of amodal bbox.

Overall impression

The paper addresses a critical issue for autonomous driving in parking lots and urban areas, where many parked cars are heavily occluded. The same issue arises with crowds of pedestrians.

Humans perceive an object's full extent even when it is partially occluded, which corresponds to predicting a bbox covering the entire object. This is called amodal perception (cf. amodal completion).

This is very similar to R2 NMS (CVPR 2020), which focuses on pedestrian detection in crowds.

Key ideas

Train the object detector with 4 additional attributes, so that it predicts both the visible part (pixel-based modal bbox) and the entire object (amodal bbox).
VG-NMS: NMS is performed on the pixel-based bboxes that describe the actually visible parts, but the output is the amodal bboxes at the indices retained by the pixel-based NMS.
Pixel-based modal bboxes can be generated from segmentation masks. Alternatively, they could be generated from the ordering of amodal bboxes based on geometric priors. For example, a bbox with a larger ymax is closer to the camera.
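
The VG-NMS idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names (`iou_one_to_many`, `vg_nms`), the `[x1, y1, x2, y2]` box layout, and the 0.5 IoU threshold are all assumptions made for the example. Suppression decisions use only the modal boxes; the amodal boxes are simply looked up at the surviving indices.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def vg_nms(amodal, modal, scores, iou_thresh=0.5):
    """Greedy NMS on modal (visible) boxes; returns kept indices and amodal boxes.

    iou_thresh is an illustrative default, not a value taken from the paper.
    """
    order = np.argsort(-scores)          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        ious = iou_one_to_many(modal[i], modal[rest])
        order = rest[ious <= iou_thresh]  # suppress by *visible* overlap only
    return keep, amodal[np.asarray(keep, dtype=int)]

# Hypothetical scene: car 1 is mostly hidden behind car 0; index 2 is a
# duplicate detection of car 0.
amodal = np.array([[0, 0, 100, 60], [20, 0, 120, 60], [2, 1, 101, 61]], dtype=float)
modal = np.array([[0, 0, 100, 60], [100, 0, 120, 60], [2, 1, 101, 61]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])

keep_vg, boxes_vg = vg_nms(amodal, modal, scores)
keep_std, _ = vg_nms(amodal, amodal, scores)  # standard NMS on amodal boxes
print(keep_vg, keep_std)
```

In this toy scene the amodal boxes of the two cars overlap with an IoU of about 0.67, so standard NMS on amodal boxes drops the occluded car, while the modal boxes do not overlap at all and VG-NMS keeps both cars and still removes the duplicate.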

Technical details

Don't-care objects: KITTI ignores 25x25 pixels, and Cityscapes ignores 10x10 pixels.
VG-NMS is better than Soft-NMS. Soft-NMS does not seem to improve performance much over standard NMS.
Simultaneously regressing amodal and modal bboxes leads to better performance, under standard NMS.

Notes

The modal bbox (pixel bbox) used in VG-NMS can be derived from the amodal bboxes by sorting them in occlusion order and marking the non-occluded part of each.
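
One possible way to realize this note is sketched below, assuming the geometric prior mentioned earlier (a larger ymax means the object is closer to the camera). The function name `modal_from_amodal`, the integer pixel box layout with exclusive `x2`/`y2`, and the painter's-algorithm approach are all assumptions for illustration, not something the paper prescribes. Boxes are painted far-to-near onto an ownership map, and each object's modal box is the tight box around the pixels it still owns.

```python
import numpy as np

def modal_from_amodal(amodal, img_h, img_w):
    """Approximate visible boxes by painting amodal boxes from far to near.

    Assumption: larger ymax = closer to the camera.
    Boxes are integer [x1, y1, x2, y2] with x2/y2 exclusive.
    Fully hidden objects are returned as [-1, -1, -1, -1].
    """
    amodal = np.asarray(amodal, dtype=int).copy()
    amodal[:, [0, 2]] = np.clip(amodal[:, [0, 2]], 0, img_w)  # keep inside image
    amodal[:, [1, 3]] = np.clip(amodal[:, [1, 3]], 0, img_h)

    owner = np.full((img_h, img_w), -1, dtype=int)  # which box owns each pixel
    for i in np.argsort(amodal[:, 3], kind="stable"):  # far first, near last
        x1, y1, x2, y2 = amodal[i]
        owner[y1:y2, x1:x2] = i  # nearer boxes overwrite farther ones

    modal = np.full_like(amodal, -1)
    for i in range(len(amodal)):
        ys, xs = np.nonzero(owner == i)
        if ys.size:  # tight box around the still-visible pixels
            modal[i] = [xs.min(), ys.min(), xs.max() + 1, ys.max() + 1]
    return modal

# Hypothetical example: box 0 (ymax = 60) is closer than box 1 (ymax = 50).
modal = modal_from_amodal([[0, 0, 100, 60], [20, 0, 120, 50]], img_h=80, img_w=150)
print(modal.tolist())
```

Because box-shaped occluders can leave an L-shaped visible region, the tight box around it may overestimate the visible part; a segmentation mask would give a more faithful modal bbox.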