Learning-Deep-Learning

MT-CNN: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks

September 2019

tl;dr: One of the most widely used methods for face detection and face landmark regression.

Overall impression

The paper seems rather primitive compared to general object detection frameworks like Faster R-CNN. MTCNN is more like the original R-CNN method, in that each proposal is cropped from the image and passed through a CNN separately.

However, it is also enlightening that a very shallow CNN (O-Net) applied to cropped image patches can regress landmarks accurately. Landmark regression given an object bbox may not require that large a receptive field anyway.

The paper is largely inspired by Hua Gang’s paper cascnn: A Convolutional Neural Network Cascade for Face Detection.

Key ideas

- Three stages
  - P-Net: proposal network, 12x12 input
  - R-Net: false positive (FP) reduction, 24x24 input
  - O-Net: landmark regression, 48x48 input
- P-Net is trained on 12x12 patches but deployed fully convolutionally for detection, which is equivalent to running it as a sliding window.
- R-Net input is cropped from the boxes output by P-Net.
- O-Net input is cropped from the boxes output by R-Net.
- Multiple datasets are used, each supervising different tasks (WIDER FACE for face classification and bbox regression, CelebA for landmarks).
- The loss is weighted differently in different stages, and masked so that each sample only contributes to the tasks it has labels for.
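
To make the "trained on patches, deployed convolutionally" point concrete, here is a minimal sketch of how a P-Net output cell maps back to a box in the original image. The stride of 2 assumes P-Net's single 2x2 max-pool; `min_face_size=20` and `factor=0.709` are defaults commonly seen in public implementations, not values taken from these notes, and the function names are made up for illustration.

```python
def pyramid_scales(height, width, min_face_size=20, factor=0.709, cell=12):
    """Scales for the image pyramid, so that faces of at least
    min_face_size pixels shrink to roughly the 12x12 P-Net window."""
    scale = cell / min_face_size
    min_side = min(height, width) * scale
    scales = []
    # Stop once the resized image is smaller than one P-Net window.
    while min_side >= cell:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


def cell_to_box(row, col, scale, stride=2, cell=12):
    """Map one P-Net output cell to (x1, y1, x2, y2) in original-image pixels.

    Each output cell sees a 12x12 window of the resized image, and
    neighbouring cells are `stride` pixels apart (assumed stride of 2).
    """
    x1 = col * stride / scale
    y1 = row * stride / scale
    x2 = (col * stride + cell) / scale
    y2 = (row * stride + cell) / scale
    return (x1, y1, x2, y2)
```

Every cell of the P-Net score map is thus one sliding-window position. Cells above a score threshold become the boxes that are cropped and fed to R-Net.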

Technical details

Not a single model, but training can be done jointly.
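
A rough sketch of the per-sample multi-task loss, to show what the weighting and masking amount to. The task weights and the sample-type masks follow my reading of the paper (alpha weights per stage, a 0/1 beta indicator per sample type). The function names and the plain-Python losses are illustrative assumptions, and online hard sample mining is left out.

```python
import math

# Task weights (det, box, landmark) per stage, as I recall them from the paper.
STAGE_WEIGHTS = {
    "pnet": (1.0, 0.5, 0.5),
    "rnet": (1.0, 0.5, 0.5),
    "onet": (1.0, 0.5, 1.0),
}

# Which tasks (det, box, landmark) each sample type supervises.
TASK_MASK = {
    "negative": (1, 0, 0),
    "positive": (1, 1, 0),
    "part": (0, 1, 0),
    "landmark": (0, 0, 1),
}


def cross_entropy(face_prob, is_face):
    """Binary cross-entropy for the face / non-face classifier."""
    p = face_prob if is_face else 1.0 - face_prob
    return -math.log(max(p, 1e-12))


def squared_error(pred, target):
    """Squared Euclidean distance, used for box offsets and landmarks."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def sample_loss(stage, sample_type, face_prob, is_face,
                box_pred, box_gt, lm_pred, lm_gt):
    """Weighted, masked sum of the three task losses for one sample."""
    alphas = STAGE_WEIGHTS[stage]
    betas = TASK_MASK[sample_type]
    task_losses = (
        cross_entropy(face_prob, is_face),
        squared_error(box_pred, box_gt),
        squared_error(lm_pred, lm_gt),
    )
    return sum(a * b * l for a, b, l in zip(alphas, betas, task_losses))
```

Targets for masked-out tasks can be any placeholder (zeros here), since their loss term is multiplied by 0.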

Notes

new implementation in TF and original version
Cropping a patch from the image and resizing it is essentially RoIAlign applied to the original image. Maybe we can save some computation by cropping the feature map from the first stage instead. The features from the fourth stage are most likely already too spatially blurred to contain any localization info.
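
A minimal pure-Python sketch of the crop-and-resize step, to make the RoIAlign analogy concrete. It takes one bilinear sample per output bin, whereas RoIAlign proper averages several samples per bin. The function names and the pixel-centre convention (pixel `j` covers `[j, j+1)`) are my own assumptions, not taken from any MTCNN implementation.

```python
def bilinear(image, y, x):
    """Sample a 2D list-of-lists image at fractional index (y, x),
    clamping coordinates to the image border."""
    h, w = len(image), len(image[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = image[y0][x0] * (1 - dx) + image[y0][x1] * dx
    bottom = image[y1][x0] * (1 - dx) + image[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def crop_and_resize(image, box, out_size):
    """Resample box = (x1, y1, x2, y2), given in continuous pixel
    coordinates, into an out_size x out_size patch."""
    x1, y1, x2, y2 = box
    step_x = (x2 - x1) / out_size
    step_y = (y2 - y1) / out_size
    # Sample at the centre of each output bin; the -0.5 converts from
    # continuous coordinates to pixel indices.
    return [
        [
            bilinear(image,
                     y1 + (i + 0.5) * step_y - 0.5,
                     x1 + (j + 0.5) * step_x - 0.5)
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]
```

The same function would work unchanged on a single-channel feature map with the box rescaled by the feature stride, which is the computation-saving variant speculated about above.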