October 2019
tl;dr: BEV localization for pedestrians with uncertainty.
## Overall impression
Uses off-the-shelf human detectors and 2D joint detectors (Mask RCNN and PifPaf). It exploits the relatively fixed height of pedestrians, in particular the shoulder-hip segment (~50 cm), to infer depth.

The paper also makes realistic predictions of uncertainty through aleatoric/epistemic uncertainty. This helps mitigate high-risk cases where the GT distance is smaller than the predicted one (for which an accident is more likely to happen).
This idea can be readily exploited for monocular 3DOD of cars (rigid bodies with known shape).

This paper is well written and the quality of the open-sourced code is amazing! They even have a webcam demo.

The paper is quite similar to the idea of DisNet, which uses different bbox features to estimate the depth of the object with a simple MLP.

The paper is further extended by Perceiving Humans, which additionally predicts orientation and 2D bbox at the same time, for social distancing.
## Key ideas
- Intrinsic task error is the localization error due to the natural variation of human height. It is estimated from the height distribution of the population (1 m / 20 m = 5% relative error).
- Uncertainty:
  - Aleatoric: a Laplace prior for the aleatoric uncertainty leads to an L1 loss term (instead of the L2 term from a Gaussian prior):

$$L = \frac{\left|1 - \mu/x\right|}{b} + \log b = e^{-s} \left|1 - \mu/x\right| + s$$

where $\mu$ is the predicted distance, $x$ the GT distance, and $s = \log b$ the predicted log-spread; note the error term $|1 - \mu/x|$ is relative rather than absolute.
    - Note that the aleatoric uncertainty does not characterize the noise in the input image, but rather the noise in the output of the joint prediction network.
  - Epistemic: Monte Carlo dropout.
- Algorithm
  - The first step extracts 2D joints from the image; this escapes the image domain and reduces input dimensionality.
  - The 2D joints are fed to a shallow MLP that predicts the distance and its associated aleatoric uncertainty.
  - Geometric baseline: inference using the most stable keypoint segment.
    - Project each keypoint back to the GT distance to calculate the 3D length of the keypoint segments (head-shoulder, shoulder-hip, hip-ankle). The segment with the smallest variance is then picked to infer the distance.
  - PifPaf gives better performance than Mask RCNN in the geometric baseline. Maybe the bottom-up approach gives more accurate keypoint estimates.
  - This is exactly what GS3D and MonoGRNet V2 do. This is one potential direction of improvement for keypoint-based approaches.
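The Laplace-prior loss above is simple to write down. A minimal NumPy sketch (function and variable names are illustrative, not from the official repo):

```python
import numpy as np

def laplace_nll(mu, s, x):
    """Laplace negative log-likelihood on the relative distance error.

    mu: predicted distance, s = log(b): predicted log-spread,
    x:  ground-truth distance.
    With b = e^s, the loss is |1 - mu/x| / b + log(b)
                            = exp(-s) * |1 - mu/x| + s.
    """
    return np.exp(-s) * np.abs(1.0 - mu / x) + s
```

Predicting $s$ instead of $b$ keeps the spread positive without constraints; at the optimum the network trades the data-fit term against the $+s$ penalty, so confident-but-wrong predictions are punished hard.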
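The epistemic part via Monte Carlo dropout amounts to keeping dropout active at test time and reading uncertainty off the spread of repeated stochastic forward passes. A toy sketch with a hypothetical one-hidden-layer MLP (all names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, W1, W2, drop_p=0.2, train=True):
    """One-hidden-layer MLP with (inverted) dropout kept active at test time."""
    h = np.maximum(0.0, x @ W1)              # ReLU hidden layer
    if train:
        mask = rng.random(h.shape) > drop_p  # Bernoulli dropout mask
        h = h * mask / (1.0 - drop_p)        # inverted-dropout rescaling
    return h @ W2

def mc_dropout_predict(x, W1, W2, T=50):
    """Epistemic uncertainty via Monte Carlo dropout: run T stochastic
    forward passes; the spread of the outputs estimates model uncertainty."""
    preds = np.array([mlp_forward(x, W1, W2, train=True) for _ in range(T)])
    return preds.mean(axis=0), preds.std(axis=0)
```

The standard deviation over the $T$ passes is the epistemic estimate, which is then combined with the aleatoric spread predicted by the loss above.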
The main criterion is that the dimension of any object projected into the image plane depends only on the norm of the vector $D = (x_c, y_c, z_c)$, not on the combination of its components.
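The criterion above is just the pinhole similar-triangle relation: a segment of known real length that spans $h$ pixels sits at distance $\|D\| \approx f \cdot H / h$. A sketch of the geometric-baseline idea (segment length ~0.50 m is from the notes; the population spread `sigma_H` is an illustrative assumption, not a number from the paper):

```python
def distance_from_segment(h_pixels, focal_px, H_real=0.50, sigma_H=0.05):
    """Estimate pedestrian distance from one keypoint segment via
    similar triangles: d ~= f * H / h.

    h_pixels: projected segment length in pixels (e.g. shoulder-hip)
    H_real:   assumed real-world segment length (~0.50 m per the notes)
    sigma_H:  assumed population spread of that length (hypothetical)
    Returns (distance, intrinsic_task_error), both in meters.
    """
    d = focal_px * H_real / h_pixels
    # Intrinsic task error: the relative spread of the segment length
    # maps one-to-one into relative distance error.
    task_error = d * (sigma_H / H_real)
    return d, task_error
```

This makes the "intrinsic task error" concrete: no regressor can beat the $d \cdot \sigma_H / H$ floor, because the image measurement alone cannot disambiguate a short, near pedestrian from a tall, far one.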
## Technical details
- The keypoints are projected onto the normalized image plane at z = 1. This helps generalize to different cameras. The joints are also zero-meaned for better generalization.
- The top-down approach (Mask RCNN) and the bottom-up approach (PifPaf) yield very similar results.
- Evaluation metrics: ALP (average localization precision, i.e. recall at different distance thresholds) and ALE (average localization error), in meters.
- The geometric baseline
- The stereo method fails beyond 30 meters.
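The camera-agnostic preprocessing above can be sketched in a few lines: multiply pixel keypoints by the inverse intrinsic matrix to land on the z = 1 plane, then zero-mean per instance (a minimal sketch; names are illustrative):

```python
import numpy as np

def normalize_keypoints(uv, K):
    """Map pixel keypoints to the normalized image plane (z = 1)
    and zero-mean them, as described in the notes.

    uv: (N, 2) array of pixel coordinates
    K:  (3, 3) camera intrinsic matrix
    Returns an (N, 2) camera-agnostic, zero-meaned array.
    """
    uv1 = np.hstack([uv, np.ones((len(uv), 1))])  # homogeneous pixels
    xy1 = uv1 @ np.linalg.inv(K).T                # back-project to z = 1
    xy = xy1[:, :2]
    return xy - xy.mean(axis=0)                   # zero-mean per instance
```

Dividing out the intrinsics is what lets a single MLP trained on one camera transfer to another; the zero-meaning removes the dependence on where the person appears in the frame.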
## Notes