CAM-Convs: Camera-Aware Multi-Scale Convolutions for Single-View Depth

November 2019

tl;dr: Extension of CoordConv that makes deep learning perception models invariant to the camera.

Overall impression

Models are usually overfitted to a single benchmark whose train and test data come from the same camera, so they generalize poorly to an unseen camera (with a different sensor size and intrinsics). This prevents us from pooling data taken with other cameras to train data-hungry deep learning models.

The clever part is how the sensor size (image resolution in pixels) and the focal length are encoded as inputs to the convolution.

Embedding metadata into conv:

The basic idea is to convert metadata into pseudo-images suitable for CNNs. CAM-Convs precompute pixel-wise coordinate maps and field-of-view (FoV) maps and concatenate them to the original input.
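The pseudo-image construction can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the paper's implementation: the function names `cam_conv_maps` and `concat_cam_channels` are hypothetical, the pinhole intrinsics `fx, fy, cx, cy` are assumed to be given in pixels, and the choice of exactly four extra channels (centered coordinates and FoV angles, per axis) is an assumption made for the example.

```python
import numpy as np


def cam_conv_maps(height, width, fx, fy, cx, cy):
    """Build camera-aware pseudo-image channels (hypothetical helper).

    Returns an array of shape (4, height, width):
    centered x, centered y, FoV angle x, FoV angle y.
    """
    xs = np.arange(width, dtype=np.float32)
    ys = np.arange(height, dtype=np.float32)
    # Centered coordinates: pixel position relative to the principal point.
    # Default 'xy' indexing yields arrays of shape (height, width).
    cc_x, cc_y = np.meshgrid(xs - cx, ys - cy)
    # FoV maps: angle (radians) between each pixel's ray and the optical axis.
    fov_x = np.arctan(cc_x / fx)
    fov_y = np.arctan(cc_y / fy)
    return np.stack([cc_x, cc_y, fov_x, fov_y]).astype(np.float32)


def concat_cam_channels(image, fx, fy, cx, cy):
    """Append the camera channels to a (C, H, W) image or feature map."""
    _, h, w = image.shape
    maps = cam_conv_maps(h, w, fx, fy, cx, cy)
    return np.concatenate([image.astype(np.float32), maps], axis=0)
```

Because the maps depend only on the resolution and the intrinsics, they can be computed once per camera and reused for every frame. Two cameras that see the same scene with different focal lengths produce different FoV channels, which is the signal the network can use to disambiguate depth.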

This idea is also extended in PETR, which bridges 2D images with 3D reasoning.

BEVDepth also proposes a way to do camera-aware depth prediction, using calibration features and a Squeeze-and-Excitation module.
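For contrast with the concatenation approach, here is a rough sketch of Squeeze-and-Excitation-style channel gating driven by calibration parameters. It is not BEVDepth's actual code: the function name `camera_aware_gate`, the two-layer MLP, and all weight shapes are hypothetical and chosen only to show the mechanism.

```python
import numpy as np


def camera_aware_gate(features, calib, w1, b1, w2, b2):
    """Reweight feature channels with a gate predicted from calibration.

    features: (C, H, W) feature map
    calib:    (K,) calibration vector, e.g. normalized [fx, fy, cx, cy]
    w1, b1:   (hidden, K) and (hidden,) weights of the first MLP layer
    w2, b2:   (C, hidden) and (C,) weights of the second MLP layer
    """
    hidden = np.maximum(w1 @ calib + b1, 0.0)          # ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # sigmoid, one value per channel
    # Broadcast the per-channel gate over the spatial dimensions.
    return features * gate[:, None, None]
```

Here the camera information modulates existing channels instead of being appended as extra input channels, so the spatial layout of the features is unchanged.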

Key ideas

Technical details