Multi-view Convolutional Neural Networks for 3D Shape Recognition (MVCNN)

Mar 2019

tl;dr: Aggregate CNN features from multiple 2D projections of a 3D object to obtain a high-quality 3D feature.

Overall impression

Using pre-trained and fine-tuned CNNs to extract 2D image features is already a widespread practice. This paper explores ways to aggregate these per-view 2D features (concatenation, average pooling or max pooling) into a single high-quality feature for the 3D object. Because this feature should not depend on the number of 2D projections or on the order of the per-view features (they form an unordered set), element-wise max pooling across views is a very natural choice. In addition, learning a low-rank Mahalanobis metric significantly boosts retrieval performance.
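A minimal sketch of the two ingredients in plain Python: view pooling as an element-wise max over per-view feature vectors, and a squared distance under a low-rank Mahalanobis metric M = W^T W. The function names, the toy feature values and the projection W are hypothetical. In MVCNN the per-view features come from a CNN and W is learned; neither step is modeled here.

```python
from itertools import permutations
from typing import List, Sequence

Vector = List[float]


def view_pool(view_features: Sequence[Sequence[float]]) -> Vector:
    """Element-wise max over per-view feature vectors.

    Each inner sequence stands in for the CNN feature of one 2D view
    (hypothetical values). The result does not depend on view order
    and the function accepts any number of views.
    """
    if not view_features:
        raise ValueError("need at least one view")
    dim = len(view_features[0])
    if any(len(f) != dim for f in view_features):
        raise ValueError("all view features must have the same length")
    # For each feature dimension, keep the strongest response across views.
    return [max(f[i] for f in view_features) for i in range(dim)]


def low_rank_mahalanobis_sq(x: Sequence[float], y: Sequence[float],
                            W: Sequence[Sequence[float]]) -> float:
    """Squared distance ||W (x - y)||^2.

    W is a p x d projection (p < d gives a low-rank metric). This equals
    (x - y)^T M (x - y) with M = W^T W, whose rank is at most p.
    W is assumed to be given; learning it is not shown.
    """
    diff = [a - b for a, b in zip(x, y)]
    # Project the difference vector into the p-dimensional space.
    proj = [sum(w * d for w, d in zip(row, diff)) for row in W]
    return sum(p * p for p in proj)


if __name__ == "__main__":
    views = [[0.1, 0.9, 0.3], [0.7, 0.2, 0.4], [0.5, 0.5, 0.8]]
    pooled = view_pool(views)
    # Every ordering of the views yields the same shape descriptor.
    assert all(view_pool(list(p)) == pooled for p in permutations(views))
    print(pooled)  # [0.7, 0.9, 0.8]

    W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # illustrative 2 x 3 projection
    print(low_rank_mahalanobis_sq([3.0, 1.0, 2.0], [1.0, 1.0, 0.0], W))  # 8.0
```

Since max is commutative and associative, the pooled vector is identical for every ordering of the views, and the same function works for 3 views or 12. Average pooling is also order-invariant; max pooling instead keeps, per dimension, the strongest response seen from any single view.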

Key ideas

Technical details