MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

February 2020

tl;dr: Update the dictionary encoder with a momentum (moving-average) rule to improve upon previous work on unsupervised pretraining.

Overall impression

The whole field of unsupervised visual representation learning is gaining more attention due to the recent success of unsupervised pretraining in NLP (BERT, GPT). The framing of the problem is not new, and the paper largely inherits its setup from previous work (e.g. InstDisc). This paper draws a lot of inspiration from the InstDisc paper, and can be seen as an immediate update to it.

The main contribution of this paper is how the dictionary encoder is updated. InstDisc proposed a dictionary that decouples the dictionary size from the batch size, but it maintains and updates the representation of each instance separately. MoCo instead updates the encoder itself, using a momentum rule. The momentum update reminds me of the target network trick in DQN. (The trick of caching a historical version of the model is also mentioned in a discussion thread on Zhihu (知乎).)
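
To make the momentum rule concrete, here is a minimal sketch of the update, assuming the parameters are plain Python floats rather than network tensors. The names `momentum_update`, `key_params` and `query_params` are hypothetical and chosen for illustration, and the momentum value used in the loop is illustrative rather than a setting from the paper. The idea is that the dictionary (key) encoder is never trained by backpropagation; it slowly tracks the encoder that is trained (the query encoder, in the paper's terminology).

```python
def momentum_update(key_params, query_params, m=0.999):
    """Return updated key-encoder parameters: m * key + (1 - m) * query.

    Hypothetical helper for illustration; parameters are plain floats here.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


# Toy example: the query parameters are held fixed, so the key parameters
# drift toward them as an exponential moving average.
query = [1.0, -2.0]
key = [0.0, 0.0]
for _ in range(1000):
    key = momentum_update(key, query, m=0.99)  # illustrative momentum value
```

With a momentum close to 1 the key parameters change slowly from step to step, which is the same intuition as the slowly updated target network in DQN.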

Key ideas

Technical details