RPT: Robot Learning with Sensorimotor Pre-training

March 2024

tl;dr: Masked modeling to pretrain a multimodal robotic foundation model.

Overall impression

Pretraining outperforms training from scratch, transfers across tasks, labs, and robots, and scales well.

This paper focuses on masked modeling rather than autoregressive/generative modeling. The authors also tried causal masking, though still with noncausal attention, and the single-task results seem about the same, so this is an incomplete trial of autoregressive pretraining. Overall, the industry seems to be moving toward generative modeling for better scalability and more mature recipes; it would be nice to see a generative variant of this model.

Robotic data contains rich sensory and motor (action) information that is difficult to capture with visual pretraining alone. Unlabeled sensorimotor trajectories implicitly encode the structure of the physical world, and we can use them to learn sensorimotor representations for downstream robotic tasks. The paper masks tokens across all modalities and time with a high (70%-90%) masking ratio. This particular masking strategy is critical for encouraging the model to learn cross-modal, spatio-temporal representations.
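The masking strategy can be sketched as sampling a high-ratio random mask jointly over the (time, modality) token grid. This is a hypothetical illustration, not the paper's actual code; the function name, the 16x3 token grid, and the specific modality labels are assumptions.

```python
import numpy as np

def sample_mask(num_timesteps, num_modalities, mask_ratio=0.8, rng=None):
    """Sample a random mask over (time, modality) token slots.

    True = masked (to be predicted), False = visible to the encoder.
    Hypothetical sketch of the strategy described above: mask tokens
    jointly across all modalities and time at a high ratio (70%-90%).
    """
    rng = np.random.default_rng(rng)
    total = num_timesteps * num_modalities
    num_masked = int(round(total * mask_ratio))
    flat = np.zeros(total, dtype=bool)
    # Pick token slots to mask uniformly at random, without replacement.
    idx = rng.choice(total, size=num_masked, replace=False)
    flat[idx] = True
    return flat.reshape(num_timesteps, num_modalities)

# Example: 16 timesteps x 3 modalities (e.g. camera, proprioception, actions).
mask = sample_mask(16, 3, mask_ratio=0.8, rng=0)
print(mask.shape, mask.mean())  # fraction masked is close to 0.8
```

Because the mask is drawn jointly over modalities and time, any given prediction may require filling in an action from images, or an image from past actions, which is what pushes the model toward cross-modal, spatio-temporal representations.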

Note that the sensorimotor trajectory in this paper includes images, unlike Locomotion as Next Token Prediction, which excludes them.

Key ideas

Technical details