Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

March 2024

tl;dr: Enables agents to learn to act by observing unlabeled videos from the internet

Overall impression

It is valuable to effectively leverage unlabeled video (with no actions recorded at each frame) for sequential decision-making training (robotics, AD, etc.). IL is simplest when demos are labeled with corresponding actions. VPT and Genie adopt quite different approaches: Genie learns a latent action model jointly with a dynamics model, while VPT uses a non-causal IDM to pseudo-label data. In comparison, VPT seems to work well within a specific domain, while Genie seems to learn a more general representation of the action space and to generalize better across domains.

VPT leverages internet-scale unlabeled data for training agents in sequential decision-making domains. It extends imitation learning in a semi-supervised fashion: a small amount of labeled data trains an inverse dynamics model (IDM), which then labels the rest. The resulting agents can perform complex tasks in Minecraft, achieving near or at human-level performance in some cases.

Given a fixed annotation budget, what is the most efficient way to spend the money? The answer provided by VPT is to train a non-causal autolabeling system and then use it to pseudo-label unlabeled data. This is widely used in engineering practice, such as in autonomous driving. The 2k hours of labeled data only cost $2000 or less to collect, and unlock the full potential of massive unlabeled online data for use in BC.
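The pipeline above can be sketched as follows. This is a toy illustration, not the paper's code: the IDM and BC "models" are trivial lookup stubs standing in for real networks, and all function names are mine.

```python
def train_idm(labeled_clips):
    """Fit a non-causal inverse dynamics model on a small labeled set
    of (frames, actions) pairs. Stub: memorize frame -> action."""
    table = {}
    for frames, actions in labeled_clips:
        for f, a in zip(frames, actions):
            table[f] = a
    # A real IDM predicts the action at frame t from past AND future frames.
    return lambda frames: [table.get(f, "noop") for f in frames]


def pseudolabel(idm, unlabeled_clips):
    """Label a large unlabeled corpus with IDM-predicted actions."""
    return [(frames, idm(frames)) for frames in unlabeled_clips]


def behavior_cloning(pseudo_labeled):
    """Stand-in for causal BC training on the pseudo-labeled data."""
    policy = {}
    for frames, actions in pseudo_labeled:
        for f, a in zip(frames, actions):
            policy[f] = a
    return policy


# Small labeled set (~2k hours in the paper), huge unlabeled set (~70k hours).
labeled = [(["f1", "f2"], ["jump", "attack"])]
unlabeled = [["f1", "f2", "f3"]]

idm = train_idm(labeled)
policy = behavior_cloning(pseudolabel(idm, unlabeled))
```

The key design choice: annotation money goes into the easy (non-causal) labeling problem, while the hard (causal) acting problem is solved with free internet-scale data.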

It is MUCH easier (by 2 orders of magnitude) to predict an action with future information in a non-causal fashion (the IDM), i.e., with hindsight, than to do it causally (the VPT policy). This power of hindsight can be heavily leveraged in autolabeling.
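The causal/non-causal distinction boils down to the attention mask if both models are transformers over a window of frames (an assumption for this sketch; variable names are illustrative):

```python
import numpy as np

T = 5  # window of T frames

# Causal policy (the VPT agent): each frame attends only to the past.
causal_mask = np.tril(np.ones((T, T), dtype=bool))

# Non-causal IDM: each frame attends to past AND future (hindsight).
noncausal_mask = np.ones((T, T), dtype=bool)

# Predicting the action at frame t=2: the IDM also sees frames 3 and 4,
# e.g. where the camera ends up after a mouse movement.
t = 2
print(causal_mask[t])     # [ True  True  True False False]
print(noncausal_mask[t])  # [ True  True  True  True  True]
```

Seeing the effect of an action makes inferring that action far easier than choosing it in advance, which is why the IDM needs so much less data.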

The use of an IDM to predict actions is very much like that in DriveGAN for action-consistency analysis.

This is NOT a world model (as it does not predict how the world evolves), but a giant end-to-end IL agent. It is the first to report non-zero success rates on crafting a diamond pickaxe, very deep into the technology tree (24k action steps in).

For a foundation model for AD, actions need to be trained in the pretraining stage to incorporate knowledge, or to learn action priors. Another way is to come up with an architecture with a flexible input format that can deal with missing modalities (some labeled data and some unlabeled data).

Concurrently there is MineDojo, the NeurIPS best-paper winner, which leverages advances in LLMs and RAG to develop an agent. These two studies are complementary, and VPT is much closer to a production system (it has a strong industrial-engineering flavor).

Key ideas

Technical details