Humanoid Locomotion as Next Token Prediction

March 2024

tl;dr: A motion controller based on next token prediction of sensorimotor tokens for bipedal humanoid locomotion. The most obvious advantage over previous methods is scalability.

Overall impression

The paper tackles humanoid control, specifically humanoid locomotion (standing upright and moving the legs), as an end-to-end control problem. Sequences of sensory observations and motor actions make up sensorimotor trajectories, which play the role of sentences of the physical world. Note that NO images or perception are involved, only streams of relatively sparse, structured data.

This paper, alongside RT1, signals a new era of big data with transformers as the control policy. Model-based control –> DRL (too fragile under OOD conditions, complex curriculum) –> Transformers

A causal transformer is trained with imitation learning via autoregressive prediction of sensorimotor trajectories. The input to the autoregressive model is a sequence of token pairs, each consisting of an observation (joint encoders, IMUs) and an action (motor commands).
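The data layout behind this kind of autoregressive training can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the helper names (`build_steps`, `shift_for_next_token`, `causal_mask`) are hypothetical, and concatenating each observation with its action into one per-step vector is an assumption made for simplicity.

```python
from typing import List, Sequence, Tuple

Step = List[float]  # one timestep: observation values followed by action values


def build_steps(observations: Sequence[Sequence[float]],
                actions: Sequence[Sequence[float]]) -> List[Step]:
    """Pair each observation with its action (assumed layout: obs then act)."""
    if len(observations) != len(actions):
        raise ValueError("need exactly one action per observation")
    return [list(o) + list(a) for o, a in zip(observations, actions)]


def shift_for_next_token(steps: Sequence[Step]) -> Tuple[List[Step], List[Step]]:
    """Inputs are steps 0..T-2, targets are steps 1..T-1 (next-token prediction)."""
    if len(steps) < 2:
        raise ValueError("need at least two steps to form a prediction target")
    return list(steps[:-1]), list(steps[1:])


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


obs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # toy joint-encoder / IMU readings
act = [[1.0], [2.0], [3.0]]                   # toy motor commands
inputs, targets = shift_for_next_token(build_steps(obs, act))
mask = causal_mask(len(inputs))
```

The causal mask is what makes the transformer "causal": the prediction at step t can only depend on steps up to t, so the same model can be rolled out one step at a time at deployment.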

Robotics data differs from language in that it is naturally multimodal and high-dimensional. To scale up training, handling missing modalities during training is unavoidable.

It rolls out observations and actions jointly, and in this sense it is a world model of sensorimotor input. Joint state-action prediction yields better results than training with action-only prediction. The joint prediction task forces the model to learn richer representations of the world that are beneficial for action prediction. –> This is why we need a World Model.
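The difference between the joint objective and the action-only objective can be written down as a small sketch. The mean-squared-error form, the function names, and the `obs_weight` parameter are assumptions for illustration; the notes do not specify the exact loss used in the paper.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def sensorimotor_loss(pred_obs: Sequence[float], pred_act: Sequence[float],
                      tgt_obs: Sequence[float], tgt_act: Sequence[float],
                      obs_weight: float = 1.0) -> float:
    """Action loss plus a weighted observation loss.

    obs_weight = 0.0 recovers action-only training; obs_weight > 0 also asks
    the model to predict the next observation (the world-model part).
    """
    return mse(pred_act, tgt_act) + obs_weight * mse(pred_obs, tgt_obs)


joint = sensorimotor_loss([1.0, 2.0], [0.5], [1.0, 4.0], [0.0])
action_only = sensorimotor_loss([1.0, 2.0], [0.5], [1.0, 4.0], [0.0], obs_weight=0.0)
```

With `obs_weight > 0` the gradient carries information about how the observations evolve, which is one way to read the claim that joint prediction leads to richer representations.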

As the base models get exponentially more expensive to train, all researchers (no matter what institution you are at) will face the same engineering constraint: there is only enough resources to train the biggest model once. All post-training capabilities need to be derived from that base model, and because it’s hard to anticipate what the downstream tasks look like, you must prepare the base model for all possible tasks. In other words, your foundation model’s training objective should be for the full generative model of the data, such as an autoregressive next-token predictor (e.g. GPT) or a diffusion process (e.g. a video generative model like Sora), or some combination of the two. If you throw your base model budget on a conditional density modeling problem, e.g. “predict all the robot actions from the video”, it might not be a good base model for many tasks that you might care about later. This only becomes more true as the cost of the base model grows. (Quoted from "All Roads Lead to Robots")

The model can transfer to the real world when trained with ONLY 27 hours of data. Another interesting fact is that the transformer-based policy is smoother and more accurate than the RL policy, even though the model is trained on trajectories produced by this RL policy. (The student surpasses the teacher? Why?)

Locomotion as next token prediction handles missing modalities with masked modeling, Genie with a LAM (latent action model), and VPT with an IDM (inverse dynamics model).
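One common way to implement masked modeling for a missing modality is sketched below, under stated assumptions: trajectories without recorded actions get a placeholder in the action slot, and those slots are excluded from the loss. The names `MASK_VALUE`, `fill_missing_actions`, and `masked_action_loss` are hypothetical, and a constant placeholder stands in for what would normally be a learned mask token.

```python
from typing import List, Optional, Sequence, Tuple

MASK_VALUE = 0.0  # assumption: stand-in for a learned mask-token embedding


def fill_missing_actions(actions: Sequence[Optional[Sequence[float]]],
                         action_dim: int) -> Tuple[List[List[float]], List[bool]]:
    """Replace missing actions (None) with a placeholder and record which are real."""
    filled: List[List[float]] = []
    present: List[bool] = []
    for act in actions:
        if act is None:
            filled.append([MASK_VALUE] * action_dim)
            present.append(False)
        else:
            filled.append(list(act))
            present.append(True)
    return filled, present


def masked_action_loss(pred_actions: Sequence[Sequence[float]],
                       target_actions: Sequence[Sequence[float]],
                       present: Sequence[bool]) -> float:
    """Mean squared error over the steps whose action was actually recorded."""
    total, count = 0.0, 0
    for pred, tgt, ok in zip(pred_actions, target_actions, present):
        if not ok:
            continue  # masked slot: contributes nothing to the loss
        total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
        count += len(tgt)
    return total / count if count else 0.0


actions = [[1.0, 2.0], None, [3.0, 4.0]]          # middle step has no action
filled, present = fill_missing_actions(actions, action_dim=2)
preds = [[1.0, 2.0], [9.0, 9.0], [3.0, 6.0]]      # toy model outputs
loss = masked_action_loss(preds, filled, present)
```

The design point is that observation-only data still contributes a training signal through the observation tokens, while the arbitrary prediction at the masked step (here `[9.0, 9.0]`) is never penalized.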

Key ideas

Technical details


Notes taken during tech sharing with co-author