February 2024
tl;dr: World model capable of multi-future video generation for autonomous driving.
Overall impression
A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle’s actions as the world evolves. One possible solution is to learn a world model: a predictive model of the future that learns a general representation of the world in order to understand the consequences of its actions (in other words, it captures expected future events). World modeling has been used as a pretraining task to learn a compact and general representation in a self-supervised way.
GAIA-1’s output is still limited to the video domain. The input can be conditioned on actions, making it a world model. In contrast, the follow-up work Lingo-2 can output actions. –> Yet Lingo-2 is not built strictly on top of GAIA-1.
Note that some generative models excel at generating visually convincing content, but they may fall short in learning representations of the evolving world dynamics that are crucial for precise and robust decision making in complex scenarios. –> Sora
Why are world models useful?
- The representation learned by a world model can significantly accelerate convergence for RL (and hopefully for IL as well).
- A world model can enable look-ahead search by imagining the outcomes of future actions (counterfactual reasoning); see the sketch after this list.
- It improves the sample efficiency of RL (as each real-world action may be costly) by acting as a neural simulator of the environment. –> This needs another agent (which can be initialized from the WM representation) to take in the generated environment and produce actions.
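
A minimal sketch of this look-ahead idea, assuming a hypothetical `world_model(state, action)` one-step rollout interface and a hypothetical `reward_fn` scorer (neither is provided by GAIA-1, whose output stays in the video domain):

```python
def lookahead_plan(world_model, reward_fn, state, candidate_plans, horizon=10):
    """Pick the action sequence whose imagined rollout scores best.

    world_model(state, action) -> next imagined state  (hypothetical interface)
    reward_fn(state)           -> scalar desirability  (hypothetical)
    """
    best_plan, best_return = None, float("-inf")
    for plan in candidate_plans:                         # counterfactual futures
        sim_state, total = state, 0.0
        for action in plan[:horizon]:
            sim_state = world_model(sim_state, action)   # imagine one step ahead
            total += reward_fn(sim_state)
        if total > best_return:
            best_plan, best_return = plan, total
    return best_plan
```
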
This technical report includes tons of training details, reminiscent of the Llama series of papers.
It does not explicitly predict actions; this is improved upon by ADriver-I.
Key ideas
- Overall performance
- GAIA-1 can perform video rollout based on a video prompt, or video generation purely from a text prompt.
- GAIA-1 can reason about high-level structures and contextual awareness, and is able to generate multimodal behaviors by rolling out multiple futures.
- Architecture: a video encoder (tokenizer), a world model, and a video diffusion decoder. And of course the text and action encoders.
- The world model is an autoregressive transformer that predicts the next image token conditioned on past image, text, and action tokens (see the sketch after this list).
- The world model reasons about the scene’s high-level components and dynamics. This aligns with Yann LeCun’s famous argument for performing prediction in latent space, not pixel space.
- The video decoder translates the latent tokens back to high-quality video, for interpretability and also for supervision of the representation.
- Dataset: different sampling strategies for different components
- 200 days, 25 Hz data, 400 M images.
- For the tokenizer, balance over lat, long, and weather conditions.
- For the WM and decoder, balance over lat, long, weather conditions, steering behavior categories, and speed behavior categories, to ensure the dynamics of different behaviors are captured and sufficiently modeled.
- Model training: 3 components trained separately
- Image tokenizer: 4x32 GPU-days
- Text tokenizer: a pretrained T5 generates 32 tokens per time step (what?), which are then mapped to d dimensions.
- World model: 15x64 GPU-days
- Cross-entropy loss to predict the next token (out of the 2^13-entry vocabulary/codebook). Predicts at 6.25 Hz, temporally upsampled later by the decoder.
- Video decoder: 15x32 GPU-days
- Trained on both image and video tasks to balance per-frame token information against temporal consistency.
- Multi-task learning (MTL)
- random token dropout to increase generalization
- Inference
- Sampling: top-k sampling. Argmax sampling gets stuck in repetitive loops.
- Text conditioning: a negative prompting technique to enhance text conditioning. The guidance introduces extra hyperparameters to tune.
- Scaling
- World modeling by predicting the next discrete image token exhibits a clear scaling law, similar to LLMs.
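
A minimal sketch of this world-model interface in PyTorch. Only the 2^13 codebook, the next-token cross-entropy loss, top-k sampling, and the negative-prompt guidance idea come from the report; the module name, layer sizes, and the exact guidance formula are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 2 ** 13  # size of the discrete image-token codebook (from the report)

class TinyWorldModel(nn.Module):
    """Toy stand-in for the autoregressive world model (sizes are made up)."""
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, d_model)
        self.cond_proj = nn.Linear(d_model, d_model)  # placeholder for text/action conditioning
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens, cond):
        # tokens: (B, T) past image tokens; cond: (B, T, d_model) text/action features
        # (positional embeddings omitted for brevity)
        x = self.tok_emb(tokens) + self.cond_proj(cond)
        T = tokens.size(1)
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))  # (B, T, VOCAB) next-token logits

def training_step(model, tokens, cond):
    # Cross-entropy on next-token prediction.
    logits = model(tokens[:, :-1], cond[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))

@torch.no_grad()
def sample_next(model, tokens, cond_pos, cond_neg, k=50, scale=1.5, temperature=1.0):
    # Guidance-style combination of positive and negative prompts (assumed form),
    # followed by top-k sampling; pure argmax tends to fall into repetitive loops.
    logits_pos = model(tokens, cond_pos)[:, -1]
    logits_neg = model(tokens, cond_neg)[:, -1]
    logits = (logits_neg + scale * (logits_pos - logits_neg)) / temperature
    vals, idx = torch.topk(logits, k, dim=-1)
    return idx.gather(-1, torch.multinomial(F.softmax(vals, dim=-1), 1))
```
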
Technical details
- The validation set is geofenced. Comparing the geofenced and ungeofenced validation scores monitors overfitting and generalization.
- The FLOP count per training token is roughly 6 x n_param: 2 x n_param for the forward pass and 4 x n_param for the backward pass (see the sketch after this list).
- Why VQ-VAE for image tokenizer?
- A U-Net encodes images into features, which are quantized by nearest-neighbor lookup against a discrete learnable codebook.
- To compress the information and reduce the sequence length needed to describe the input data.
- To guide compression toward meaningful representations and ignore high-frequency signals: a pretrained DINO model is used to align features (distillation). See the sketches after this list.
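
As a quick sanity check of the 6 x n_param FLOP rule above (the model and token counts below are illustrative, not GAIA-1’s actual budget):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    # ~2 * n_params FLOPs per token for the forward pass, ~4 * n_params for the
    # backward pass, so ~6 * n_params per training token in total.
    return 6.0 * n_params * n_tokens

# Illustrative: a 1B-parameter model trained on 100B tokens costs ~6e20 FLOPs.
print(f"{training_flops(1e9, 100e9):.1e}")  # 6.0e+20
```
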
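And a minimal sketch of the nearest-neighbor quantization step against a learnable codebook, assuming PyTorch; the U-Net encoder, the codebook/commitment losses, and the DINO distillation term are omitted:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=2 ** 13, dim=256):  # dim is an assumed feature size
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # discrete learnable codebook

    def forward(self, z):
        # z: (B, N, dim) encoder features -> nearest codebook entries (discrete tokens)
        sq_dist = (z.pow(2).sum(-1, keepdim=True)
                   - 2 * z @ self.codebook.weight.t()
                   + self.codebook.weight.pow(2).sum(-1))       # (B, N, num_codes)
        idx = sq_dist.argmin(dim=-1)                            # (B, N) token ids
        z_q = self.codebook(idx)                                # quantized features
        z_q = z + (z_q - z).detach()   # straight-through estimator for the encoder grad
        return z_q, idx
```
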
Notes
- How good is the geometric consistency? Can we reconstruct the static environment generated by GAIA-1 and use it to validate the geometric consistency?
- Papers to read