March 2024
tl;dr: Multimodal world model via masked token prediction.
Overall impression
The model takes in a variety of modalities such as image/video, text, and actions, and generates videos conditioned on these multimodal prompts.
World models hold great promise for learning motion and physics in the general world, which is essential for coherent and reasonable video generation. WorldDreamer draws strong inspiration from VideoPoet and adds action conditioning on top of it, essentially turning VideoPoet into a world model.
WorldDreamer seems to be an extension of DriveDreamer. Yet, disappointingly, WorldDreamer feels unfinished and rushed to release on arXiv, without much comparison with contemporary work. The paper is also heavily inspired by MaskGIT, especially the masked token prediction and parallel decoding.
Key ideas
- Architecture
  - Encoder
    - Vision: VQ-GAN, vocab = 8192
    - Text: pretrained T5, similar to GAIA-1.
    - Action: MLP
    - Text and action embeddings can be missing.
  - Masked prediction
  - Decoder: parallel decoding
- Training with masks; see the training sketch after this list.
- Dataset: triplets of (visual, text, action); data with missing modalities is also supported.
- Inference: parallel decoding
  - Diffusion: requires ~30 denoising steps
  - Autoregressive: needs ~200 steps to iteratively predict the next token
  - Parallel decoding: video generation in ~10 steps
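
To make the masked-prediction setup concrete, here is a minimal PyTorch-style sketch of training with masks under the assumptions above (VQ-GAN visual tokens with an 8192 vocab, frozen T5 text features, MLP-encoded actions, with text/action optionally missing). All module names, dimensions, and the way conditioning embeddings are prepended to the visual token sequence are illustrative assumptions, not the paper's implementation; positional and spatial-temporal structure is omitted for brevity.

```python
# Illustrative sketch only: names, shapes, and the conditioning scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VISUAL_VOCAB = 8192          # VQ-GAN codebook size
MASK_ID = VISUAL_VOCAB       # extra id reserved for the [MASK] token


class WorldModelSketch(nn.Module):
    """Transformer that predicts masked visual tokens, optionally conditioned on
    text (frozen T5 features) and actions (MLP-encoded)."""

    def __init__(self, d_model=512, text_dim=768, action_dim=16, n_layers=8):
        super().__init__()
        self.tok_emb = nn.Embedding(VISUAL_VOCAB + 1, d_model)   # +1 for [MASK]
        self.text_proj = nn.Linear(text_dim, d_model)            # T5 features -> model dim
        self.action_mlp = nn.Sequential(                          # raw actions -> model dim
            nn.Linear(action_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VISUAL_VOCAB)

    def forward(self, vis_tokens, text_feat=None, actions=None):
        # vis_tokens: (B, N) discrete VQ-GAN ids, possibly containing MASK_ID
        x = self.tok_emb(vis_tokens)
        cond = []
        if text_feat is not None:    # text prompt may be absent
            cond.append(self.text_proj(text_feat))
        if actions is not None:      # action prompt may be absent
            cond.append(self.action_mlp(actions))
        if cond:
            x = torch.cat(cond + [x], dim=1)
        logits = self.head(self.backbone(x))
        return logits[:, -vis_tokens.size(1):]   # keep only the visual positions


def masked_prediction_loss(model, vis_tokens, text_feat=None, actions=None, mask_ratio=0.5):
    """Randomly mask a fraction of visual tokens and train the model to recover them."""
    mask = torch.rand(vis_tokens.shape, device=vis_tokens.device) < mask_ratio
    corrupted = vis_tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted, text_feat, actions)
    return F.cross_entropy(logits[mask], vis_tokens[mask])   # loss on masked positions only
```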
Technical details
- The key assumption underlying the effectiveness of parallel decoding is a Markovian property: many tokens are conditionally independent given the other tokens (from MaskGIT and Muse).
- PySceneDetect is used to detect scene switches (see the snippet after this list).
- The idea of using a masked language model for image generation was first proposed in MaskGIT, then extended by Muse to text-to-image generation. During training, MaskGIT is optimized on a proxy task similar to BERT's masked-token prediction. At inference time, MaskGIT adopts a novel non-autoregressive decoding method to synthesize an image in a constant number of steps (see the decoding sketch after this list).
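
As a side note on data preparation, a minimal PySceneDetect call to find scene cuts looks like the following; the video path is a placeholder.

```python
# Detect scene cuts so that training clips do not span a scene switch.
# "video.mp4" is a placeholder path.
from scenedetect import detect, ContentDetector

scenes = detect("video.mp4", ContentDetector())
for start, end in scenes:
    print(f"scene: {start.get_timecode()} -> {end.get_timecode()}")
```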
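
To make the constant-step, non-autoregressive decoding concrete, here is a sketch of MaskGIT-style parallel decoding with a cosine mask schedule, reusing the hypothetical WorldModelSketch interface from the training sketch above. The schedule and confidence-based selection follow MaskGIT; the code itself is an illustrative assumption, not WorldDreamer's implementation.

```python
import math
import torch


@torch.no_grad()
def parallel_decode(model, num_tokens, steps=10, text_feat=None, actions=None,
                    mask_id=8192, device="cpu"):
    """Fill in an all-[MASK] canvas over a fixed number of steps, committing the
    most confident predictions first (MaskGIT-style cosine schedule)."""
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = model(tokens, text_feat, actions)        # (1, N, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)   # per-token confidence / argmax
        still_masked = tokens.eq(mask_id)
        # Already-committed tokens are never re-masked.
        conf = conf.masked_fill(~still_masked, float("inf"))

        # Cosine schedule: fraction of the canvas left masked after this step.
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        num_keep_masked = int(num_tokens * mask_ratio)
        if num_keep_masked == 0:
            # Final step: accept predictions for all remaining masked positions.
            tokens = torch.where(still_masked, pred, tokens)
            break

        # Keep the lowest-confidence positions masked, commit the rest.
        cutoff = conf.topk(num_keep_masked, largest=False).values[..., -1:]
        commit = still_masked & (conf > cutoff)
        tokens = torch.where(commit, pred, tokens)
    return tokens
```

MaskGIT additionally samples tokens and perturbs the confidences with annealed noise during decoding; the sketch uses greedy argmax to stay short.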
Notes
- Questions and notes on how to improve/revise the current work