March 2024
tl;dr: World model for autonomous driving, conditioned on structured traffic constraints.
The first world model built from real-world driving scenarios, contemporary with GAIA-1. The way the dynamics are controlled, however, is different.
Typically the controllability of a world model is only qualitative, as it is hard to do (close to) pixel-accurate generation with diffusion models. DriveDreamer alleviates this problem and reaches near pixel-accurate control with structured traffic constraints (vectorized wireframes of perception results, or perception vectors for short). The idea may have been inspired by Align Your Latents.
The model takes in video, text, action and perception vectors, and rolls out videos and actions. It can be seen as a world model as the video generation is conditioned on action.
The dynamics of the world model are actually controlled by a simplistic RNN model, the ActionFormer, in the latent space of the perception vectors. This is quite different from GAIA-1 and Genie, where the dynamics are learned by compressing large amounts of video data.
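To make the "RNN rolling out dynamics in latent space" idea concrete, here is a minimal sketch, not the authors' ActionFormer. It assumes a GRU-style cell, a hidden state standing in for the latent of the perception vectors, and a linear head that predicts the next action. The class name `LatentGRU`, the dimensions, the (yaw rate, speed) action layout and the random, untrained weights are all hypothetical and chosen for illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LatentGRU:
    """Hypothetical stand-in for a recurrent latent dynamics model.

    The hidden state h plays the role of the perception-vector latent;
    weights are random (untrained), so only the data flow is meaningful.
    """

    def __init__(self, latent_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.normal(0.0, scale, (3, latent_dim, action_dim))
        self.U = rng.normal(0.0, scale, (3, latent_dim, latent_dim))
        self.b = np.zeros((3, latent_dim))
        # Linear head mapping the latent to the next action (assumption).
        self.W_a = rng.normal(0.0, scale, (action_dim, latent_dim))

    def step(self, h, a):
        """One GRU update of the latent h given the current action a."""
        z = sigmoid(self.W[0] @ a + self.U[0] @ h + self.b[0])
        r = sigmoid(self.W[1] @ a + self.U[1] @ h + self.b[1])
        h_tilde = np.tanh(self.W[2] @ a + self.U[2] @ (r * h) + self.b[2])
        return (1.0 - z) * h + z * h_tilde

    def rollout(self, h0, a0, steps):
        """Autoregressively predict future latents and actions."""
        h, a = h0, a0
        latents, actions = [], []
        for _ in range(steps):
            h = self.step(h, a)   # next latent, conditioned on the action
            a = self.W_a @ h      # next action, predicted from the latent
            latents.append(h)
            actions.append(a)
        return np.stack(latents), np.stack(actions)


model = LatentGRU(latent_dim=16, action_dim=2)
h0 = np.zeros(16)              # initial latent of the perception vectors
a0 = np.array([0.0, 5.0])      # hypothetical (yaw rate, speed)
latents, actions = model.rollout(h0, a0, steps=8)
print(latents.shape, actions.shape)  # (8, 16) (8, 2)
```

In a setup like this, the rolled-out latents would be the conditioning signal handed to the video generator at each future step, which is why the control lives in the structured latent rather than in the pixels.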
The model is mainly focused on single-camera scenarios, but the authors demonstrated in the appendix that it can be easily extended to multi-camera scenarios. Note that the first solid multicam work is Drive WM (Drive into the Future).
WorldDreamer from the same group seems to be the extension of DriveDreamer. Disappointingly, WorldDreamer seems unfinished and rushed onto arXiv, without much comparison with contemporary work.