Learning-Deep-Learning

Drive-WM: Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving

February 2024

tl;dr: First consistent, controllable, multiview video generation for autonomous driving.

Overall impression

The main contributions of the paper are multiview-consistent video generation, and the application of this world model to planning through tree search and OOD planning recovery. –> DriveDreamer also briefly discussed this topic in its appendix.
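The tree-search planning idea can be sketched as follows: roll out candidate action sequences through the world model's imagination and commit to the first action of the best-scoring trajectory. This is a minimal toy sketch, assuming a scalar state and a hand-written reward; all names (`WorldModel`, `reward`, `tree_search`) are illustrative assumptions, not Drive-WM's actual API.

```python
import itertools

class WorldModel:
    """Toy stand-in: state is a 1-D position, action is a velocity.
    A real driving world model would generate the next multiview frames."""
    def predict(self, state, action):
        return state + action

def reward(state, goal=10.0):
    # Hypothetical scoring: closer to the goal is better.
    return -abs(goal - state)

def tree_search(model, state, actions=(-1.0, 0.0, 1.0), depth=3):
    """Exhaustively expand the action tree and return the best first action."""
    best_score, best_first = float("-inf"), None
    for seq in itertools.product(actions, repeat=depth):
        s, total = state, 0.0
        for a in seq:  # imagined rollout inside the world model
            s = model.predict(s, a)
            total += reward(s)
        if total > best_score:
            best_score, best_first = total, seq[0]
    return best_first

print(tree_search(WorldModel(), state=0.0))  # → 1.0 (drive toward the goal)
```

In practice the branching factor is kept small (a few candidate trajectories per step) since each rollout is a full video generation pass.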

Drive-WM generates future videos conditioned on past videos, text, actions, and vectorized perception results: x_t+1 ~ f(x_t, a_t). It does NOT predict actions. In this way it is very similar to GAIA-1, but extends GAIA-1 with multicam video generation. It is also conditioned on vectorized perception output, like DriveDreamer.
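The conditional interface x_t+1 ~ f(x_t, a_t) can be made concrete with a small stub. This is a sketch only; the class, field, and method names are assumptions for illustration, not Drive-WM's real interface.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Conditioning:
    past_frames: List[str]                               # multiview video frames (placeholders)
    text: str = ""                                       # e.g. "turn left at the intersection"
    actions: List[float] = field(default_factory=list)   # ego actions a_t
    bev_layout: str = ""                                 # vectorized perception results (map, boxes)

class DrivingWorldModel:
    def sample_next(self, cond: Conditioning) -> str:
        # A real model would run a multiview video diffusion step here;
        # this stub just tags the latest frame so the data flow is visible.
        return f"frame_after({cond.past_frames[-1]})"

model = DrivingWorldModel()
cond = Conditioning(past_frames=["f0", "f1"], actions=[0.5])
print(model.sample_next(cond))  # → frame_after(f1)
```

Note that actions appear only as conditioning inputs, which is why the model cannot be used directly as a policy.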

[Def] In a broad sense, a world model is a model that learns a general representation of the world and predicts future world states resulting from a sequence of actions. In the context of autonomous driving and robotics, a world model is a video prediction model conditioned on past video and actions. In this sense, Sora generates videos conditioned on text and video inputs; since qualitative actions can be expressed as text, Sora can also qualify as a world model. The usage of a world model is twofold: to act as a neural simulator (for closed-loop training), and to act as a strong feature extractor for policy finetuning.

Video prediction can be regarded as a special form of video generation, conditioned on past observations. If the video prediction can be controlled by actions (or by their qualitative form as text), then the video prediction model is a world model.

For a (world) model that does not predict actions, it may act as a neural simulator, but it may not learn a representation rich enough to be finetuned for policy prediction.

It seems that the world model heavily depends on external crutches such as view factorization and BEV layout. It does NOT learn geometric consistency through large-scale model training like GAIA-1 or Sora.

Key ideas

Technical details

Notes