February 2024
tl;dr: LLM-based e2e driver to predict future action and vision autoregressively.
The paper proposes an interesting problem: jointly rolling out action and vision based on past action-vision pairs. In a way it performs e2e planning by directly predicting control signals (steer angle and ego speed), like TCP and Transfuser, producing planning results without constructing any scene representation (at least not explicitly).
A key strength of this paper is that it jointly predicts action and video. In contrast, GAIA-1 only generates future videos and ignores control signal prediction, while DriveDreamer conditions future video generation heavily on prior information such as structured/vectorized perception results (an approach Panacea takes one step further).
The authors claim that the paper creates “infinite driving”. However, the prediction is done in a piecewise fashion: actions are predicted by a world model initialized from a pretrained VLM, and future frames are then generated by a VDM (video diffusion model) conditioned on the predicted action. The two pieces are trained separately without joint finetuning.
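A minimal sketch of what this piecewise rollout might look like is below; the module names (`action_vlm`, `video_diffusion`) and their interfaces are my own assumptions for illustration, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class Action:
    steer_angle: float  # deg (sign convention assumed)
    speed: float        # m/s

def rollout(action_vlm, video_diffusion, frames, actions, horizon=8):
    """Piecewise autoregressive rollout (hypothetical interface).

    action_vlm:      world model initialized from a pretrained VLM; maps the
                     history of (frame, action) pairs to the next action.
    video_diffusion: VDM that renders the next frame conditioned on the
                     current frame and the predicted action.
    """
    frames, actions = list(frames), list(actions)
    for _ in range(horizon):
        # Step 1: predict the next control signal from the history.
        next_action = action_vlm.predict(frames, actions)
        # Step 2: generate the next frame conditioned on that action.
        next_frame = video_diffusion.generate(frames[-1], next_action)
        frames.append(next_frame)
        actions.append(next_action)
    return frames, actions
```

The two modules only communicate through the predicted action, which is why the lack of joint finetuning matters.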
Another less clean aspect of the design is that the use of language is a bit too heavy. There are a lot of tricks to convert the control signals to text (such as converting floats to integers and controlling the number of digits), especially downgrading precise numbers to qualitative descriptions. This is not aligned with the goal of precise AD.
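For concreteness, the kind of conversion meant here looks roughly like the sketch below; the bucket boundaries and wording are made up for illustration and do not reflect the paper's exact prompt format.

```python
def action_to_text(steer_angle_deg: float, speed_mps: float) -> str:
    """Convert control signals to text, losing precision along the way."""
    # Trick 1: round floats to integers / a fixed number of digits.
    steer_int = int(round(steer_angle_deg))
    speed_int = int(round(speed_mps))

    # Trick 2: downgrade precise numbers to qualitative descriptions
    # (bucket boundaries are illustrative, not from the paper).
    if abs(steer_angle_deg) < 2:
        steer_desc = "keep straight"
    elif steer_angle_deg > 0:
        steer_desc = "turn left slightly" if steer_angle_deg < 10 else "turn left"
    else:
        steer_desc = "turn right slightly" if steer_angle_deg > -10 else "turn right"

    return f"steer {steer_int} deg, speed {speed_int} m/s, {steer_desc}"
```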
The biggest drawback of this paper, as I see it, is the action conditioning of the VDM. The VDM is NOT conditioned on precise quantitative action parameters but rather relies on qualitative text descriptions. The lack of fine-grained controllability makes it less powerful as a neural simulator. --> This needs to be improved, and quantitative metrics should be established to ensure the geometric consistency of the actions.
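One possible consistency check, sketched under my own assumptions (not proposed in the paper): integrate the commanded speed/steer into an ego trajectory with a simple kinematic bicycle model and compare it against the trajectory recovered from the generated video, where `estimate_trajectory_from_video` (e.g. a visual odometry model) is a placeholder.

```python
import math

def integrate_commands(actions, wheelbase=2.7, dt=0.5):
    """Roll commanded (steer_angle_deg, speed_mps) pairs into an ego
    trajectory using a kinematic bicycle model (parameters assumed)."""
    x, y, yaw = 0.0, 0.0, 0.0
    traj = [(x, y)]
    for steer_deg, speed in actions:
        yaw += speed / wheelbase * math.tan(math.radians(steer_deg)) * dt
        x += speed * math.cos(yaw) * dt
        y += speed * math.sin(yaw) * dt
        traj.append((x, y))
    return traj

def geometric_consistency(actions, generated_frames, estimate_trajectory_from_video):
    """Average displacement error between the commanded trajectory and the
    trajectory recovered from the generated frames (e.g. via visual odometry)."""
    cmd = integrate_commands(actions)
    vid = estimate_trajectory_from_video(generated_frames)  # placeholder assumption
    n = min(len(cmd), len(vid))
    return sum(math.dist(cmd[i], vid[i]) for i in range(n)) / n
```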