DriveVLM: The convergence of Autonomous Driving and Large Vision-Language Models

February 2024

tl;dr: A hybrid system that leverages VLMs for long-tail scenario understanding.

Overall impression

In a nutshell, DriveVLM maps vision observations into language space and asks a wise blind man (the LLM) to perform the driving task.

Urban driving is challenging due to long-tail scenario understanding, including rare objects, rare scenes, and negotiation with road agents. (Past planning stacks focus on trajectory-level actions and neglect decision-level interactions. This is what makes autonomous driving systems feel unnatural and misaligned with experienced human drivers.)

DriveVLM uses a VLM to tackle all these issues. It uses the VLM’s generalization capability to perform rare-object recognition out of the box, and uses LLM reasoning for intention-level prediction and task-level planning. The power of the VLM is harnessed through a carefully designed CoT. –> In this sense, it is still a modular design, but the system is constructed through prompting, making it more similar to modular e2e approaches such as UniAD than to monolithic e2e planning.
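To make the "modular design through prompting" concrete, here is a minimal sketch of a staged CoT pipeline: each module is a prompt over the previous stage's text output. The stage names follow the description above (scene understanding, intention-level reasoning, task-level planning); the exact prompts and the `vlm` callable are my assumptions, not the paper's implementation.

```python
def run_cot(vlm, image):
    """Staged CoT over a VLM: each module is just a prompt.

    `vlm` is a hypothetical callable `vlm(image, prompt) -> str`;
    the real system's prompts and interfaces are not specified here.
    """
    # Stage 1: describe the scene in language, including rare objects
    # that the VLM can recognize zero-shot.
    scene = vlm(image, prompt="Describe the driving scene and any critical objects.")

    # Stage 2: intention-level reasoning about each critical object.
    analysis = vlm(
        image,
        prompt=f"Scene: {scene}\nAnalyze each critical object's likely intention.",
    )

    # Stage 3: task-level planning conditioned on the prior stages.
    plan = vlm(
        image,
        prompt=f"Scene: {scene}\nAnalysis: {analysis}\n"
               "Propose a high-level driving decision and a coarse trajectory.",
    )
    return scene, analysis, plan
```

Because every intermediate output is plain text, each stage can be logged and inspected independently, which is exactly what makes this prompting-based modularity attractive.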

DriveVLM-Dual is a hybrid system that leverages the zero-shot and reasoning power of the VLM, and also supplements the VLM’s shortcomings (such as weak 3D grounding and long latency) with a traditional AD stack.
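The hybrid idea can be sketched as a fast/slow loop: a slow VLM branch produces a coarse, reasoning-heavy plan at low frequency, while the traditional stack runs every cycle and refines the latest coarse plan with precise 3D perception. This is a simplified synchronous sketch under my own assumptions (class and parameter names are hypothetical; a real system would run the VLM branch asynchronously).

```python
class DualPlanner:
    """Toy fast/slow hybrid: slow VLM branch + fast classical refiner."""

    def __init__(self, vlm_branch, classical_refiner, vlm_period=10):
        self.vlm_branch = vlm_branch              # slow, reasoning-heavy branch
        self.classical_refiner = classical_refiner  # fast, 3D-grounded branch
        self.vlm_period = vlm_period              # run the VLM every N cycles
        self.coarse_plan = None
        self.tick = 0

    def step(self, obs):
        # Refresh the coarse plan only at the VLM's (lower) frequency,
        # hiding its long latency from the control loop.
        if self.tick % self.vlm_period == 0:
            self.coarse_plan = self.vlm_branch(obs)
        self.tick += 1
        # The classical stack runs every cycle against the cached plan.
        return self.classical_refiner(obs, self.coarse_plan)
```

The design choice to keep the classical refiner in the loop at full rate is what addresses both stated weaknesses at once: latency (the VLM is off the critical path) and 3D grounding (the refiner owns the metric-space output).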

The system seems to be more explainable than e2e planning, as all VLM outputs can be logged for analysis and debugging.

The idea of the paper is very solid and production-oriented. It is also well executed and validated on a customized dataset. However, the paper is not well written, with many important details missing that hinder the understanding, let alone reproduction, of this work. See the Notes section.

Key ideas

Technical details