DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model

February 2024

tl;dr: RT-2 VLA model for autonomous driving.

Overall impression

DriveGPT4 offers one solution for end-to-end autonomous driving. It seems to be heavily inspired by RT-2, in everything from problem formulation to network architecture.

In a nutshell, it projects multimodal inputs (images and control signals) into the text domain, allowing the LLM to understand and process this multimodal data as text.

It takes in multiple single-cam images and prompts the LLM to directly output actions. It is, in a sense, e2e planning without explicit modules such as a perception stack.
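To make the "control as text" idea concrete, here is a minimal sketch of RT-2-style action tokenization: continuous control values are quantized into discrete bins and rendered as plain text so the LLM can predict them like ordinary tokens. The bin count, value ranges, and token naming below are illustrative assumptions, not the paper's actual choices.

```python
def tokenize_action(value, lo, hi, n_bins=256):
    """Map a continuous control value to a discrete bin index."""
    value = max(lo, min(hi, value))      # clamp to the valid range
    frac = (value - lo) / (hi - lo)      # normalize to [0, 1]
    return min(int(frac * n_bins), n_bins - 1)

def detokenize_action(bin_idx, lo, hi, n_bins=256):
    """Recover the bin-center value from a bin index."""
    return lo + (bin_idx + 0.5) / n_bins * (hi - lo)

# Hypothetical example: speed in [0, 30] m/s and steering in [-1, 1] rad
# become a short text string appended to the LLM's answer.
speed_tok = tokenize_action(12.5, 0.0, 30.0)
steer_tok = tokenize_action(-0.2, -1.0, 1.0)
action_text = f"<speed_{speed_tok}> <steer_{steer_tok}>"
print(action_text)  # → <speed_106> <steer_102>
```

Detokenizing recovers the value only up to bin resolution (here ~0.12 m/s for speed), which is the usual trade-off of discretizing actions into a text vocabulary.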

There may be many practical issues with deploying such a system into production. See the Notes section for details.

Key ideas

Technical details