CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving

September 2025

tl;dr: Adapting an LLM to autonomous driving VLA by building a VLA dataset.

Overall impression

Language serves two goals: 1/ injecting diverse world knowledge and 2/ enabling advanced reasoning capability, both aimed at solving rare and complex scenarios.

Selling points of the paper

Drawbacks of this paper

The paper reported two interesting behaviors: 1/ subpar spatial reasoning capability (left mistaken for right, etc.) and hallucination; 2/ consistency between the language and action modalities (action errors caused by CoT errors).
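One way to probe the second observation is to bucket trajectory error by whether the chain-of-thought (CoT) caption was correct; if language and action are consistent, action error should be markedly higher when the CoT is wrong. A minimal sketch below, assuming hypothetical per-sample annotations (`cot_correct`, `ade`) that are illustrative and not from the paper.

```python
import statistics

# Hypothetical per-sample records: whether the CoT caption was judged
# correct, and the trajectory's average displacement error (ADE, meters).
# Field names and values are made up for illustration.
samples = [
    {"cot_correct": True,  "ade": 0.4},
    {"cot_correct": True,  "ade": 0.5},
    {"cot_correct": False, "ade": 1.8},
    {"cot_correct": False, "ade": 2.1},
]

def mean_ade(records):
    return statistics.mean(r["ade"] for r in records)

good_cot = [r for r in samples if r["cot_correct"]]
bad_cot = [r for r in samples if not r["cot_correct"]]

# Consistency between modalities shows up as a large gap here.
print(f"ADE with correct CoT: {mean_ade(good_cot):.2f} m")
print(f"ADE with wrong CoT:   {mean_ade(bad_cot):.2f} m")
```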

Key ideas

Technical details

Notes