June 2024
tl;dr: First closed-loop world model that can output actions for autonomous driving via modification of an LLM.
This is perhaps the second world-model-driven autonomous driving system deployed in the real world, other than FSD v12. Another example is ApolloFM (from AIR Tsinghua, blog in Chinese). Lingo-2 is more like RT-2 in the sense that it piggybacks on an LLM as a starting point and adds multimodality adaptors to it. It is neither native vision nor native action, unlike GAIA-1. FSD v12 is highly speculated to be native vision and action.
Wayve calls this model a VLAM (vision-language-action model). It improves upon the previous work of Lingo-1, an open-loop driving commentator, and Lingo-1-X, which can output reference segmentations. Lingo-1-X extends the vision-language model into the VLX (vision-language-X) domain. Lingo-2 now officially dives into the new domain of decision making and includes action as the X output.
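A minimal sketch of what "piggybacking on an LLM, adding multimodality adaptors, and including action as an extra output" could look like. Everything below (module names, dimensions, the waypoint parameterization) is my own assumption for illustration, not Wayve's published architecture.

```python
import torch
import torch.nn as nn

class ToyVLAM(nn.Module):
    """Illustrative VLAM skeleton: LLM-style backbone + vision adaptor + text/action heads.
    All module names, sizes, and heads are hypothetical, not Wayve's implementation."""

    def __init__(self, llm_dim=1024, vision_dim=768, vocab_size=32000, num_waypoints=10):
        super().__init__()
        self.num_waypoints = num_waypoints
        # Multimodality adaptor: project vision features into the LLM token space
        self.vision_adaptor = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        # Stand-in for a pretrained LLM backbone (a real system would load pretrained weights)
        self.llm_backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Text head: driving commentary / Q&A, as in Lingo-1
        self.text_head = nn.Linear(llm_dim, vocab_size)
        # Action head: the "X" output -- here, (x, y) trajectory waypoints
        self.action_head = nn.Linear(llm_dim, num_waypoints * 2)

    def forward(self, vision_feats, text_embeds):
        # vision_feats: (B, N_img, vision_dim); text_embeds: (B, N_txt, llm_dim)
        vis_tokens = self.vision_adaptor(vision_feats)
        hidden = self.llm_backbone(torch.cat([vis_tokens, text_embeds], dim=1))
        text_logits = self.text_head(hidden[:, vis_tokens.size(1):])   # commentary tokens
        waypoints = self.action_head(hidden[:, -1]).view(-1, self.num_waypoints, 2)
        return text_logits, waypoints
```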
The action output from Lingo-2’s VLAM is a bit different from that of RT-2. Lingo-2 predicts trajectory waypoints (like ApolloFM) rather than low-level actions (as in FSD).
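To make the output-format distinction concrete, here is a hypothetical contrast between a waypoint-style output and a direct control-action output. Waypoints still require a downstream tracking controller to become steering/throttle commands; the names and the toy pure-pursuit step below are my own illustration, not from the Lingo-2 report.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class WaypointOutput:
    """Lingo-2 / ApolloFM-style output (illustrative): future ego-frame positions."""
    waypoints: List[Tuple[float, float]]   # e.g. 10 (x, y) points over the next few seconds

@dataclass
class ControlOutput:
    """Low-level command output (illustrative), closer to what FSD is speculated to emit."""
    steering_angle: float   # rad
    acceleration: float     # m/s^2

def waypoints_to_controls(out: WaypointOutput, wheelbase: float = 2.8) -> ControlOutput:
    """Toy pure-pursuit-style step chasing the first waypoint; a real stack would use a
    proper tracking controller (MPC, pure pursuit, etc.). This is only a sketch."""
    x, y = out.waypoints[0]
    lookahead = math.hypot(x, y) or 1e-6           # avoid division by zero
    steering = math.atan2(2.0 * wheelbase * y, lookahead ** 2)
    return ControlOutput(steering_angle=steering, acceleration=0.0)
```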
The paper claims that this is a strong first indication of the alignment between explanations and decision-making. –> Lingo-2 outputs driving behavior and textual predictions in real time, but I feel the “alignment” claim needs to be examined further.
Language opens up new possibilities for accelerating learning by incorporating a description of driving actions and causal reasoning into the model’s training. In addition, natural language interfaces could, in the future, allow users to engage in conversations with the driving model, making it easier for people to understand these systems and build trust.