Learning-Deep-Learning

Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving

June 2025

tl;dr: uses PARA-Drive as an agent-centric tokenizer to generate object-level latent tokens for an LLM to enhance AD.

Overall impression

Feeding sparse tokens from BEV perception to an LLM seems to be an effective way to leverage LLMs.

The model leverages LLMs' reasoning capability to enhance AV planning in long-tail scenarios.

The previous generation of LLM-based motion planners (e.g., GPT-Driver) formulates motion planning as a language modeling problem. It converts ego states and observations into language prompts, so performance depends on the quality and resolution of the scene description, and designing templates to textualize scenes requires extensive prompt engineering.
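A minimal sketch of what such scene textualization might look like; the field names and prompt wording below are illustrative assumptions, not GPT-Driver's actual template.

```python
# Render ego state and detected objects into a text prompt for an LLM planner.
# Field names (speed_mps, heading_deg, position_m, category) are assumptions.
def textualize_scene(ego_state: dict, objects: list) -> str:
    lines = [
        f"Ego: speed {ego_state['speed_mps']:.1f} m/s, "
        f"heading {ego_state['heading_deg']:.0f} deg.",
        "Objects:",
    ]
    for obj in objects:
        x, y = obj["position_m"]
        lines.append(
            f"- {obj['category']} at ({x:.1f}, {y:.1f}) m, "
            f"moving at {obj['speed_mps']:.1f} m/s."
        )
    lines.append("Plan the next 3-second trajectory for the ego vehicle.")
    return "\n".join(lines)


prompt = textualize_scene(
    ego_state={"speed_mps": 8.2, "heading_deg": 90},
    objects=[{"category": "pedestrian", "position_m": (5.0, 2.0), "speed_mps": 1.2}],
)
print(prompt)
```

The sketch makes the drawback concrete: everything the LLM sees is filtered through this hand-designed template, so any attribute not textualized is lost.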

Driving with LLMs from Wayve uses explicit symbolic representations to encode scene information, which are then used as tokens for LLMs. In comparison, TOKEN uses implicit object-level tokens.

The main challenge TOKEN solves is how to train an effective scene tokenizer in the low-data regime. The key answer is to leverage the perception, i.e. object-level, representation from an existing end-to-end driving solution (such as PARA-Drive).
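A minimal sketch of this idea, assuming per-object queries are already produced by a frozen end-to-end perception stack (PARA-Drive-like); the module name, dimensions, and two-layer MLP adapter are illustrative assumptions, not TOKEN's exact design.

```python
import torch
import torch.nn as nn


class ObjectTokenAdapter(nn.Module):
    """Project per-object perception queries into the LLM's embedding space."""

    def __init__(self, query_dim: int = 256, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(query_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, object_queries: torch.Tensor) -> torch.Tensor:
        # object_queries: (batch, num_objects, query_dim) from the frozen tokenizer.
        return self.proj(object_queries)  # (batch, num_objects, llm_dim)


# Usage: prepend object tokens to the embedded text prompt before the LLM.
adapter = ObjectTokenAdapter()
object_queries = torch.randn(1, 32, 256)   # stand-in for frozen PARA-Drive-style object queries
object_tokens = adapter(object_queries)    # (1, 32, 4096)
text_embeds = torch.randn(1, 64, 4096)     # stand-in for embedded text prompt
llm_inputs = torch.cat([object_tokens, text_embeds], dim=1)
```

Only the lightweight adapter (and whatever LLM fine-tuning is used) needs to be learned, which is what makes the low-data regime tractable: the heavy lifting of scene understanding is inherited from the pretrained perception stack.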

Key ideas

Technical details

Notes