RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

August 2023

tl;dr: Web-scale pretraining via VLMs improves generalization over RT-1. RT-2 is an end-to-end trained vision-language-action (VLA) model that maps robot observations directly to actions, while enjoying the common-sense reasoning of the underlying VLM.

Overall impression

A VLM grounds the input of an LLM in reality, and a VLA grounds the output of an LLM in reality. VoxPoser is similar to RT-2 in that it grounds both the input and the output of the LLM, but RT-2 is better than VoxPoser in that it is end-to-end differentiable, and it is the first model to close the data loop.

SayCan and PaLM-E only address high-level task planning and still rely on separate low-level controllers to carry out action primitives ("skills"). They only take end-to-end training up to the planning level, which makes it hard to form a closed data loop with field data without extensive human annotation.

Gato designed a new vision-language-action architecture from scratch and did not leverage the power of pretraining. Its language component is only a language model, without explicit natural-language training.

The usage of LLMs has two benefits. First, LLMs/VLMs contain a tremendous amount of common sense and have a built-in "world model". Second, the pretrained model has already amortized compute.

This builds on previous work on RT-1, but leverages the power of pretrained VLMs (PaLM-E, PaLI-X) to improve generalization over unseen cases (objects, backgrounds, and environments). The performance on seen cases is roughly the same.

Similar works that integrate pretrained VLMs into end-to-end visuomotor manipulation policies include CLIPort and MOO.

RT-2 directly outputs actions as special tokens. To avoid changing the network architecture and to reuse pretraining as much as possible, it repurposes existing tokens as action tokens (either individual number tokens or the least-used tokens in the vocabulary). This grounds the LLM output beyond high-level plans and into control actions. –> This is quite similar to the idea of extending the natural-language vocabulary in VisionLLM. Pix2seq v2 does not use a pretrained LLM and cannot output natural-language tokens.
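The token-repurposing idea can be sketched as a simple quantization scheme: each continuous action dimension is discretized into 256 bins, and each bin is mapped onto an existing token ID. The vocabulary offset and 7-DoF action bounds below are illustrative assumptions, not values from the paper; the actual token IDs depend on the tokenizer (e.g. PaLI-X reuses number tokens, PaLM-E overwrites the least-used tokens).

```python
import numpy as np

NUM_BINS = 256  # RT-2 discretizes each action dimension into 256 bins
# Hypothetical mapping: assume the last 256 IDs of a 32k vocab are the
# "least used" tokens being repurposed as action tokens.
VOCAB_SIZE = 32000
ACTION_TOKEN_OFFSET = VOCAB_SIZE - NUM_BINS

def action_to_tokens(action, low, high):
    """Map a continuous action vector to discrete action-token IDs."""
    clipped = np.clip(action, low, high)
    bins = np.round((clipped - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    return bins + ACTION_TOKEN_OFFSET

def tokens_to_action(tokens, low, high):
    """Decode action-token IDs back to continuous values (bin centers)."""
    bins = np.asarray(tokens) - ACTION_TOKEN_OFFSET
    return low + bins / (NUM_BINS - 1) * (high - low)

# Illustrative 7-DoF end-effector action: 3 translation, 3 rotation, 1 gripper
low, high = np.full(7, -1.0), np.full(7, 1.0)
a = np.array([0.1, -0.5, 0.9, 0.0, 0.3, -0.2, 1.0])
toks = action_to_tokens(a, low, high)
recovered = tokens_to_action(toks, low, high)
# Round-trip error is bounded by half a bin width per dimension.
assert np.allclose(recovered, a, atol=(high[0] - low[0]) / (NUM_BINS - 1))
```

Because the action tokens are ordinary vocabulary entries, the VLM's architecture, tokenizer, and training loss stay untouched; the action space is just another "language" the model learns to emit.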

The biggest issue with RT-2 is that it uses the cerebrum (大脑) rather than the cerebellum (小脑) or brainstem (脑干) to carry out low-level actions.

Key ideas

Technical details