
RT-1: Robotics Transformer for Real-World Control at Scale

August 2023

tl;dr: A generalist robotic control agent, powered by a data-absorbent model architecture and trained on a diverse, large-scale robotics dataset.

Overall impression

The key to foundation models and LLMs lies in open-ended, task-agnostic training. (GBB is a dead end.) RT-1 aims to be the LLM for robotic control: it takes language and vision observations as input and maps them to robot actions, learning robot policies that solve language-conditioned tasks from vision.
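To make the input/output mapping concrete, here is a minimal sketch (my reconstruction, not the authors' code) of an RT-1-style policy: a history of camera frames plus a language instruction go in, and logits over discretized action bins come out. The paper uses a FiLM-conditioned EfficientNet, TokenLearner, and a decoder-only Transformer, with 11 action dimensions (arm, base, mode) each discretized into 256 bins; the layer choices and sizes below are simplified stand-ins.

```python
# Minimal RT-1-style sketch: (images, instruction) -> discretized action bins.
# Illustrative assumptions throughout; not the paper's exact architecture.
import torch
import torch.nn as nn

class RT1Sketch(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256,
                 action_dims=11, action_bins=256):
        super().__init__()
        # Language: embed the instruction (the paper conditions the image
        # encoder on Universal Sentence Encoder embeddings via FiLM).
        self.text_embed = nn.EmbeddingBag(vocab_size, d_model)
        # Vision: a stand-in CNN for the FiLM EfficientNet backbone.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=8, stride=8),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(3),  # -> 3x3 = 9 spatial tokens per frame
        )
        # Transformer over the history of image tokens.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # One classification head per action dimension; each dimension is
        # discretized into 256 bins.
        self.action_head = nn.Linear(d_model, action_dims * action_bins)
        self.action_dims, self.action_bins = action_dims, action_bins

    def forward(self, images, instruction_token_ids):
        # images: (B, T, 3, H, W) history of camera frames
        # instruction_token_ids: (B, L) tokenized language command
        B, T = images.shape[:2]
        lang = self.text_embed(instruction_token_ids)          # (B, D)
        feats = self.image_encoder(images.flatten(0, 1))       # (B*T, D, 3, 3)
        tokens = feats.flatten(2).transpose(1, 2)              # (B*T, 9, D)
        tokens = tokens.reshape(B, T * 9, -1) + lang[:, None]  # add language
        ctx = self.transformer(tokens).mean(dim=1)             # (B, D)
        logits = self.action_head(ctx)                         # (B, dims*bins)
        return logits.view(B, self.action_dims, self.action_bins)

model = RT1Sketch()
imgs = torch.randn(2, 6, 3, 96, 96)          # 6-frame history, as in the paper
cmd = torch.randint(0, 1000, (2, 12))        # toy instruction token ids
action_logits = model(imgs, cmd)             # (2, 11, 256)
action_bins = action_logits.argmax(dim=-1)   # one discrete bin per action dim
```

Discretizing actions turns control into per-dimension classification, so training is plain cross-entropy over the 256 bins of each action dimension.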

Robotics datasets are more expensive to collect than text or image data (collection requires engineering-heavy automation, or human demonstration), so training a multitask backbone model is important. –> Synthetic data is perhaps another way?

RT-1 focuses on low-level control tasks (skills), such as picking up an apple; it is not able to do high-level task planning. In SayCan terminology, RT-1 handles the "can" part, not the "say" part. This is perhaps why RT-1 does not use a pretrained LLM/VLM, and why RT-2 may be overkill (under the assumption that RT-2 cannot do high-level task planning either). Since RT-1 does not use a pretrained LLM, it is quite similar to Gato, MILE, and the world-model papers (such as DayDreamer and the Dreamer series).

RT-1 exhibits strong performance and generalization: it succeeds on 97% of seen tasks and 75% of unseen tasks, much better generalization than Gato. A high-capacity model enables generalization, and the Transformer is such an architecture. Real-time (RT) performance also requires an efficient architecture. Pun intended? (In contrast, RT-2 is quite heavy and not that RT.)
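On the efficiency point: RT-1 uses TokenLearner to compress the 81 visual tokens per image down to 8 learned tokens before the Transformer, which is what makes real-time (~3 Hz) inference feasible. Below is a minimal sketch of that compression step; the module name, sizes, and the simplified single-linear attention are my assumptions, not the paper's exact implementation.

```python
# Sketch of TokenLearner-style token compression (simplified assumption).
import torch
import torch.nn as nn

class TokenLearnerSketch(nn.Module):
    def __init__(self, d_model=256, num_out_tokens=8):
        super().__init__()
        # Predict one spatial attention map per output token.
        self.attn = nn.Linear(d_model, num_out_tokens)

    def forward(self, tokens):
        # tokens: (B, N, D), e.g. N = 81 patch tokens from the image encoder
        weights = self.attn(tokens).softmax(dim=1)  # (B, N, K) attention maps
        # Weighted pooling: each of the K output tokens is a soft mixture
        # of the N input tokens, shrinking the Transformer's sequence length.
        return torch.einsum('bnk,bnd->bkd', weights, tokens)  # (B, K, D)

tl = TokenLearnerSketch()
out = tl(torch.randn(2, 81, 256))  # (2, 8, 256): 81 tokens compressed to 8
```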

Data is more important than model architecture, and data diversity is more important than data quantity: breadth > scale. Tasks should also be well connected, so that skills can transfer between them.

The main contribution is the large dataset. But how useful is this dataset once the end effector is changed? Perhaps not very.

Key ideas

Technical details

Notes