PaLM-E: An embodied multimodal language model

July 2023

tl;dr: Builds an embodied multimodal LLM by injecting continuous embodied observations (images, state estimates) into the language embedding space of a pretrained LLM.

Overall impression

This paper is similar to Gato in that both aim to be generalist agents. Yet there are two distinct differences between them.

Combining the two approaches seems to hinge on translating natural language into low-level control, that is, on grounding the LLM not only on the input side but also on the output side.

PaLM-E transfers knowledge from the visual-language domain into embodied reasoning (robot planning). PaLM-E operates on multimodal sentences, i.e. sequences of text tokens interleaved with other multimodal inputs (state estimates, visuals). Inputs such as images and state estimates are embedded into the same latent embedding space as language tokens and processed by the self-attention layers of a transformer-based LLM in the same way as text.
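
To make the multimodal sentence idea concrete, here is a minimal NumPy sketch, not the paper's implementation. The sizes, the random stand-in weights, the names token_embedding, vision_projection and build_multimodal_sentence, and the single linear projection used as the observation encoder are all assumptions for illustration. The attention step is one unmasked head, a simplification of what a real LLM does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
VOCAB_SIZE, D_MODEL, D_OBS = 100, 16, 8

# Random stand-ins for pretrained weights (assumptions, not real PaLM-E weights).
token_embedding = rng.normal(size=(VOCAB_SIZE, D_MODEL))  # LLM token embedding table
vision_projection = rng.normal(size=(D_OBS, D_MODEL))     # maps observation features to LLM space


def build_multimodal_sentence(segments):
    """Turn interleaved text and observation segments into one embedding sequence.

    segments: list of ("text", token ids) or ("obs", array of shape (n, D_OBS)).
    Returns an array of shape (total_length, D_MODEL).
    """
    parts = []
    for kind, value in segments:
        if kind == "text":
            # Ordinary embedding lookup for text tokens.
            parts.append(token_embedding[np.asarray(value, dtype=int)])
        elif kind == "obs":
            # Continuous observations are projected into the same space as tokens.
            parts.append(np.asarray(value, dtype=float) @ vision_projection)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return np.concatenate(parts, axis=0)


def self_attention(x):
    """Single-head, unmasked self-attention with identity projections (simplified)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


# Three text tokens, four observation vectors, then two more text tokens.
segments = [
    ("text", [5, 17, 3]),
    ("obs", rng.normal(size=(4, D_OBS))),
    ("text", [42, 8]),
]
sequence = build_multimodal_sentence(segments)
output = self_attention(sequence)
print(sequence.shape, output.shape)  # (9, 16) (9, 16)
```

Once the observation vectors are projected to D_MODEL, the attention step cannot tell them apart from text embeddings, which is the sense in which they are processed "in the same way as text".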

PaLM-E is also quite data efficient: on the order of 100 examples are needed. This is great news, as robotics data is scarce.

Key ideas

Technical details