VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

July 2023

tl;dr: Compose robot trajectories through a dense sequence of 6DoF end-effector waypoints. Ground LLM output to reality. It is a breakthrough in manipulation tasks: it uses an LLM for in-the-wild cost specification.

Overall impression

LLMs can be a key component in powering embodied AI. Yet two gaps remain: how to condition or ground the input of the LLM to reality, and how to ground the output of the LLM to reality, to fully bridge the perception-action loop. The former challenge can be largely overcome by recent progress in VLMs (vision language models) such as Flamingo and VisionLLM. The latter challenge is trickier. VisionLLM provided some clues on how to expand the LLM vocabulary to include trajectory tokens. VoxPoser takes a brand new approach: it generates the cost function landscape for an optimization problem instead of generating the final trajectory directly. A zeroth-order optimization method is then used to minimize the cost function.
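To make "zeroth-order" concrete, here is a minimal sketch of a random-search optimizer that only evaluates the cost and never uses gradients. It is a generic illustration, not the planner from the paper: the `cost` function, the target and obstacle positions, and all hyperparameters are made-up assumptions.

```python
import math
import random

def cost(p, target=(0.8, 0.2, 0.5), obstacle=(0.4, 0.4, 0.5), radius=0.15):
    """Hypothetical cost: distance to a target plus a penalty near an obstacle."""
    d_target = math.dist(p, target)
    d_obstacle = math.dist(p, obstacle)
    penalty = 10.0 * max(0.0, radius - d_obstacle)
    return d_target + penalty

def random_search(start, n_iters=200, n_samples=32, sigma=0.05, seed=0):
    """Zeroth-order optimization: uses cost evaluations only, no gradients."""
    rng = random.Random(seed)
    best, best_cost = tuple(start), cost(start)
    for _ in range(n_iters):
        for _ in range(n_samples):
            # Perturb the current best point with Gaussian noise.
            cand = tuple(x + rng.gauss(0.0, sigma) for x in best)
            c = cost(cand)
            # Keep the candidate only if it lowers the cost.
            if c < best_cost:
                best, best_cost = cand, c
    return best, best_cost

start = (0.0, 0.0, 0.5)
waypoint, final_cost = random_search(start)
```

Because the optimizer treats the cost as a black box, the cost can be any composition of terms, including non-differentiable ones, which is what makes this family of methods a natural fit for LLM-specified costs.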

VoxPoser extracts language-conditioned affordances and constraints from LLMs and grounds them to the perceptual space using VLMs, through a code interface and without training either component (both the LLM and the VLM are frozen). VoxPoser leverages LLMs to compose the key ingredients for generating robot trajectories (value maps) rather than attempting to train policies on robotic data, which are often limited in amount or variability.
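The idea of composing value maps can be illustrated with a toy voxel grid. The sketch below is an assumption-laden stand-in: the affordance and avoidance maps are hand-coded here, whereas in VoxPoser they would come from LLM-generated code grounded by a VLM. The grid size `N`, the weights, and the greedy neighbour search are all hypothetical choices for illustration.

```python
import itertools
import math

N = 10  # hypothetical grid resolution (voxels per axis)

def affordance_map(target):
    """Lower cost for voxels closer to the target voxel."""
    return {v: math.dist(v, target) for v in itertools.product(range(N), repeat=3)}

def avoidance_map(obstacle, radius=2.0):
    """Positive cost inside a sphere around the obstacle voxel, zero elsewhere."""
    return {v: max(0.0, radius - math.dist(v, obstacle))
            for v in itertools.product(range(N), repeat=3)}

def compose(maps_and_weights):
    """Weighted sum of voxel maps into a single cost map."""
    voxels = itertools.product(range(N), repeat=3)
    return {v: sum(w * m[v] for m, w in maps_and_weights) for v in voxels}

def greedy_path(cost_map, start, max_steps=100):
    """Step to the lowest-cost neighbouring voxel until no neighbour improves."""
    path = [start]
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=3) if any(o)]
    for _ in range(max_steps):
        cur = path[-1]
        nbrs = [tuple(c + d for c, d in zip(cur, o)) for o in offsets]
        nbrs = [n for n in nbrs if n in cost_map]  # drop out-of-grid voxels
        best = min(nbrs, key=cost_map.get)
        if cost_map[best] >= cost_map[cur]:
            break
        path.append(best)
    return path

target, obstacle = (8, 8, 5), (4, 4, 5)
value_map = compose([(affordance_map(target), 1.0), (avoidance_map(obstacle), 5.0)])
path = greedy_path(value_map, start=(0, 0, 5))
```

In this toy setup the obstacle sits directly between start and target, so the resulting waypoint sequence bends around it before reaching the target. Greedy descent can stall in local minima on less friendly maps, so this is only meant to show how separately specified maps combine into one landscape.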

One major drawback of VoxPoser is the lack of end-to-end differentiability. It is zero-shot and requires no finetuning, but this also limits the ability to close the loop on improving the pipeline. If we collect corner-case data in the field, there is no clear way to feed that data back to improve the performance of the online algorithm. --> Cf. RT-2, the first end-to-end differentiable VLA (vision-language-action) model.

The model can work with dynamic perturbations, but these perturbations happen much faster than the robot can move, so essentially no active prediction needs to be performed. The robot just needs to adapt to the new environment.

This work seems to be an extension of Code as Policies (CaP) and its language model generated programs (LMPs). CaP still relies on manually created

Key questions: where does the LLM’s capability to generate affordance maps come from?

Key ideas

Technical details