GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration

December 2023

tl;dr: Zero-shot robot operation from human demonstration. Observations of human actions are integrated to facilitate robotic manipulation. First work to leverage third-person videos as robotic demonstrations.

Overall impression

GPT4V-robotics extracts affordance insights from human action videos for grounded robotic manipulation. It does so through spatiotemporal grounding with the open-vocabulary detector Detic, focusing on hand-object relationships (grasp, release, etc.).
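
These notes do not say how grasp and release moments are found, so the following is only a minimal sketch of one plausible approach. It assumes a detector has already given a per-frame hand-object distance. The function name `grasp_release_frames`, the distance values, and the threshold are all hypothetical, not taken from the paper.

```python
from typing import List, Optional, Tuple


def grasp_release_frames(
    distances: List[float], threshold: float
) -> Tuple[Optional[int], Optional[int]]:
    """Return (grasp_frame, release_frame) from per-frame hand-object distances.

    Assumption (not from the paper): a grasp starts at the first frame where
    the distance drops below `threshold`, and the release is the first later
    frame where the distance is at or above `threshold` again.
    """
    grasp: Optional[int] = None
    release: Optional[int] = None
    for frame, distance in enumerate(distances):
        if grasp is None:
            if distance < threshold:
                grasp = frame
        elif distance >= threshold:
            release = frame
            break
    return grasp, release


# Hypothetical distances (in meters) over seven video frames.
example_distances = [0.30, 0.12, 0.04, 0.03, 0.05, 0.20, 0.35]
grasp_frame, release_frame = grasp_release_frames(example_distances, threshold=0.10)
print(grasp_frame, release_frame)
```

A fixed threshold is the simplest choice here; a real system would likely need smoothing over frames to avoid spurious grasp or release events from detector noise.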

GPT4V-robotics uses general-purpose off-the-shelf language models. It is highly flexible: it can be adapted to various hardware configurations via prompt engineering, and it can benefit from improvements in the LLM field.

Integrating vision into task planning opens up the possibility of developing task planners based on multimodal human instructions (instructions plus human demo video, etc.).

A TAMP (task and motion planning) framework has two parts. An off-the-shelf LLM decomposes human instructions into high-level subgoals. Pretrained skills (learned via IL or RL) then achieve the subgoals (atomic skills).
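
The two-part split described above can be sketched as a planner that emits subgoals and a skill library that executes them. This is only an illustration: `stub_planner` stands in for the LLM call and returns a fixed plan, and the skill names and `Subgoal` fields are hypothetical, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subgoal:
    skill: str   # name of an atomic skill, e.g. "grasp"
    target: str  # object or location the skill acts on


def stub_planner(instruction: str) -> List[Subgoal]:
    """Hypothetical stand-in for the LLM: returns a fixed plan for any input."""
    return [
        Subgoal("grasp", "cup"),
        Subgoal("move", "shelf"),
        Subgoal("release", "cup"),
    ]


# Hypothetical skill library; real skills would be policies trained via IL or RL.
SKILLS: Dict[str, Callable[[str], str]] = {
    "grasp": lambda target: f"grasped {target}",
    "move": lambda target: f"moved to {target}",
    "release": lambda target: f"released {target}",
}


def execute(plan: List[Subgoal], skills: Dict[str, Callable[[str], str]]) -> List[str]:
    """Run each subgoal with its matching skill; fail loudly on unknown skills."""
    log: List[str] = []
    for subgoal in plan:
        if subgoal.skill not in skills:
            raise KeyError(f"no pretrained skill named {subgoal.skill!r}")
        log.append(skills[subgoal.skill](subgoal.target))
    return log


plan = stub_planner("put the cup on the shelf")
log = execute(plan, SKILLS)
print(log)
```

Keeping the planner output as plain skill names plus targets means the skill library can be swapped for a different robot without changing the planner, which matches the flexibility argument made for prompt-based adaptation.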

The concept of affordance derives from the literature in psychology and cognitive science. It refers to the potential for action that objects or situations in an environment provide to an individual (see the explanation of Gibson's concept on Zhihu). In robotics, it focuses on executable actions and where such actions are possible (cf. SayCan, VoxPoser).

How the affordance information is used to enrich the task plan is still unclear. -> May need to read earlier works from the same authors.

Key ideas

Technical details


This is like teaching a robot to cook by showing it cooking videos. First, the robot watches the videos and uses GPT-4V to understand what’s happening, like chopping onions or stirring a pot. Then, using GPT-4, it plans how to do these tasks itself. It pays special attention to how hands interact with objects, like how to hold a knife or a spoon. This way, the robot learns to cook various dishes just by watching videos, without needing someone to teach it each step directly.