RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

December 2023

tl;dr: RoboVQA proposes a scalable, bottom-up and intrinsically diverse data collection scheme. Egocentric human data helps.

Overall impression

RoboVQA demonstrates a substantial performance gap between zero-shot SOTA LLM/VLM models and a model finetuned on real embodied data. This indicates a critical need to collect large amounts of grounded data, which motivates the scalable bottom-up data collection scheme that RoboVQA proposes.

RoboVQA proposes bottom-up long-horizon data collection, as opposed to the traditional top-down, step-by-step data collection. It also proposes using humans to collect egocentric videos. By combining the two, RoboVQA achieves a 14x speedup in data collection. The collected data are also highly diverse, relevant and on-distribution for users. (Data collection is both fast and high quality.)

Key ideas

Technical details