VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

May 2023

tl;dr: Use pretrained LLM as flexible and unified manager for a variety of vision tasks.

Overall impression

Existing vision foundation models are restricted to tasks in a pre-defined form and struggle to match the open-ended task capability of LLMs. VisionLLM aims to flexibly manage vision-centric tasks (object detection, instance segmentation, etc.) with language instructions.

The main contribution is in the decoder. It seems to be a more user-friendly way to manage/unify multiple tasks than pix2seq v2, and it generalizes to multi-modal input. Also, VisionLLM uses a pretrained LLM, whereas pix2seq is trained from scratch.

The “output format as query” trick seems like a nice way to speed up inference, but it breaks the elegance of the next-token prediction paradigm and has to resort to inductive bias or prior knowledge of specific tasks.
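The speed-up can be sketched as follows. This is a toy illustration (not the paper's code, and the stub "decoder" functions are hypothetical): next-token prediction needs one decoder forward pass per output token, whereas decoding with format queries fills all output slots (e.g. `<cls> <x1> <y1> <x2> <y2>`) in a single parallel pass, at the cost of hard-coding the task's output structure.

```python
# Toy comparison of decoding strategies; model_step / model_batch are
# stand-ins for a decoder forward pass, not real VisionLLM APIs.

def decode_autoregressive(model_step, prompt, num_tokens):
    """Next-token prediction: num_tokens sequential forward passes."""
    seq = list(prompt)
    calls = 0
    for _ in range(num_tokens):
        seq.append(model_step(seq))  # each token depends on the previous ones
        calls += 1
    return seq[len(prompt):], calls

def decode_with_format_queries(model_batch, prompt, format_queries):
    """Output-format-as-query: the known output structure is given as
    queries, so all slots are decoded in one parallel forward pass."""
    outputs = model_batch(prompt, format_queries)
    return outputs, 1  # a single decoder call, regardless of slot count

# Stub decoders (deterministic placeholders for illustration only).
stub_step = lambda seq: f"tok{len(seq)}"
stub_batch = lambda prompt, queries: [f"val_for_{q}" for q in queries]

bbox_format = ["<cls>", "<x1>", "<y1>", "<x2>", "<y2>"]
_, ar_calls = decode_autoregressive(stub_step, ["<img>"], len(bbox_format))
_, pq_calls = decode_with_format_queries(stub_batch, ["<img>"], bbox_format)
print(ar_calls, pq_calls)  # 5 sequential calls vs 1 parallel call
```

The trade-off in the sketch mirrors the critique above: the parallel path only works because `bbox_format` encodes task-specific prior knowledge of the output structure.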

VisionLLM seems to be the only work so far that taps into the potential of a pretrained LLM while also outputting fine-grained control signals (such as bbox coordinates). –> This is a great inspiration for leveraging LLMs for prediction and planning in autonomous driving.

Key ideas

Technical details