Pix2seq v2: A Unified Sequence Interface for Vision Tasks

June 2023

tl;dr: Extension of Pix2seq to multiple core vision tasks.

Overall impression

The paper shows that a diverse set of core computer vision tasks can be unified when formulated in terms of a shared pixel-to-sequence interface. These tasks include object detection (the only task supported by pix2seq), instance segmentation, keypoint detection, and image captioning.

The formulations of these vision-centric tasks differ significantly in the form of their outputs, so customized models with specialized architectures and loss functions are usually designed for each task. To unify them in one single model, a unified interface has to be created.

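Pix2seq_v2 extends pix2seq with a new formulation. Pix2seq is a conditioned sequence generation task: the output tokens are conditioned on image features, which are heterogeneous to the token language itself. Pix2seq_v2 is a conditioned sequence completion task: the model is given a task prompt written in the same token language and completes the rest of the sequence.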
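To make "sequence completion" concrete, here is a minimal sketch of how a detection target could be laid out as prompt plus completion. The vocabulary layout (`TASK_TOKENS`, `COORD_OFFSET`, `CLASS_OFFSET`), the bin count, and the helper names are illustrative assumptions, not the paper's actual values. The box ordering (ymin, xmin, ymax, xmax, class) and the zero loss weight on prompt tokens follow my reading of the pix2seq papers.

```python
from typing import List, Sequence, Tuple

# Hypothetical vocabulary layout, for illustration only.
NUM_BINS = 1000                      # assumed number of coordinate bins
TASK_TOKENS = {"detect": 0, "segment": 1, "keypoint": 2, "caption": 3}
COORD_OFFSET = 10                    # coordinate tokens start here
CLASS_OFFSET = COORD_OFFSET + NUM_BINS  # class tokens start after the bins


def quantize(value: float, size: float) -> int:
    """Map a pixel coordinate in [0, size] to a discrete coordinate token."""
    bin_index = min(int(value * NUM_BINS / size), NUM_BINS - 1)
    return COORD_OFFSET + bin_index


def box_tokens(box: Sequence[float], class_id: int,
               image_hw: Tuple[int, int]) -> List[int]:
    """Encode one box as [ymin, xmin, ymax, xmax, class] tokens."""
    ymin, xmin, ymax, xmax = box
    height, width = image_hw
    return [
        quantize(ymin, height),
        quantize(xmin, width),
        quantize(ymax, height),
        quantize(xmax, width),
        CLASS_OFFSET + class_id,
    ]


def build_example(task: str, prompt_extra: List[int],
                  target: List[int]) -> Tuple[List[int], List[float]]:
    """Concatenate prompt and target; only target tokens contribute to the loss."""
    prompt = [TASK_TOKENS[task]] + prompt_extra
    tokens = prompt + target
    loss_weights = [0.0] * len(prompt) + [1.0] * len(target)
    return tokens, loss_weights


# One box in a 480x640 image, class id 7, with a bare task prompt.
target = box_tokens((48, 64, 240, 320), class_id=7, image_hw=(480, 640))
tokens, loss_weights = build_example("detect", prompt_extra=[], target=target)
```

In this layout, switching tasks changes only the prompt and the meaning of the completed tokens. For instance segmentation, the box tokens could be passed through `prompt_extra` and the target would be polygon coordinates instead.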

Many vision-centric tasks can be treated as image captioning or visual question answering (VQA) in a task-specific language dialect, for example a language written in the format of a specific JSON schema.
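As a toy illustration of the "dialect" analogy, the sketch below writes a detection result as a string in a made-up JSON schema. The schema (`task`, `objects`, `label`, `box`) and the function name are assumptions for illustration. The paper itself uses quantized tokens rather than JSON text.

```python
import json
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (ymin, xmin, ymax, xmax) in pixels


def detections_to_dialect(detections: List[Tuple[str, Box]]) -> str:
    """Serialize detections into one string under a hypothetical schema."""
    records = [
        {"label": label, "box": list(box)}
        for label, box in detections
    ]
    return json.dumps({"task": "detect", "objects": records}, sort_keys=True)


# The "caption" for this image is simply the serialized detection list.
answer = detections_to_dialect([("dog", (48, 64, 240, 320))])
parsed = json.loads(answer)
```

Seen this way, a detector and a captioner differ only in the dialect of the text they emit, which is what makes a single sequence model plausible.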

This work is further extended by VisionLLM, which leverages a pretrained LLM.

Key ideas

Technical details