Pix2seq: A Language Modeling Framework for Object Detection

May 2023

tl;dr: Formulate object detection as a language modeling task.

Overall impression

The overall intuition is that if a neural network already knows where the objects are and what they are, we just need to teach it how to read them out (like pouring dumplings out of a teapot: the contents are there, the challenge is getting them out). Pix2seq formulates object detection as a language modeling task conditioned on the observed pixel inputs.

Classic object detection methods (including DETR) explicitly integrate prior knowledge about the object detection task, especially through handcrafted network architectures and loss functions.

Pix2seq has essentially zero inductive bias or prior knowledge specific to object detection, making it truly end-to-end compared with ViT and DETR. Prior knowledge helps with convergence (cf. DETR and its successors), but may lower the performance ceiling.

The underlying methodology of language modeling has been shown capable of modeling various kinds of sequential data. Pix2seq enriches this portfolio and shows that it works even for non-sequential data by turning a set of objects into a sequence of tokens. The ordering of the objects does not matter, with random ordering working best.
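The set-to-sequence idea can be sketched as follows: each bounding box is quantized into discrete coordinate tokens, the class label is appended as a token from a separate vocabulary range, and the objects are serialized in random order. The bin count, vocabulary layout, and function names below are illustrative assumptions, not the paper's exact settings.

```python
import random

def tokenize_objects(objects, num_bins=1000, seed=0):
    """Serialize a set of (box, class) objects into a flat token sequence.

    Each box (ymin, xmin, ymax, xmax), with coordinates normalized to
    [0, 1], is quantized into `num_bins` discrete coordinate tokens; the
    class label is offset past the coordinate vocabulary. Objects are
    emitted in random order, reflecting the finding that random ordering
    works best. Vocabulary layout here is an illustrative assumption.
    """
    objects = list(objects)
    random.Random(seed).shuffle(objects)  # random object ordering
    tokens = []
    for box, cls in objects:
        for coord in box:  # quantize each coordinate to a bin index
            tokens.append(min(int(coord * num_bins), num_bins - 1))
        tokens.append(num_bins + cls)  # class tokens follow coordinate bins
    return tokens

# Two objects -> 2 * (4 coordinate tokens + 1 class token) = 10 tokens
seq = tokenize_objects([((0.1, 0.2, 0.5, 0.6), 3),
                        ((0.0, 0.0, 1.0, 1.0), 7)])
```

A decoder trained with next-token prediction on such sequences can then "read out" detections autoregressively, with no box-matching loss or anchor design.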

Follow-up works include Pix2seq v2, Unified-IO, and UniTab. The author also created the self-supervised learning scheme SimCLR.

Key ideas

Technical details