Perceiver: General Perception with Iterative Attention

August 2021

tl;dr: A general architecture to model arbitrary multimodal input.

Overall impression

The paper proposes a general transformer architecture to model multimodal inputs such as image, video, audio, point cloud, etc. Transformers have been rapidly percolating into perception.

Standard transformers scale quadratically with input length, because self-attention compares every token with every other token, and thus cannot handle very large inputs directly.
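To make the scaling problem concrete, here is a back-of-the-envelope sketch that counts the entries of a single self-attention score matrix. The 224 x 224 image size, the pixel-as-token setup, and the function name `self_attention_scores` are illustrative assumptions, not figures taken from these notes.

```python
def self_attention_scores(num_tokens: int) -> int:
    """Entries in one self-attention score matrix (tokens x tokens)."""
    return num_tokens * num_tokens


# Assumed example: every pixel of a 224 x 224 image is treated as one token.
pixels = 224 * 224
print(pixels)                         # 50176
print(self_attention_scores(pixels))  # 2517630976
```

Roughly 2.5 billion scores for one attention map of one layer is why raw pixels are usually not fed to a vanilla transformer.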

The paper still focuses on classification, but it lays a foundation for other types of higher-level tasks such as object detection and segmentation.

Efficient transformer variants include Set Transformer and Linformer. See Efficient Transformers: A Survey for a review. Perceiver scales better than these linear-complexity methods: only its cross-attention step touches the input, so most of the computation is decoupled from the input length.
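The decoupling can be illustrated with a toy cross-attention in which a small latent array supplies the queries and the input supplies keys and values. This is a minimal sketch under simplifying assumptions (single head, no learned projections, pure-Python lists, made-up sizes), not the paper's implementation; `cross_attention` and `random_array` are hypothetical helper names.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latents, inputs):
    """latents: N x D queries; inputs: M x D used as both keys and values.

    Computes N * M scores and returns an N x D array.
    """
    d = len(latents[0])
    out = []
    for q in latents:
        # One score per input element: scaled dot product of query and key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in inputs]
        weights = softmax(scores)
        # Weighted average of the input rows.
        out.append([sum(w * v[j] for w, v in zip(weights, inputs)) for j in range(d)])
    return out


def random_array(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
latents = random_array(4, 8, rng)  # N = 4 latent vectors (toy size)
for m in (16, 256):                # two different input lengths M
    out = cross_attention(latents, random_array(m, 8, rng))
    print(m, len(out), len(out[0]))  # output stays 4 x 8 for both
```

The cross-attention cost grows as N * M, linear in the input length, while any self-attention layers stacked on the N x D output cost N * N each regardless of M.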

The idea shows great potential for end-to-end autonomous driving (ViP3D and UniAD).

Key ideas

Technical details