Wayformer: Motion Forecasting via Simple & Efficient Attention Networks

July 2022

tl;dr: A simple neural network architecture that handles multi-modality input for prediction networks.

Overall impression

Wayformer is a simple and homogeneous network architecture to handle diverse and heterogeneous inputs to the motion forecasting task.

This paper seems to be heavily inspired by Perceiver and Perceiver IO, both pioneering works on unifying multimodal input. Wayformer views the input to prediction (aka motion forecasting) networks as multi-modal: road geometry, lane connectivity, time-varying traffic light state, and the history of a dynamic set of agents and their interactions.
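
To make the "heterogeneous inputs, homogeneous network" idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the modality names, token counts, feature widths, `D_MODEL`, and the random stand-in weights are all hypothetical, and the attention uses identity query/key/value maps for brevity. It only illustrates projecting each modality to a shared token width, concatenating the tokens, and running one self-attention layer over all of them.

```python
import math
import random

random.seed(0)
D_MODEL = 8  # shared token width (hypothetical value)

def random_matrix(rows, cols):
    """Stand-in for learned weights or input features."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention (identity Q/K/V maps)."""
    d = len(tokens[0])
    transposed = [list(col) for col in zip(*tokens)]
    scores = matmul(tokens, transposed)  # (n x n) pairwise similarities
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, tokens)

# Hypothetical modalities: name -> (number of tokens, raw feature width).
modalities = {
    "road_graph": (6, 5),
    "traffic_lights": (2, 3),
    "agent_history": (4, 7),
}

fused_tokens = []
for name, (num_tokens, feat_dim) in modalities.items():
    raw = random_matrix(num_tokens, feat_dim)  # fake input features
    proj = random_matrix(feat_dim, D_MODEL)    # per-modality projection
    fused_tokens.extend(matmul(raw, proj))     # every token is now D_MODEL wide

encoded = self_attention(fused_tokens)
print(len(encoded), len(encoded[0]))
```

After the per-modality projection, the network no longer needs to know which modality a token came from, so the same attention block can be reused for every input type.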

Many previous works (Multipath, Multipath++) focus on handling the multimodality of the output space of the prediction network, while Wayformer focuses on the input space. Wayformer also builds on top of Multipath++ and reuses its decoder and loss design. The simplicity of the resulting design suggests that motion forecasting also conforms to Occam’s Razor.

The paper also predicts the future of only a single agent. This is very different from Scene Transformer, which predicts the entire scene at once (similar to the difference between single-stage and two-stage methods for object detection).

There are many facets of motion forecasting (behavior prediction). This can be compared with the review notes on Multipath++.

Key ideas

Technical details