An Attention Free Transformer

September 2023

tl;dr: A new mechanism that replaces dot-product attention with a learned pairwise position bias. No attention map!

Overall impression

The conventional scaled dot-product attention mechanism has quadratic time and space complexity wrt the context size. Many previous works (such as the linear attention in "Transformers are RNNs") try to approximate the full attention operation.

In AFT, K and V (the context) are first combined with a set of learned position biases. This step generates a reduced context, akin to compressing a dictionary. The lookup of the query in this dictionary is then performed by element-wise multiplication.

AFT maintains direct interaction between any two points in the context, a major advantage of dot-product attention.
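The two steps above can be sketched as follows. This is a minimal numpy sketch of the AFT-full operation, assuming single-head inputs `Q`, `K`, `V` of shape `(T, d)` and a learned pairwise position bias `w` of shape `(T, T)`: the context `(K, V)` is pooled with softmax-style weights `exp(K_{t'} + w_{t,t'})`, and the sigmoid-gated query then multiplies the pooled result element-wise. Variable names here are illustrative, not from the paper's code.

```python
import numpy as np

def aft_full(Q, K, V, w):
    """AFT-full sketch. Q, K, V: (T, d); w: (T, T) learned position biases."""
    # Combine keys with pairwise position biases, broadcast over features.
    logits = K[None, :, :] + w[:, :, None]        # (T, T, d)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)                      # (T, T, d)
    # Weighted pooling of values: the "reduced context" / compressed dictionary.
    num = np.einsum('tsd,sd->td', weights, V)     # (T, d)
    den = weights.sum(axis=1)                     # (T, d)
    # Element-wise "lookup" by the sigmoid-gated query -- no T x T attention map
    # per feature needs to be materialized in the memory-efficient formulation.
    sigma_q = 1.0 / (1.0 + np.exp(-Q))            # sigmoid gate
    return sigma_q * (num / den)                  # (T, d)
```

Note that every position t' still contributes to every output position t through `w[t, t']`, which is the direct pairwise interaction mentioned above.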

Key ideas

Technical details