RetNet: Retentive Network: A Successor to Transformer for Large Language Models

September 2023

tl;dr: Efficient variant of Transformer that achieves training parallelism and low-cost inference while keeping good performance.

Overall impression

RetNet supports three computation paradigms: parallel, recurrent and chunkwise recurrent.
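The chunkwise recurrent paradigm is the least obvious of the three: within each chunk the retention is computed in parallel, while a decayed state matrix carries information across chunks. A toy numpy sketch of this (my own illustration, not the paper's code; single head, no RoPE rotation, `chunk` size is arbitrary) that checks against the full parallel form:

```python
import numpy as np

def retention_chunkwise(Q, K, V, gamma, chunk=2):
    """Chunkwise recurrent retention: parallel within a chunk,
    recurrent state S carried across chunks with decay gamma."""
    T, d = Q.shape
    dv = V.shape[1]
    S = np.zeros((d, dv))           # cross-chunk recurrent state
    out = np.zeros((T, dv))
    for s in range(0, T, chunk):
        q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
        B = q.shape[0]
        n = np.arange(B)
        # within-chunk decay matrix D[i, j] = gamma^(i-j) for i >= j
        D = np.where(n[:, None] >= n[None, :],
                     gamma ** (n[:, None] - n[None, :]), 0.0)
        inner = ((q @ k.T) * D) @ v                     # within-chunk, parallel
        cross = (q @ S) * (gamma ** (n + 1))[:, None]   # carry-over from past chunks
        out[s:s+B] = inner + cross
        # fold this chunk into the state, decaying older entries more
        decay = gamma ** (B - 1 - n)
        S = (gamma ** B) * S + (k * decay[:, None]).T @ v
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 4))
gamma = 0.9
# reference: full-sequence parallel form with decay matrix
i, j = np.arange(5)[:, None], np.arange(5)[None, :]
Dfull = np.where(i >= j, gamma ** (i - j), 0.0)
ref = ((Q @ K.T) * Dfull) @ V
assert np.allclose(retention_chunkwise(Q, K, V, gamma, chunk=2), ref)
```

This is why RetNet can train on long sequences with O(chunk) memory per step yet still expose a parallel form for GPU utilization.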

The Transformer was initially proposed to overcome the sequential training issue of recurrent models. The training parallelism of transformers comes at the cost of inefficient inference. It is the holy grail of efficient language modeling to achieve 1) training parallelism, 2) low-cost inference and 3) strong performance at the same time. This holy grail is also referred to as "the impossible triangle".

Linear attention (such as Fast Transformers) approximates the attention score $\exp(q \cdot k)$ with a kernel product $\phi(q) \cdot \phi(k)$, so that autoregressive inference can be rewritten in a recurrent form. Yet its modeling capability and performance are worse than the Transformer's, hindering its popularity. –> How is RetNet better than linear attention?
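To make the recurrent rewrite concrete, here is a toy numpy sketch (my own, assuming the elu(x)+1 kernel used in Fast Transformers): because the score factorizes as $\phi(q)\cdot\phi(k)$, the sums over past keys/values can be folded into running accumulators, giving O(1) state per decoding step.

```python
import numpy as np

def phi(x):
    # elu(x) + 1, the positive feature map from Fast Transformers
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention_parallel(Q, K, V):
    """Causal linear attention, parallel (training) form."""
    Qp, Kp = phi(Q), phi(K)
    T = Q.shape[0]
    scores = (Qp @ Kp.T) * np.tril(np.ones((T, T)))   # causal mask
    return (scores @ V) / (scores.sum(-1, keepdims=True) + 1e-9)

def linear_attention_recurrent(Q, K, V):
    """Same computation, recurrent (inference) form with O(1) state."""
    d, dv = Q.shape[1], V.shape[1]
    S = np.zeros((d, dv))   # running sum of phi(k) v^T
    z = np.zeros(d)         # running sum of phi(k), for normalization
    out = []
    for q, k, v in zip(Q, K, V):
        qp, kp = phi(q), phi(k)
        S += np.outer(kp, v)
        z += kp
        out.append(qp @ S / (qp @ z + 1e-9))
    return np.stack(out)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 4))
assert np.allclose(linear_attention_parallel(Q, K, V),
                   linear_attention_recurrent(Q, K, V))
```

The two forms are numerically identical; the recurrent one is what enables cheap autoregressive inference.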

RetNet = linear attention + RoPE + explicit exponential decay ($\gamma$)
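The decay term is what replaces linear attention's softmax-style normalization. A minimal single-head sketch (my own, with the RoPE rotation omitted for brevity): the parallel form applies a decay matrix $D_{nm} = \gamma^{n-m}$ (for $n \ge m$, else 0), and the recurrent form updates a state $S_n = \gamma S_{n-1} + k_n^\top v_n$ with output $o_n = q_n S_n$.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel (training) form: (Q K^T ⊙ D) V with decay matrix D."""
    T = Q.shape[0]
    n, m = np.arange(T)[:, None], np.arange(T)[None, :]
    D = np.where(n >= m, gamma ** (n - m), 0.0)
    return ((Q @ K.T) * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent (inference) form: S_n = gamma S_{n-1} + k_n^T v_n."""
    d, dv = Q.shape[1], V.shape[1]
    S = np.zeros((d, dv))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)
        out.append(q @ S)   # o_n = q_n S_n
    return np.stack(out)

rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, 6, 4))
gamma = 0.9
assert np.allclose(retention_parallel(Q, K, V, gamma),
                   retention_recurrent(Q, K, V, gamma))
```

Note there is no per-row normalization as in softmax attention; the decay $\gamma$ bounds the contribution of distant tokens instead (the actual model adds scaling via GroupNorm on the retention output).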

Note that the discussion of transformers in this paper is in the context of decoder-only LLMs, i.e., causal self-attention.

RWKV is very similar to RetNet.

Key ideas

Technical details