Transformer: Attention Is All You Need

June 2020

tl;dr: An architecture built entirely on attention, with no recurrence or convolution, that reaches SOTA on machine translation.

Overall impression

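The Transformer did not invent the attention mechanism, but it was the first sequence transduction model to rely entirely on it, dropping recurrence and convolution, and it applied the idea successfully to NLP.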

It was followed by other SOTA methods in NLP such as BERT, but the more profound contribution is the idea of using an attention module as the basic building block of a neural network.

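Attention, as opposed to recurrent memory, has a constant path length between any two positions: every position can attend to every other position in a single step, regardless of how far apart they are. For this reason attention is sometimes said to have “perfect memory”.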
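
The constant path length can be checked numerically. The sketch below assumes PyTorch and a toy input whose sizes are chosen arbitrarily. It applies one parameter-free self-attention layer and looks at the gradient of the first output position with respect to every input position. A nonzero gradient at every position means each input reaches that output in a single step.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy input: batch of 1, sequence length 6, embedding size 4 (arbitrary sizes).
x = torch.randn(1, 6, 4, requires_grad=True)

# One layer of basic self-attention, no learned parameters.
weights = F.softmax(torch.bmm(x, x.transpose(1, 2)), dim=2)  # (1, 6, 6)
y = torch.bmm(weights, x)                                     # (1, 6, 4)

# Backpropagate from the first output position only.
y[0, 0].sum().backward()

# Gradient magnitude per input position: shape (6,).
grad_per_position = x.grad[0].abs().sum(dim=1)
reaches_all_positions = bool((grad_per_position > 0).all())
```

In a recurrent network the same signal would have to pass through one recurrent step per intermediate position, so the path length grows with the distance between the two positions.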

Key ideas

Technical details


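import torch
import torch.nn.functional as F

# assume we have some tensor x with size (b, t, k): batch, sequence length, embedding dim
x = torch.randn(2, 5, 8)  # placeholder input, sizes chosen arbitrarily
raw_weights = torch.bmm(x, x.transpose(1, 2))  # (b, t, t) pairwise dot products
weights = F.softmax(raw_weights, dim=2)  # (b, t, t) each row sums to 1
y = torch.bmm(weights, x)  # (b, t, k) weighted sum of the input vectors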
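
The snippet above is the parameter-free form of self-attention. The paper's scaled dot-product attention adds learned query, key, and value projections and divides the scores by the square root of the key dimension. A minimal single-head sketch follows. The class name `SelfAttention`, the attribute names, the bias-free `nn.Linear` projections, and the toy sizes are illustrative assumptions, not the reference implementation.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""

    def __init__(self, k: int):
        super().__init__()
        # Learned projections for queries, keys, and values (hypothetical names).
        self.to_queries = nn.Linear(k, k, bias=False)
        self.to_keys = nn.Linear(k, k, bias=False)
        self.to_values = nn.Linear(k, k, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (b, t, k)
        k = x.size(-1)
        queries = self.to_queries(x)
        keys = self.to_keys(x)
        values = self.to_values(x)
        # Scale by sqrt(k) so the softmax does not saturate for large k.
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(k)  # (b, t, t)
        weights = F.softmax(scores, dim=2)  # each row sums to 1
        return torch.bmm(weights, values)   # (b, t, k)


x = torch.randn(2, 5, 8)
y = SelfAttention(k=8)(x)
```

The output keeps the input shape `(b, t, k)`, so the layer can be stacked. The paper runs several such heads in parallel (multi-head attention) and concatenates their outputs, which this single-head sketch leaves out.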