PowerNorm: Rethinking Batch Normalization in Transformers

December 2021

tl;dr: A better alternative to LayerNorm in transformers (even for NLP).

Overall impression

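PowerNorm (PN) outperforms LayerNorm (LN) on NLP tasks: the paper's abstract reports gains of 0.4/0.6 BLEU on IWSLT14/WMT14 and 5.6/3.0 PPL on PTB/WikiText-103. It works for CV tasks as well. Like BN, it normalizes with fixed running statistics at inference, so it keeps BN's advantage of being fusable into the subsequent linear layer.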
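
To make the fusion point concrete, here is a minimal sketch of folding an inference-time normalization into the linear layer that follows it. It assumes the normalization reduces to a fixed per-channel scale and shift, y = gamma * x / psi + beta, with psi a stored running statistic. That form is an assumption of this sketch and is not spelled out in these notes. All function names and numbers are hypothetical and chosen only for illustration.

```python
import math

def scale_norm(x, psi, gamma, beta):
    # Assumed inference-time form: per-channel scale and shift with a
    # fixed running statistic psi (no batch statistics are computed).
    return [g * x_j / p + b for x_j, p, g, b in zip(x, psi, gamma, beta)]

def linear(W, b, x):
    # y = W x + b for a single input vector x.
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def fuse_into_next_linear(W, b, psi, gamma, beta):
    # W (s * x + beta) + b == (W diag(s)) x + (W beta + b), with s = gamma / psi.
    s = [g / p for g, p in zip(gamma, psi)]
    W_fused = [[w * s_j for w, s_j in zip(row, s)] for row in W]
    b_fused = [b_i + sum(w * be for w, be in zip(row, beta))
               for row, b_i in zip(W, b)]
    return W_fused, b_fused

# Hypothetical toy values.
W = [[1.0, 2.0], [0.5, -1.0]]
b = [0.1, -0.2]
psi = [2.0, 0.5]
gamma = [1.5, 0.8]
beta = [0.0, 0.3]
x = [3.0, -1.0]

# Two-step path: normalize, then apply the linear layer.
reference = linear(W, b, scale_norm(x, psi, gamma, beta))

# Fused path: a single linear layer with adjusted weights and bias.
W_fused, b_fused = fuse_into_next_linear(W, b, psi, gamma, beta)
fused = linear(W_fused, b_fused, x)

assert all(math.isclose(r, f, rel_tol=1e-9, abs_tol=1e-12)
           for r, f in zip(reference, fused))
```

LayerNorm does not allow this folding, because its statistics are computed from each input at inference time rather than stored as constants.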

BN-FFN-BN is a similar work, but with a different focus.

Key ideas

Technical details