layer normalization


Note

  • Each layer in a transformer, consisting of a multi-head attention module and a feed-forward layer, employs layer normalization and residual connections.
  • belongs to regularization for neural networks, which helps control overfitting by keeping the values flowing through the network from getting too big or too small.
    • The layer norm step adjusts the values coming out of a layer so that they have a mean of 0 and a standard deviation of 1 (roughly the shape of a standard Gaussian), with a learnable scale and shift applied afterward.
  • A popular approach places the layer norm just before the addition step of a skip connection. Since these two operations always come in pairs, it’s convenient to combine them into a single operation that we call norm-add (sketched after this list).
  • see also batch normalization
  • requires initialization of its learnable scale and shift parameters (typically 1 and 0)
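
A minimal NumPy sketch of these two steps, assuming the norm is applied to the sublayer output just before it is added back onto the skip path; the names `layer_norm` and `norm_add` and the toy values are illustrative, not a reference implementation:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # standardize the last axis to mean 0 / std 1, then apply the
    # learnable scale (gamma) and shift (beta); eps avoids division by zero
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def norm_add(x, sublayer_out, gamma, beta):
    # "norm-add": layer norm placed just before the addition step
    # of the skip connection
    return x + layer_norm(sublayer_out, gamma, beta)

d_model = 4
gamma = np.ones(d_model)    # scale parameters, initialized to 1
beta = np.zeros(d_model)    # shift parameters, initialized to 0

x = np.array([0.5, -1.2, 3.0, 0.1])             # values on the skip path
sublayer_out = np.array([2.0, 0.4, -0.6, 1.1])  # e.g. attention output

print(norm_add(x, sublayer_out, gamma, beta))
```

Framework implementations such as torch.nn.LayerNorm fold eps into the variance before the square root rather than adding it to the standard deviation, but the idea is the same; note the scale is initialized to 1 and the shift to 0, which is the initialization mentioned above.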

Resources


```dataview
table file.inlinks, file.outlinks from [[]] and !outgoing([[]]) AND -"Changelog"
```