skip connection
scroll ↓ to Resources
Contents
- [[#Note|Note]]
- [[#Note#Reasoning|Reasoning]]
- [[#Note#How|How]]
- [[#Note#Result|Result]]
- [[#Note#Advantages|Advantages]]
- [[#Resources|Resources]]
Note
- Introduced in the original ResNet paper
Reasoning
- A tensor comes in, and the layer does some processing to change that tensor. If the layer only needs to change the input by small amounts, or only in some places, then it would be wasteful to expend resources processing the parts of the tensor that don’t need to change.
- Allows training very deep networks
- Also see training degradation:
- Authors of [1512.03385] Deep Residual Learning for Image Recognition noticed that deeper networks do not always give better test accuracy, and their training accuracy suffers as well. Why is that?
- ==Could it be an overfitting issue?==
- No: if it were overfitting, the deeper network would at least reach a lower training error, but in practice its training error is higher too.
- A 56-layer network could in principle keep 36 “empty” (identity) layers and let the other 20 do the job of a 20-layer network, but plain networks fail to learn that solution.
- To prevent this training degradation, the authors proposed the skip connection (a sketch of the argument follows).
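A sketch of that argument in the notation of the ResNet paper, where H(x) is the mapping a block should learn and F(x) is what the residual branch actually computes:

$$
\text{plain block: } y = H(x) \qquad\qquad \text{residual block: } y = F(x) + x, \quad F(x) := H(x) - x
$$

If the best a block can do is pass its input through unchanged, i.e. H(x) = x, the residual branch only needs to learn F(x) = 0, which it can do by pushing its weights toward zero. That is far easier for the optimizer than making a stack of nonlinear layers approximate an exact identity.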
How
- Residual connections propagate the input forward and add it to the output of one or more layers. This makes optimization easier and also helps with vanishing and exploding gradients.
- The skip connection works because each layer only has to learn the residual, the difference between its input and the desired output, while cooperating with the rest of the network. Since the identity path is already part of the network, the layer learns that it doesn’t need to reproduce the parts of the tensor that stay unchanged. That makes its job simpler, so it can be smaller and faster (see the sketch below).
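A minimal sketch of such a block in PyTorch (an illustration, not the exact block from the paper; the two-convolution structure and the `channels` parameter are my assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Toy residual block: output = relu(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        # Residual branch F(x): two convolutions that preserve shape,
        # so the result can be added to the input.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # The skip connection: add the unchanged input back in.
        # If the block has nothing useful to add, it can learn residual ≈ 0.
        return F.relu(residual + x)

x = torch.randn(1, 16, 32, 32)
y = ResidualBlock(16)(x)
print(y.shape)  # torch.Size([1, 16, 32, 32])
```

The only requirement is that the residual branch’s output and the input have matching shapes; when they don’t, the original ResNet applies a 1×1 convolution on the skip path before adding.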
Result
- Unrolled, a network with skip connections is an ensemble of many input-to-output paths of different lengths, most of them much shorter than the full depth. During backpropagation, the gradients with the highest magnitude travel along the shorter paths, so most of the training effectively happens through them (see the sketch after this list).
- In practice, the average path length the training signal travels is about half of the nominal depth (shorter by a factor of roughly 2).
- Representations change slowly from layer to layer: the activations on the two sides of any given layer differ only a little.
- Dropping a single layer from a network with skip connections hurts it much less than dropping one from a network without them.
- The loss surface becomes smoother and easier to optimize.
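A small sketch of the path-counting view from the first two bullets (my illustration; it counts paths combinatorially and does not measure actual gradient magnitudes):

```python
from math import comb

# Number of residual blocks; e.g. a 110-layer ResNet for CIFAR stacks 54 two-layer blocks.
n = 54

# Unrolling: each block is either traversed (through its residual branch) or skipped
# (identity), so every subset of blocks is a distinct input-to-output path.
paths_by_length = {k: comb(n, k) for k in range(n + 1)}

total_paths = sum(paths_by_length.values())                       # 2**n paths in total
avg_length = sum(k * c for k, c in paths_by_length.items()) / total_paths

print(total_paths == 2 ** n)  # True
print(avg_length)             # 27.0 -> about half of the nominal depth
```

Path lengths follow a binomial distribution with mean n/2, which is consistent with the “shorter by a factor of roughly 2” observation above.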
Advantages
- Lets layers be smaller and faster, and improves the flow of gradients in backpropagation, which lets us efficiently train networks of dozens or even hundreds of layers
- Weights evolve slowly ⇒ good for interpretability
- Applied in advanced architectures such as the Transformer and the ResNet family
- Transformers use skip connections not just for efficiency and speed, but also because the residual path lets the model keep track of the position of each element in its input
Resources
- [1512.03385] Deep Residual Learning for Image Recognition - the main/first paper
- Modern Deep Learning (2023). Lecture 10. Skip Connections
Links to this File
table file.inlinks, file.outlinks from [[]] and !outgoing([[]]) AND -"Changelog"