So, why do ResNets work so well?

Let's go through one example that illustrates why ResNets work so well,

at least in the sense of how you can make them deeper and deeper without really

hurting your ability to at least get them to do well on the training set.

And hopefully as you've understood from the third course in this sequence,

doing well on the training set is usually a prerequisite to doing

well on your hold-out, or on your dev, or on your test sets.

So, being able to at least train a ResNet to do well on

the training set is a good first step toward that. Let's look at an example.

What we saw in the last video was that if you make a network deeper,

it can hurt your ability to train the network to do well on the training set.

And that's why sometimes you don't want a network that is too deep.

But this is not true, or at least is much less true, when you're training a ResNet.

So let's go through an example.

Let's say you have X feeding into

some big neural network that just outputs some activation a[l].

Let's say for this example that you are going to modify

the neural network to make it a little bit deeper.

So, use the same big NN,

and this outputs a[l],

and we're going to add a couple extra layers to this network so

let's add one layer there and another layer there.

And this will output a[l+2].

Only let's make this a ResNet block,

a residual block with that extra shortcut.

And for the sake of our argument,

let's say throughout this network we're using the ReLU activation functions.

So, all the activations are going to be greater than or equal to zero,

with the possible exception of the input X.

Right, because the ReLU activation outputs numbers that are either zero or positive.

Now, let's look at what a[l+2] will be.

To copy the expression from the previous video,

a[l+2] will be ReLU applied to z[l+2]

plus a[l], where this addition of a[l]

comes from the shortcut, from the skip connection that we just added.

And if we expand this out,

this is equal to g(w[l+2] a[l+1] + b[l+2] + a[l]),

where the w[l+2] a[l+1] + b[l+2] part is z[l+2].

Now notice something: if you are using L2 regularization or weight decay,

that will tend to shrink the value of w[l+2].

If you are applying weight decay to b, that will also shrink this, although

I guess in practice sometimes you do and sometimes you don't apply weight decay to b,

but W is really the key term to pay attention to here.

And if w[l+2] is equal to zero,

and let's say for the sake of argument that b[l+2] is also equal to zero,

then these terms go away because they're equal to zero,

and then you're left with g of a[l],

and this is just equal to a[l] because we assumed we're using the ReLU activation function.

And so all of the activations are non-negative, and so

g of a[l] is ReLU applied to a non-negative quantity,

so you just get back, a[l].

So, what this shows is that the identity function is easy for a residual block to learn.

And it's easy to get a[l+2] equal to a[l] because of this skip connection.
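
As a minimal sketch of that argument, assuming a fully connected version of the residual block written in NumPy (the helper names relu and residual_block and the layer size of 4 are hypothetical, chosen only for illustration), you can check that zeroing out w[l+2] and b[l+2] makes the block copy a[l] through unchanged:

```python
import numpy as np

def relu(z):
    # ReLU: outputs are always zero or positive
    return np.maximum(z, 0.0)

def residual_block(a_l, W1, b1, W2, b2):
    # Fully connected sketch of a[l+2] = g(z[l+2] + a[l]) with g = ReLU
    a_l1 = relu(W1 @ a_l + b1)   # a[l+1]
    z_l2 = W2 @ a_l1 + b2        # z[l+2]
    return relu(z_l2 + a_l)      # skip connection adds a[l] before the ReLU

rng = np.random.default_rng(0)
n = 4                                   # hypothetical layer width
a_l = relu(rng.standard_normal(n))      # non-negative, like the output of an earlier ReLU
W1 = rng.standard_normal((n, n))        # w[l+1] can be anything
b1 = rng.standard_normal(n)
W2 = np.zeros((n, n))                   # w[l+2] shrunk all the way to zero
b2 = np.zeros(n)                        # b[l+2] also zero

a_l2 = residual_block(a_l, W1, b1, W2, b2)
identity_recovered = np.allclose(a_l2, a_l)
```

Here identity_recovered comes out True no matter what W1 and b1 are, because z[l+2] is zero and ReLU leaves the non-negative a[l] untouched.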

And what that means is that adding these two layers in your neural network,

it doesn't really hurt your neural network's ability to do as

well as this simpler network without these two extra layers,

because it's quite easy for it to learn the identity function to just copy

a[l] to a[l+2] despite the addition of these two layers.

And this is why adding two extra layers,

adding this residual block to somewhere in

the middle or the end of this big neural network, doesn't hurt performance.

But of course our goal is to not just not hurt performance,

it's to help performance, and so you can imagine that if all of

these hidden units actually learned something useful, then

maybe you can do even better than learning the identity function.

And what goes wrong in very deep plain nets, in very deep networks without

these residual or skip connections, is

that when you make the network deeper and deeper,

it's actually very difficult for it to choose parameters that learn

even the identity function which is why a lot of layers

end up making your result worse rather than making your result better.

And I think the main reason the residual network works is

that it's so easy for these extra layers to learn

the identity function that you're kind of guaranteed that it doesn't hurt

performance, and then a lot of the time you maybe get lucky and it even helps performance.

At least it's easier to go from a decent baseline of not

hurting performance, and then gradient descent can only improve the solution from there.

So, one more detail in the residual network that's

worth discussing is that through this addition here,

we're assuming that z[l+2] and a[l] have the same dimension.

And so what you see in ResNet is a lot of use of same convolutions

so that the dimension of this is

equal to the dimension, I guess, of this layer or the output layer.

So that we can actually do this shortcut connection,

because same convolutions preserve dimensions,

and so that makes it easier for you to carry out

this shortcut and then carry out this addition of two equal dimension vectors.

In case the input and output have different dimensions, so for example,

if a[l] is 128 dimensional and z[l+2], and therefore

a[l+2], is 256 dimensional as an example,

what you would do is add an extra matrix and then call that Ws over here,

and Ws in this example would be a 256 by 128 dimensional matrix.

So then Ws times a[l] becomes 256 dimensional and

this addition is now between

two 256 dimensional vectors. And there are a few things you could do with Ws:

it could be a matrix of parameters we learn,

or it could be a fixed matrix that just implements

zero padding, that takes a[l] and then zero

pads it to be 256 dimensional. And either of those versions I guess could work.
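
To make those two options for Ws concrete, here is a small NumPy sketch under the same 128 and 256 dimensional assumption. The names zero_pad_matrix and shortcut_add are hypothetical, and the random values stand in for a matrix that would really be learned during training:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def zero_pad_matrix(n_out, n_in):
    # Fixed Ws that copies a[l] into the first n_in entries and pads the rest with zeros
    # (assumes n_out >= n_in)
    Ws = np.zeros((n_out, n_in))
    Ws[:n_in, :n_in] = np.eye(n_in)
    return Ws

def shortcut_add(z_l2, a_l, Ws):
    # a[l+2] = g(z[l+2] + Ws a[l]); Ws brings a[l] up to the dimension of z[l+2]
    return relu(z_l2 + Ws @ a_l)

rng = np.random.default_rng(0)
a_l = relu(rng.standard_normal(128))    # 128 dimensional a[l]
z_l2 = rng.standard_normal(256)         # 256 dimensional z[l+2]

# Option 1: Ws as a matrix of parameters (random placeholder for learned values)
Ws_learned = 0.01 * rng.standard_normal((256, 128))
# Option 2: Ws as a fixed matrix that implements zero padding
Ws_padded = zero_pad_matrix(256, 128)

out_learned = shortcut_add(z_l2, a_l, Ws_learned)
out_padded = shortcut_add(z_l2, a_l, Ws_padded)
```

In both cases Ws times a[l] is 256 dimensional, so the addition with z[l+2] is well defined.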

So finally, let's take a look at ResNets on images.

So these are images I got from the paper by He et al.

This is an example of a plain network, in which you input an image

and then have a number of conv layers

until eventually you have a softmax output at the end.

To turn this into a ResNet,

you add those extra skip connections.

And I'll just mention a few details,

there are a lot of three by three convolutions here and most of these are

three by three same convolutions

and that's why you're adding equal dimension feature vectors.

So rather than a fully connected layer,

these are actually convolutional layers, but because they're same convolutions,

the dimensions are preserved and so the z[l+2] plus a[l] addition makes sense.

And similar to what you've seen in a lot of networks before,
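
As a quick check of why same convolutions keep the dimensions equal, here is a sketch using the standard output-size formula for a convolution, floor((n + 2p - f) / s) + 1. The function names and the input size of 56 are hypothetical, chosen only for illustration:

```python
def conv_output_size(n, f, padding, stride=1):
    # Output height/width of a convolution: floor((n + 2p - f) / s) + 1
    return (n + 2 * padding - f) // stride + 1

def same_padding(f):
    # Padding that preserves the size for an odd filter size f at stride 1
    return (f - 1) // 2

n = 56  # hypothetical input height/width
same_out = conv_output_size(n, f=3, padding=same_padding(3))   # three by three same convolution
valid_out = conv_output_size(n, f=3, padding=0)                # no padding, for comparison
```

With the same convolution the size stays at n, so a[l] and z[l+2] line up for the addition; without padding each three by three convolution would shrink the size by 2.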

you have a bunch of convolutional layers and then there are

occasionally pooling layers as well, or pooling-like layers.

And whenever one of those things happens,

then you need to make an adjustment to the dimension, which we saw on the previous slide

you can do with the matrix Ws,

and then as is common in these networks,

you have <unknown> pool,

and then at the end you now have

a fully connected layer that then makes a prediction using a softmax.

So that's it for ResNet.

Next, there's a very interesting idea

behind using neural networks with one by one filters,

one by one convolutions.

So, one could use a one by one convolution.

Let's take a look at the next video.