GT DL Lesson 6
Readings
None!
Backwards Pass for Convolution Layer
It is instructive to calculate the backwards pass of a convolution layer, even though in practice automatic differentiation will do it for us.
- Similar to the fully connected layer, it will be a simple vectorized linear algebra operation!
- We will see a duality between cross-correlation and convolution
As a reminder, here is the cross-correlation operation:

$$y(r, c) = \sum_{a=0}^{k_1 - 1} \sum_{b=0}^{k_2 - 1} x(r + a,\, c + b)\, k(a, b)$$

Some simplifications: 1-channel input, 1 kernel (1-channel output), and padding (here 2 pixels on the right/bottom) to make the output the same size as the input.
The output map
- Assume the output is $H \times W$, the same size as the input (we add padding and change the indexing convention a bit for convenience), so that we can refer to the element of the Jacobian for a particular output pixel $(r, c)$.
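To make this concrete, here is a minimal (deliberately loop-based, unoptimized) NumPy sketch of this simplified forward pass; the function name and the assumption that `x` already contains the padding described above are mine:

```python
import numpy as np

def cross_correlate(x, k):
    """Forward pass of the simplified convolution (really cross-correlation) layer.

    Single input channel, single kernel, stride 1. x is assumed to already
    include the right/bottom padding described above, so the output y is
    H x W, the same size as the unpadded input.
    """
    k1, k2 = k.shape
    H, W = x.shape[0] - k1 + 1, x.shape[1] - k2 + 1
    y = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            # y(r, c) = sum_{a, b} x(r + a, c + b) * k(a, b)
            y[r, c] = np.sum(x[r:r + k1, c:c + k2] * k)
    return y
```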
Just like any other layer, we will define the convolution layer as a black box that has an input, an output, and a set of parameters.
Specifically, it has an input $x$ (the $H \times W$ image or feature map), a set of parameters given by the kernel $k$ (of size $k_1 \times k_2$), and an output $y$.
We want to compute the partial derivative of the loss with respect to our input, $\frac{\partial L}{\partial x}$, and with respect to our weights, $\frac{\partial L}{\partial k}$.
If you remember, we use the chain rule to compute this: the partial derivative of the loss with respect to our input is equal to the partial derivative of the loss with respect to our output times the partial derivative of the output with respect to our input. This says that if we know how the loss changes as our output changes in small ways, and we know how our output changes as our input changes in small ways, then we can multiply those contributions and get the ultimate change of the loss with respect to our inputs. Similarly, the partial derivative of the loss with respect to our weights uses the same chain rule.
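In symbols, with $x$ the layer input, $y$ its output, and $k$ the kernel weights (the notation used in the rest of this section):

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial x},
\qquad
\frac{\partial L}{\partial k} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial k}$$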
Gradient for Convolution Layer
In this lesson, we will derive the gradients for the convolution layer, show some interesting characteristics that fall out of the derivation, and see how it can be implemented efficiently, just like any other layer, using linear algebra operations.
What a Kernel Pixel Affects at Output
Let's start with the gradients for the weight update. Specifically, we can do this one pixel at a time: the partial derivative of the loss with respect to a single kernel pixel, $\frac{\partial L}{\partial k(a, b)}$.
So what does this weight affect at the output? If you remember, the forward pass induces a computation graph in which this particular kernel pixel feeds into many output values.
So, which of the pixels in the output does it affect? The answer is: everything! This is because we are striding the kernel across the input image, so this one kernel value gets multiplied against some input pixel at every location of the stride and therefore contributes to every pixel of the output map.
Chain Rule over all Output pixels
Because this one kernel value affects all of the pixels of the output, we need to incorporate all of the upstream gradients: the partial derivative of the loss with respect to $k(a, b)$ must sum contributions from every output pixel $y(r, c)$.
The way we can do this is through the chain rule. Again, if a variable has outgoing arrows in the computation graph to multiple things at the output, then we sum the gradients when we perform the backwards pass: there is a corresponding backwards edge back to the same variable for each of those arrows, and we add the gradients across all of them. And so this is what it looks like:

$$\frac{\partial L}{\partial k(a, b)} = \sum_{r=0}^{H-1} \sum_{c=0}^{W-1} \frac{\partial L}{\partial y(r, c)}\, \frac{\partial y(r, c)}{\partial k(a, b)}$$

So, how do we calculate the local term $\frac{\partial y(r, c)}{\partial k(a, b)}$? You can do this analytically or visually; in this case, let's do it visually. On the bottom, you can see that when the kernel is placed so that it produces output pixel $y(r, c)$, the kernel pixel $k(a, b)$ sits on top of, and is multiplied by, the input pixel $x(r + a,\, c + b)$. That input pixel is exactly the local gradient:

$$\frac{\partial y(r, c)}{\partial k(a, b)} = x(r + a,\, c + b)$$
Gradients and Cross-Correlation
We are looking at each output pixel $y(r, c)$, multiplying the upstream gradient at that pixel by the corresponding input pixel, and summing over all output locations:

$$\frac{\partial L}{\partial k(a, b)} = \sum_{r=0}^{H-1} \sum_{c=0}^{W-1} \frac{\partial L}{\partial y(r, c)}\, x(r + a,\, c + b)$$

This equation is actually a cross-correlation between the upstream gradient and the input, restricted to a $k_1 \times k_2$ output (the size of the kernel).
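As a sanity check on the cross-correlation claim, here is a small NumPy sketch of this weight gradient; the function name is mine, and it assumes `x` is the padded input (of size $(H + k_1 - 1) \times (W + k_2 - 1)$) so that the windows line up:

```python
import numpy as np

def conv_backward_kernel(x, dL_dy, k1, k2):
    """dL/dk for the single-channel, stride-1 cross-correlation layer above.

    This is itself a cross-correlation: slide over the (padded) input,
    multiply by the upstream gradient, and sum, producing a k1 x k2 result.
    """
    H, W = dL_dy.shape
    dL_dk = np.zeros((k1, k2))
    for a in range(k1):
        for b in range(k2):
            # dL/dk(a, b) = sum_{r, c} dL/dy(r, c) * x(r + a, c + b)
            dL_dk[a, b] = np.sum(dL_dy * x[a:a + H, b:b + W])
    return dL_dk
```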
Forward and Backward Duality
In the forward pass, we stride the kernel across the image, and the particular kernel value we care about gets multiplied against a window of input pixels at every location. In the backward pass the very same striding pattern appears, except that now the upstream gradient plays the role of the kernel being strided over the input: forward and backward are the same kind of operation.
What an Input Pixel Affects at Output
Now that we have the gradients with respect to our weights, let's calculate the gradients with respect to our input. The reason we need this is not to update the weights of this layer, but to pass it back to whatever layer came before this one.
Again, let's calculate this one pixel at a time: the partial derivative of the loss with respect to a particular input pixel, $\frac{\partial L}{\partial x(r', c')}$. First, we need to know which output pixels this input pixel affects.
Extents at the Output
We can reason about the four extreme placements of the kernel around this input pixel and ask which parts of the output each placement touches. The reason we care is that we are applying the chain rule, and we need to know which elements of the upstream gradient we should actually use in the chain rule.
The four extreme placements on the input correspond to four extreme points on the output: the input pixel $x(r', c')$ is touched by output pixels ranging from $y(r' - k_1 + 1,\, c' - k_2 + 1)$ (the kernel's bottom-right element over it) to $y(r', c')$ (the kernel's top-left element over it), wherever those indices are valid.
Summing Gradient Contributions
Now that we know all the positions on the output map that are affected, we can compute the chain rule: the partial derivative of the loss with respect to this particular input pixel is the sum of the gradient contributions from all of those output positions.
This shows the particular pixels that are affected on the output. Specifically, when kernel pixel $k(a, b)$ lies over the input pixel $x(r', c')$, the output pixel being produced is $y(r' - a,\, c' - b)$; those are the output pixels whose upstream gradients we need to sum.
Let's derive it analytically this time (as opposed to visually).
Calculating the Gradient
Definition of cross-correlation (we will use the substitution $r = r' - a$, $c = c' - b$ to pick out a particular input element):

$$y(r, c) = \sum_{a=0}^{k_1 - 1} \sum_{b=0}^{k_2 - 1} x(r + a,\, c + b)\, k(a, b)$$

Plug in what we actually want: the local gradient of an output pixel with respect to the input pixel $x(r', c')$.
Then

$$\frac{\partial y(r' - a,\, c' - b)}{\partial x(r', c')} = k(a, b)$$

The reason is that we want the term of the sum whose input index matches $x(r', c')$, i.e. where $r + a = r'$ and $c + b = c'$; only that term survives the differentiation, and its coefficient is the kernel value $k(a, b)$.
Backwards is Convolution
Plugging this into the earlier chain-rule sum over the affected output pixels:

$$\frac{\partial L}{\partial x(r', c')} = \sum_{a=0}^{k_1 - 1} \sum_{b=0}^{k_2 - 1} \frac{\partial L}{\partial y(r' - a,\, c' - b)}\, k(a, b)$$

This is actually just a convolution between the upstream gradient and the kernel. It can be implemented efficiently: rather than performing a convolution, we can perform a cross-correlation with a flipped kernel. So we just implement kernel flipping plus cross-correlation, and all of these operations can be implemented via matrix multiplication.
If we perform a cross-correlation in the forward pass, then the gradient with respect to the input is actually a convolution, which is pretty interesting.
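Here is the matching NumPy sketch for the input gradient, implemented exactly as described: flip the kernel, zero-pad the upstream gradient, and cross-correlate. Again, the function name and the single-channel, stride-1 assumptions are mine:

```python
import numpy as np

def conv_backward_input(dL_dy, k):
    """dL/dx for the single-channel, stride-1 cross-correlation layer above.

    Cross-correlating the zero-padded upstream gradient with the 180-degree
    flipped kernel is the same as the convolution derived above. Assumes
    'same' padding so dL_dy has the same H x W shape as the unpadded input.
    """
    k1, k2 = k.shape
    flipped = k[::-1, ::-1]                                   # flip kernel in both dims
    padded = np.pad(dL_dy, ((k1 - 1, k1 - 1), (k2 - 1, k2 - 1)))
    H, W = dL_dy.shape
    dL_dx = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            # dL/dx(r, c) = sum_{a, b} dL/dy(r - a, c - b) * k(a, b)
            dL_dx[r, c] = np.sum(padded[r:r + k1, c:c + k2] * flipped)
    return dL_dx
```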
Simple Convolutional Neural Networks
Since the outputs of convolution and pooling layers are (multi-channel) images, we can sequence them just like any other layers.
If we optimize the whole stack to reduce some loss, the later layers will hopefully represent more abstract features the deeper we go in the network.
Typically, we will take these last feature maps, feed them through fully connected layers, and eventually feed the result into a loss function (such as cross-entropy).
One interesting aspect of alternating these types of layers is that you get an increasing receptive field for a particular pixel deep inside the network. Again, the receptive field is the set of input pixels in the original image that affect the value of a node or activation deep inside the network. In the depiction here, a particular pixel in the output map of the last convolution layer is affected by a small window around it in the previous layer; but each of the pixels in that small window is affected by some window around it in the layer before that, and you can keep going back over and over.
This will be important later when we start designing interesting convolutional neural network architectures.
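As a rough illustration of how quickly the receptive field grows, here is a small sketch of the standard receptive-field recursion; the helper and the example layer stacks are mine:

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one activation after a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs applied in order; the
    receptive field grows by (k - 1) times the product of all earlier strides.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Three stacked 3x3 stride-1 convolutions already see a 7x7 input window:
print(receptive_field([(3, 1), (3, 1), (3, 1)]))          # 7
# A 2x2 stride-2 pooling layer in between makes it grow much faster:
print(receptive_field([(3, 1), (2, 2), (3, 1), (3, 1)]))  # 12
```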
(A classic example is LeNet; the "Gaussian connections" in its last layer just correspond to a mean squared error loss function, from back when we did not use cross-entropy.)
CNNs like this have existed since the 1980s, and they have been processing bank checks to perform optical character recognition for quite a while now.
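To make the "sequence them just like any other layer" point concrete, here is a minimal LeNet-style stack in PyTorch; the channel counts and layer sizes below are illustrative rather than the exact original architecture:

```python
import torch
import torch.nn as nn

# Alternating convolution + pooling layers, then fully connected layers,
# fed into a cross-entropy loss.
model = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)

x = torch.randn(8, 1, 28, 28)            # a batch of 28x28 grayscale images
logits = model(x)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (8,)))
loss.backward()                          # autograd runs the backward pass derived above
```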
We will now look at other, more advanced architectures.
Advanced Convolutional Networks
As data availability increases, so does the complexity of the problems we tackle, and hence we need more complicated neural networks. An example was the ImageNet competition, where a neural network blew the rest of the competition away.
AlexNet
The first architecture that performed really well is AlexNet.
Here we can see each layer laid out in terms of its dimensionality:
Key aspects:
- ReLU instead of sigmoid or tanh
- Specialized normalization layers
- PCA-based data augmentation
- Dropout
- Ensembling
VGG
VGG is a neural network that was very popular for a while.
Key aspects:
- Repeated application of
- 3x3 conv (stride of 1, padding of 1)
- 2x2 max pooling (stride 2)
- Very large number of parameters (most of them in the fully connected layers)
Most of the memory usage, however, is in the convolution layers, whose activations are large.
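A quick back-of-the-envelope count using standard VGG-16 sizes (224x224 inputs, a 7x7x512 final feature map) shows both effects: activation memory is dominated by the early convolution layers, while parameters are dominated by the fully connected layers. The numbers below are my own illustration, not from the notes:

```python
# Rough VGG-16 style counts per image (weights only / activations only, biases ignored).
conv1_acts    = 224 * 224 * 64       # activations after the first conv block: ~3.2M values
first_fc_acts = 4096                 # activations after the first fully connected layer
conv3x3_mid   = 3 * 3 * 256 * 256    # parameters of a 3x3 conv with 256 in/out channels: ~0.59M
first_fc      = 7 * 7 * 512 * 4096   # parameters of the first fully connected layer: ~102.8M
print(conv1_acts, first_fc_acts, conv3x3_mid, first_fc)
```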
Inception
- Built from blocks that are repeated over and over again to form the network.
These blocks have some interesting aspects:
- They use parallel filters, that is, filters of different sizes applied in parallel,
- so that you can get features at multiple scales.
- The downside is that this increases the computational complexity.
The key idea here is that we want to pick up complex features at multiple scales.
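A sketch of what such a block might look like in PyTorch; the channel counts are made up, and the 1x1 convolutions in front of the larger filters are the trick the original design uses to keep the computational cost manageable:

```python
import torch
import torch.nn as nn

class InceptionStyleBlock(nn.Module):
    """Parallel filters of different sizes, concatenated along the channel axis."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 32, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 32, kernel_size=1), nn.ReLU(),
                                nn.Conv2d(32, 64, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, kernel_size=1), nn.ReLU(),
                                nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        # Every branch preserves the spatial size, so the outputs can be concatenated.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)
```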
ResNet
Key idea: Allow information from a layer to propagate to any future layer (forward)
- Same is true for gradients!
Motivation behind ResNet:
Eventually, as the depth of neural networks increased, the ability to optimize them became the bottleneck: as we increased the depth, we actually obtained higher error. Researchers investigated why that was the case, and it turns out that a simple modification you can make to these architectures is to add a skip, or residual, connection.
That is, you can see here a path that goes from the input x through some set of transformations (some weight layers and, for example, a ReLU), and that transformation is added to the identity. Rather than having to output something completely new given the input, the layer just adds a residual on top of the original input: you optimize the weights so that each layer makes small changes to its input rather than having to produce a completely new representation.
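A minimal sketch of such a residual block in PyTorch (channel counts and the exact placement of the ReLUs are illustrative):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """output = ReLU(x + F(x)): the weight layers only learn a residual on top of x."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.f(x))   # skip connection adds the identity back in
```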
Evolving architecture and AutoML
- Evolutionary learning and reinforcement learning
- Pruning of over-parameterized networks
- Learning of repeated blocks is typical
Transfer Learning & Generalization
Generalization
Many types of errors can happen when training neural networks:
- Optimization error
  - Even if your neural network can perfectly model the world, your optimization algorithm may not be able to find the good weights that model that function.
- Estimation error
  - Even if we do find the best hypothesis, i.e. the set of weights or parameters for our neural network that minimizes the training error, that does not necessarily mean you will be able to generalize to the testing set.
- Modeling error
  - Given a particular neural network architecture, the model that actually represents the real world may not be in its hypothesis space; that is, there may be no set of weights that models the real world (your model vs. reality).
As models get more complicated:
- Modeling error will decrease
  - there is a higher chance of representing the real world with a more complex model
- Estimation error will increase
  - we overfit more and more
- Optimization error will increase
  - the dynamics of our optimization become more difficult to handle
Transfer learning
What if we do not have enough data?
- Step 1: Train on a large-scale dataset.
- Step 2: Take your custom data and initialize the network with the weights trained in Step 1.
- Step 3: Continue to train on the new dataset:
  - Finetune: update all parameters.
  - Freeze the feature layers: update only the last layer's weights (used when you do not have enough data); see the sketch below.
This works extremely well! Features learned for 1000 object classes will also work well for the 1001st, and it generalizes even across tasks (e.g., from classification to object detection).
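A minimal PyTorch sketch of Steps 2 and 3, assuming a reasonably recent torchvision; the choice of ResNet-18 and the 10-class head are just for illustration:

```python
import torch.nn as nn
from torchvision import models

# Step 1's weights: a ResNet-18 pretrained on ImageNet, downloaded by torchvision.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# "Freeze feature layers": turn off gradients for everything...
for p in model.parameters():
    p.requires_grad = False

# ...then replace the final layer for the new task; only its weights will train.
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)

# For full fine-tuning instead, skip the freezing loop and train all parameters
# (typically with a smaller learning rate).
```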
But this works well only with caveats:
- If the source dataset you train on is very different from the target dataset, transfer learning is not as effective.
- If you have enough data for the target domain, pretraining mostly just results in faster convergence.
Effectiveness of more data
Another interesting finding is that we still have not reached a bottleneck in terms of the amount of data, if we do have it: if we continue to add more and more data beyond the millions of examples we already have, performance continues to improve.
On the left, you can see that as we get to 300 million examples, compared to the roughly one million in ImageNet, performance on object detection (as measured by mean average precision on the y-axis) continues to improve. You can see this both when fine-tuning from ImageNet classifier weights and with no fine-tuning; in both cases you get a significant improvement, and the curve is still linear.
On the right, we see an exploration where there is some irreducible error that we eventually reach (again on a log scale), though this is for a particular domain. What is interesting is that different regimes were identified: if you have too little data, it is very difficult to decrease the error at all; then you enter a power-law region where the training set size (on a log scale) continues to linearly improve the error (also on a log scale); and at the end you may enter a regime where you cannot reduce the error further.
Dealing with low labeled situations
Active research is still happening on how far we can push down the amount of labeled data required. Unfortunately, this field has a lot of different settings.