As a final assignment for the Neural Networks course I took part in (University of Wrocław, Institute of Computer Science, winter 2015/2016), I am tasked with designing, implementing and training a neural net that classifies CIFAR-10 images with a reasonable success rate. I am also encouraged to experiment with the network by implementing some of the recent inventions that may, in one way or another, improve its performance. I will be sharing my results and observations in this post and in a few more that will follow over the next two weeks.
The source code I am using for my experiments is available on GitHub. The sources come with a number of utilities that simplify running them on our lab’s computers, which may come in handy if you are a fellow student peeking at my progress. If you are not, you can ignore all files except the ones in the ./project directory.
I will be using Theano to implement the network. I realize that there are various fancy frameworks, like the popular Lasagne, that significantly simplify the process of implementing a neural network, but I chose not to use their help. I will implement the required structures on my own. For the purposes of my network this should not require much effort, and this way I can be sure that at each point of the development I am fully aware of all the internals of my network.
I started by building a very basic network with no interesting tricks inside.
3 channels -> 32 5x5 convolutions -> ReLU -> 2x2 maxpool
-> 64 5x5 convolutions -> ReLU -> 2x2 maxpool
-> 128 5x5 convolutions -> ReLU -> 2x2 maxpool
-> 128 to 512 product -> ReLU -> 512 to 10 product -> SoftMax
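A quick sanity check of the layer sizes above, sketched in plain Python. It rests on assumptions that are not stated in the description: 32x32 input images, unpadded ("valid") stride-1 convolutions, and pooling that keeps partial windows (so a 1x1 map stays 1x1). The helper names are made up for illustration. Under these assumptions the last pooling stage leaves 128 values per image, which matches the "128 to 512 product".

```python
import math


def conv_valid(size, kernel):
    """Output width of an unpadded, stride-1 convolution (assumed mode)."""
    return size - kernel + 1


def maxpool(size, window):
    """Output width of non-overlapping pooling that keeps partial windows."""
    return math.ceil(size / window)


def trace_shapes(size=32, channels=(32, 64, 128), kernel=5, window=2):
    """Return (channels, height, width) after each conv -> ReLU -> pool stage."""
    shapes = []
    for c in channels:
        size = maxpool(conv_valid(size, kernel), window)
        shapes.append((c, size, size))
    return shapes


shapes = trace_shapes()
for stage, shape in enumerate(shapes, start=1):
    print("after stage", stage, shape)

# Number of values per image that reach the first fully connected layer.
c, h, w = shapes[-1]
flat_features = c * h * w
print("inputs to the first dense layer:", flat_features)
```

With padded convolutions the flattened size would come out different, so the count of 128 is at least consistent with the unpadded variant.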
It did not take me much time to tune the learning parameters (especially weight decay) so that this network was able to reach a 76% success rate on the test set. So that was a nice starting point. The network would slightly overfit, so it felt like there was some room for improvement.
A technique that I found particularly interesting is Batch Normalisation.
The general idea behind the proposed trick is as follows. With stochastic gradient descent, the network is trained by processing mini-batches of many samples at once. At some point between two particular layers, we can use the information from an entire mini-batch to estimate the distribution of values at each neuron: its mean and its standard deviation across the samples in that mini-batch. We can then use these estimates to shift and scale all the values of that neuron in the entire mini-batch so that they have a mean of 0 and a variance of 1.
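The normalisation step described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the Theano code from my project. The function name, the `axes` parameter and the small constant `eps` (added to avoid division by zero) are assumptions of the sketch, and it leaves out the further details from the paper, such as the learned scale and shift applied afterwards.

```python
import numpy as np


def batch_normalise(x, axes=(0,), eps=1e-5):
    """Shift and scale x to zero mean and unit variance over `axes`.

    For dense activations of shape (batch, features) use axes=(0,), so every
    neuron is normalised over the mini-batch. For feature maps of shape
    (batch, channels, height, width), axes=(0, 2, 3) normalises each channel
    over the batch and all spatial positions.
    """
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    # eps is a hypothetical small constant that guards against var == 0.
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)

# A mini-batch of 128 samples with 512 badly scaled activations each.
acts = rng.normal(loc=3.0, scale=5.0, size=(128, 512))
normed = batch_normalise(acts)
print("mean of first neurons:", normed.mean(axis=0)[:3])
print("variance of first neurons:", normed.var(axis=0)[:3])

# The same function applied per channel to convolutional feature maps.
fmaps = rng.normal(loc=-2.0, scale=0.5, size=(128, 32, 14, 14))
normed_maps = batch_normalise(fmaps, axes=(0, 2, 3))
print("per-channel mean:", normed_maps.mean(axis=(0, 2, 3))[:3])
```

Whatever the scale of the incoming values, each neuron (or channel) comes out with a mean close to 0 and a variance close to 1.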
Why is this useful? Suppose that the layer right after the point of insertion is some non-linear activation function, like tanh or sigmoid. These functions are usually interesting only in a small neighbourhood of 0. At very large positive or negative values they saturate and no longer differentiate between inputs, so they fail to serve their purpose. Furthermore, their gradient at such values tends to be very small, so the network requires much more time to train if the inputs of such a nonlinearity rarely come close to zero. Batch Normalisation counteracts that phenomenon by forcibly shifting and scaling the values so that they end up around zero.
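To make the saturation argument concrete, here is a small sketch that evaluates the derivatives of sigmoid and tanh at a few points using the standard closed forms. The helper names and the sample points are arbitrary choices for the illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


# Gradients shrink rapidly as the input moves away from zero.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  tanh'={tanh_grad(x):.2e}")
```

Both derivatives peak at zero (0.25 for the sigmoid, 1 for tanh) and are several orders of magnitude smaller at an input of 10, which is why inputs that stay far from zero slow the training down.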
There are several more details to this idea; they are very well explained in the Batch Normalisation paper (Ioffe and Szegedy, 2015).
I found this idea interesting because the paper’s authors claim that implementing Batch Normalisation solves all kinds of problems. The paper states that after adding Batch Normalisation to a network:
- Learning rate for the network can be significantly increased without any side-effects.
- The initial values for network parameters can be chosen much less carefully, as their scale is effectively ignored at normalisation step.
- Dropout layers become unnecessary, as they solve the same problem as Batch Normalisation.
- There is no longer a need to choose ReLUs over sigmoids, as the flat regions of the sigmoid are not a problem any more.
- Weight decay becomes much less important, because Batch Normalisation, as an apparent side-effect, regularises the network.
- The network can train over 10 times faster.
So, apparently, adding Batch Normalisation is the perfect magical recipe for reinforcing any network! As it turned out, it was also pretty easy to implement (as I was already using a modular approach to building a network trained using SGD).
For comparison with the previous network, I created a similar one that applies Batch Normalisation before each ReLU.
3 channels -> 32 5x5 convolutions -> BN -> ReLU -> 2x2 maxpool
-> 64 5x5 convolutions -> BN -> ReLU -> 2x2 maxpool
-> 128 5x5 convolutions -> BN -> ReLU -> 2x2 maxpool
-> 128 to 512 product -> BN -> ReLU -> 512 to 10 product -> SoftMax
The result? Not very satisfying. The network is struggling to climb above a 70% success rate on the validation set. I’ve been experimenting with the training parameters (as well as the network architecture) for two days already, and I am unable to reach the 76% score that was achieved without Batch Normalisation.
Furthermore, the networks are massively overfitting, easily reaching a 97% success rate on the training set while staying at a 60-70% success rate on the validation set. Another interesting observation is that if I choose sigmoids instead of ReLUs, the network fails to learn anything at all.
This is clearly against my expectations. I am not yet sure what could be the cause, but I will keep investigating. I hope that there is a bug in my implementation, and fixing it will magically cure the network, but realistically the problem is probably more complex than that.