Once I had fixed all the bugs in my Batch Normalisation implementation and fine-tuned the parameters, I started getting reasonable results. In particular, it turned out that I needed to increase the weight decay coefficient significantly (more than tenfold). I also had to modify the learning rate schedule so that it decays much faster. This makes sense, because Batch Normalisation is supposed to speed up learning. Eventually, the network:
3 channels -> 64 3x3 convolutions -> 3x3 maxpool -> BN -> ReLU -> 128 3x3 convolutions -> 2x2 maxpool -> BN -> ReLU -> 1024 to 1024 product -> BN -> ReLU -> 1024 to 512 product -> BN -> Sigmoid -> 512 to 10 product -> SoftMax
achieved a 79% success rate on the test set.
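For illustration, the architecture above could be sketched in PyTorch along these lines. This is my reconstruction, not the actual implementation: the 32x32 input resolution, the convolution padding, and the pooling strides are assumptions, and the first fully connected layer is built with LazyLinear because the flattened size depends on those choices (the text states it is 1024).

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the described network. Padding, pooling
# strides and the input resolution are assumptions, not taken from
# the original source code.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.MaxPool2d(kernel_size=3),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.MaxPool2d(kernel_size=2),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(1024),   # flattened size inferred on first forward
    nn.BatchNorm1d(1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
    nn.BatchNorm1d(512),
    nn.Sigmoid(),
    nn.Linear(512, 10),
    nn.Softmax(dim=1),
)

out = model(torch.randn(4, 3, 32, 32))
print(out.shape)  # torch.Size([4, 10])
```

Note that in a real training loop one would normally drop the final Softmax and use a cross-entropy loss on the logits instead; it is kept here only to mirror the description above.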
I was interested in the advantage of using BN. To investigate it, I created another network: an identical clone of the one described above, but with no Batch Normalisation performed at all. Comparing the results of these two networks should express the gain introduced by using BN.
Obviously, I had to re-tune the learning parameters for the new network. This included decreasing the weight decay and decreasing the learning rate decay. The best set of parameters I found got the network to a 74% success rate on the test set, somewhat lower than the same network with Batch Normalisation. You can find the exact parameter values in my publicly available source code.
The interesting part was to plot the learning history for both networks on the same chart.
The red lines correspond to the network with Batch Normalisation enabled. The blue lines correspond to the same network without normalisation. The horizontal axis represents the training epoch, i.e. the number of passes over the entire training set (in random order) that the network has been trained on. Solid lines show error rates on the validation set; dotted lines show training batch error rates. To smooth them out a bit, I filtered the batch error rates with a moving average of order 1000.
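The smoothing step can be expressed in a couple of lines of numpy (a sketch of the idea; the actual filtering code may differ):

```python
import numpy as np

# Smooth a noisy series of per-batch error rates with a simple
# moving average; window=1000 matches the "order 1000" above.
def moving_average(x, window=1000):
    kernel = np.ones(window) / window
    # 'valid' drops the first window-1 points, where the window
    # would still be incomplete
    return np.convolve(x, kernel, mode='valid')

errors = np.random.rand(5000)   # fake batch error history
smooth = moving_average(errors)
print(len(smooth))              # 4001
```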
An immediate observation is that Batch Normalisation significantly speeds up the training process, i.e. less data has to be fed to the network in order to train it. The reason for this is that normalisation reduces internal covariate shift, which would otherwise saturate the non-linearities in the network, shrinking gradient values and slowing down learning.
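The core of the Batch Normalisation forward pass fits in a few lines of numpy. This is a minimal sketch: statistics are taken over the current mini-batch, gamma and beta stand for the learned scale and shift, and the eps value is an assumption.

```python
import numpy as np

# Minimal Batch Normalisation forward pass: normalise each feature
# over the mini-batch, then apply the learned scale (gamma) and
# shift (beta). eps guards against division by zero.
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# A batch whose activations have drifted far from zero:
x = 50.0 + 3.0 * np.random.randn(128, 16)
y = batch_norm(x, gamma=1.0, beta=0.0)
print(y.mean(), y.std())   # ~0 and ~1: the shift is gone
```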
Another observation is that the first network reached noticeably better results on the validation set. With Batch Normalisation enabled, each ReLU in the network is activated by, on average, half of the samples. When normalisation is not performed and a strong covariate shift is present, some ReLUs will only ever see large (uniformly positive or uniformly negative) values on their input, and thus they will process each sample identically. Such ReLUs fail to introduce any non-linearity, and therefore become useless to the network. Enabling normalisation before the non-linearities ensures that each of them stays useful.
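The effect is easy to demonstrate numerically. In this toy illustration, a strongly shifted input either turns the ReLU into the identity or kills it outright, while a roughly centred input activates it on about half of the samples:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

z = np.random.randn(10000)

# Roughly centred inputs activate the ReLU on about half of the
# samples, so it acts non-linearly across the batch:
print((z > 0).mean())                          # close to 0.5

# Under a strong positive shift every input is positive, so the
# ReLU reduces to the identity and adds no non-linearity:
shifted = z + 10.0
print(np.array_equal(relu(shifted), shifted))  # True

# Under a strong negative shift the ReLU outputs zero for every
# sample and the unit is effectively dead:
print(np.count_nonzero(relu(z - 10.0)))        # 0
```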
I also find it interesting that the training batch error rate with Batch Normalisation enabled clearly oscillates with a frequency strictly correlated to the training set size. I currently have no idea what might cause this, but it seems that understanding this phenomenon could lead to further improvements.
What is important in the above comparison is that the chart is aligned by epoch number. However, the second network (blue, BN disabled) obviously performs fewer calculations, as it does not have to estimate means and deviations at every other layer. This makes it computationally cheaper, so it will appear faster when training speed is measured in wall-clock time. Below is an alternative chart, where the horizontal axis represents the time spent training (in seconds).
With Batch Normalisation enabled, each epoch takes much more time, as more computations are required. Taking that into account, the advantage of the BN-enabled network is not as large as it first appeared. It still shows a clearly lower error rate at any given point in time, so it still outdoes the other model, but the wall-clock training times are similar.
The result presented above may be slightly skewed, though. For the network with Batch Normalisation enabled I used the optimal batch size (determined experimentally), which is 10. The other network was trained in batches of 150 samples. Since I am performing the computations on a GPU, there is a significant overhead in pre-processing each batch for the GPU, which becomes a big cost when the batch is small. That is probably what costs the most in the case of the BN-enabled network.
A logical next step will be to see how much I can improve training speed by slightly increasing the batch size. I expect this to slightly decrease the final success rate of the network, but it should significantly reduce training time, so that Batch Normalisation beats the other network even when measured by a real-time clock.