So eventually I got to analyse how the training mini-batch size affects a network that uses Batch Normalisation. There are several factors at play here:
- Larger batch sizes are good for normalisation – the more samples we normalise over, the closer the batch statistics are to those of the whole training set. In effect, a large mini-batch size should make the estimates vary less from one mini-batch to the next.
- Smaller batch sizes result in more, finer-grained stochastic gradient descent steps per epoch, which may increase learning speed and the final success rate.
- Processing large batches is computationally cheaper per sample, thanks to the parallel nature of modern hardware (especially GPUs).
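To make the first point concrete, here is a minimal NumPy sketch of the batch-normalisation transform (not the actual network code from these experiments): the per-feature mean and variance are estimated over the mini-batch, so a larger batch gives tighter estimates.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise a mini-batch of activations over the batch dimension.

    x has shape (batch_size, features); the larger batch_size is, the
    closer mu and var get to the true population statistics.
    """
    mu = x.mean(axis=0)            # per-feature mean over the mini-batch
    var = x.var(axis=0)            # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta    # learned scale and shift

# The statistics estimated from a batch tighten as the batch grows:
rng = np.random.default_rng(0)
for batch_size in (10, 100, 1000):
    batch = rng.normal(loc=2.0, scale=3.0, size=(batch_size, 1))
    print(batch_size, float(batch.mean()), float(batch.var()))
```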
Presumably, then, there is an optimal mini-batch size for Batch Normalisation. To find it, I trained the same network using various mini-batch sizes, observed its performance, averaged the results of multiple runs, and plotted them.
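The sweep itself can be sketched as follows. This is a simplified outline, not my actual experiment script; `train_once` stands in for whatever function trains the network once at a given batch size and reports the two measures discussed below.

```python
import numpy as np

BATCH_SIZES = [10, 20, 50, 100, 150, 200, 500, 1000]  # illustrative grid
N_RUNS = 5                                            # runs averaged per size

def run_experiment(train_once):
    """train_once(batch_size) -> (epochs_to_target, seconds_per_epoch).

    Returns the per-batch-size averages over N_RUNS independent runs.
    """
    results = {}
    for bs in BATCH_SIZES:
        runs = [train_once(bs) for _ in range(N_RUNS)]
        results[bs] = tuple(np.mean(runs, axis=0))
    return results
```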
Please note that all plots that follow use a logarithmic horizontal scale.
The first measure of performance, plotted above, is the number of epochs (repetitions of the entire training set) required – on average – for the network to reach a 75% success rate on the validation set. This result confirms the intuition that very large mini-batch sizes make training less precise. It takes twice as many epochs to train the network with mini-batches of size 1000 as with samples in groups of 50.
However, the wall-clock time required to process a single epoch varies with batch size too. I’ve been training the network on a GTX 780, so large batches are preferred as they exploit hardware parallelism. It seems that at batch size 150 the GPU saturates, and further increasing the batch size does not yield much improvement. At that point an epoch takes half as long as with mini-batches as small as 10 samples per batch.
The above plot presents the actually interesting final result. By multiplying the two previous measures, we get the actual wall-clock time needed to train the network, as a combined result of all factors. This is probably the most useful measure of training speed. The result suggests that for the fastest training the mini-batch size should be neither too large nor too small, but as long as it is between 50 and 200, the exact size does not matter much. Apparently there is no single optimal batch size for use with Batch Normalisation; the range of sizes that behave similarly is quite wide.
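Combining the two measures is just an element-wise product. The numbers below are made up for illustration (chosen only to mirror the ratios described above), not my measured values:

```python
# Hypothetical per-batch-size measures, shaped like the plots above:
# twice as many epochs at 1000 as at 50, and epoch time halving from
# batch size 10 until the GPU saturates around 150.
epochs_to_75 = {10: 9.0, 50: 6.0, 150: 8.0, 1000: 12.0}       # epochs to 75% validation accuracy
seconds_per_epoch = {10: 40.0, 50: 28.0, 150: 20.0, 1000: 20.0}

# Total wall-clock training time is the product of the two measures.
wall_clock = {bs: epochs_to_75[bs] * seconds_per_epoch[bs]
              for bs in epochs_to_75}
```

With numbers like these, the mid-range sizes end up closely matched in total time, which is exactly the flat region seen in the plot.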
Potentially the batch size could also affect the success rate, so that training speed would be traded off against accuracy. Initially I planned to create a similar plot displaying the relation between batch size and final test success rate, but it turned out to be unnecessary. For all batch sizes the accuracy stayed within 1.5 percentage points of 78.5%, and I found no relationship between mini-batch size and success rate.