Exploring Neural Networks, p. 3

(see part 2 here)

So eventually I got around to analysing how the training mini-batch size affects a network that uses Batch Normalisation. There are several factors in play here:

  • A larger batch size is good for normalisation – the more samples we normalise over, the closer the batch estimates of the mean and variance are to the true statistics. In effect, a large mini-batch size should cause the estimates to vary less from one mini-batch to the next (see the sketch after this list).
  • A smaller batch size gives more frequent, finer-grained stochastic gradient descent updates, which may increase learning speed and the final success rate.
  • It is computationally cheaper to process large batches, because of the parallel nature of modern hardware (especially GPUs).
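
To make the first point concrete, below is a minimal NumPy sketch of Batch Normalisation over a single mini-batch (an illustration only, not the code used in these experiments). The mean and variance are estimated from the current mini-batch alone, so the larger the batch, the less these estimates fluctuate from batch to batch.

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        # x has shape (batch_size, features); the statistics are computed
        # over the mini-batch dimension only.
        mu = x.mean(axis=0)                  # per-feature batch mean
        var = x.var(axis=0)                  # per-feature batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)
        return gamma * x_hat + beta          # learned scale and shift

    # The batch-mean estimate fluctuates roughly as 1/sqrt(batch_size):
    for batch_size in (10, 100, 1000):
        means = [np.random.randn(batch_size).mean() for _ in range(1000)]
        print(batch_size, np.std(means))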

Supposedly there might be an optimal mini-batch size for Batch Normalisation. To find it, I tested the same network with various mini-batch sizes, observed its performance, averaged the results from multiple runs, and plotted them.
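
For reference, the sweep looked roughly like the sketch below. The helpers build_network, train_one_epoch and validation_accuracy are hypothetical stand-ins for my actual training code, and the batch sizes and number of runs shown are assumptions for illustration.

    import time
    import numpy as np

    BATCH_SIZES = [10, 20, 50, 100, 150, 200, 500, 1000]   # assumed sweep values
    RUNS_PER_SIZE = 5                                       # assumed number of runs

    results = {}
    for batch_size in BATCH_SIZES:
        epochs_needed, epoch_times = [], []
        for run in range(RUNS_PER_SIZE):
            net = build_network()                           # hypothetical helper
            epochs, start = 0, time.time()
            while validation_accuracy(net) < 0.75:          # hypothetical helper
                train_one_epoch(net, batch_size)            # hypothetical helper
                epochs += 1
            epochs_needed.append(epochs)
            epoch_times.append((time.time() - start) / epochs)
        # average over runs: (epochs to 75%, wall-clock time per epoch)
        results[batch_size] = (np.mean(epochs_needed), np.mean(epoch_times))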

Please note that all plots that follow use a logarithmic horizontal scale.

[Plot: epochs_to_75 – average number of epochs needed to reach a 75% validation success rate vs. mini-batch size]

The first measure of performance, plotted above, is the number of epochs (passes over the entire training set) required – on average – for the network to reach a 75% success rate on the validation set. This result confirms the intuition that very large mini-batch sizes make training less precise: it takes twice as many epochs to train the network with mini-batches of size 1000 as with samples in groups of 50.

[Plot: time_per_epoch – wall-clock time per training epoch vs. mini-batch size]

However, the wall-clock time required to process a single epoch also varies with batch size. I trained the network on a GTX 780, so large batches are preferred because they exploit hardware parallelism. At a batch size of around 150 the GPU appears to saturate, and a further increase in batch size does not yield much improvement. At that point an epoch takes half as long as it does with mini-batches of only 10 samples.

[Plot: time_to_75 – wall-clock time needed to reach a 75% validation success rate vs. mini-batch size]

The plot above presents the final, and most interesting, result. By multiplying the two previous measures, we get the actual wall-clock time needed to train the network, combining all the factors above. This is probably the most useful measure of training speed. The result suggests that for the fastest training the mini-batch size should be neither too large nor too small, but as long as it stays between 50 and 200, the exact size does not matter much. Apparently there is no single optimal batch size for use with Batch Normalisation; the range of sizes that behave similarly is quite wide.
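
In code, the combined measure is just the element-wise product of the two averaged quantities from the sweep sketch above, plotted on a logarithmic horizontal axis (matplotlib's semilogx). Again, this is a sketch rather than my exact plotting script:

    import matplotlib.pyplot as plt

    batch_sizes = sorted(results)                # results dict from the sweep sketch
    epochs_to_75 = [results[b][0] for b in batch_sizes]
    time_per_epoch = [results[b][1] for b in batch_sizes]
    time_to_75 = [e * t for e, t in zip(epochs_to_75, time_per_epoch)]

    plt.semilogx(batch_sizes, time_to_75)        # logarithmic horizontal scale
    plt.xlabel("mini-batch size")
    plt.ylabel("wall-clock time to 75% validation accuracy")
    plt.show()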

Potentially the batch size could also affect the success rate, so that training speed would be traded off against accuracy. Initially I planned to create a similar plot showing the relation between batch size and the final test success rate, but it turned out to be unnecessary: for all batch sizes the accuracy stayed within a range of about 1.5% around 78.5%, and I found no relationship between mini-batch size and success rate.
