Exploring Neural Networks, p. 3

(see part 2 here)

So eventually I got to analyse how the training mini-batch size affects a network that uses Batch Normalisation. There are several factors at play here:

  • Larger batch sizes are good for normalisation – the more samples we normalise over, the closer the batch mean and variance estimates are to the true statistics. In effect, a large mini-batch size should make the estimates vary less from one mini-batch to the next (as the short sketch after this list illustrates).
  • Smaller batch sizes result in finer-grained, more frequent stochastic gradient descent steps, which may increase learning speed and the final success rate.
  • Large batches are computationally cheaper per sample, thanks to the parallel nature of modern hardware (especially GPUs).
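
To make the first point concrete, here is a minimal sketch (plain NumPy, not the network from this series) showing how the scatter of per-batch statistics shrinks as the mini-batch grows. The data distribution and batch sizes are arbitrary choices for illustration only.

```python
# Illustration: larger mini-batches give steadier normalisation statistics,
# because each batch mean is a less noisy estimate of the true mean.
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the "true" distribution of some activation.
population = rng.normal(loc=2.0, scale=3.0, size=100_000)

for batch_size in (8, 64, 512):
    # Draw many mini-batches and record the per-batch mean estimate.
    batches = rng.choice(population, size=(1000, batch_size))
    batch_means = batches.mean(axis=1)
    print(f"batch size {batch_size:4d}: "
          f"std of batch-mean estimates = {batch_means.std():.3f}")
```

The printed spread drops as the batch size grows, which is exactly the effect Batch Normalisation benefits from.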

Presumably, then, there might be an optimal mini-batch size for Batch Normalisation. To find it, I trained the same network with various mini-batch sizes, observed its performance, averaged the results over multiple runs, and plotted them.
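
The sweep itself is just a loop over batch sizes with a few repeated runs per size. A hedged sketch of that harness is below; `train_and_evaluate` is a hypothetical stand-in for the actual network used in this series, and here it only returns a placeholder value so the loop can be run end to end.

```python
# Sketch of the experiment: sweep several mini-batch sizes, train a fresh
# model for each, repeat a few times, and average the final success rate.
import numpy as np

def train_and_evaluate(batch_size: int, seed: int) -> float:
    """Hypothetical stand-in: train the batch-normalised network with the
    given mini-batch size and return its final success rate (0..1).
    Returns a placeholder value here, not a real result."""
    rng = np.random.default_rng(seed)
    return float(rng.uniform(0.0, 1.0))  # placeholder, not real data

batch_sizes = [8, 16, 32, 64, 128, 256, 512]
runs_per_size = 5
results = {}

for bs in batch_sizes:
    # Average over several runs to smooth out random initialisation noise.
    scores = [train_and_evaluate(bs, seed=run) for run in range(runs_per_size)]
    results[bs] = float(np.mean(scores))

for bs, score in results.items():
    print(f"mini-batch size {bs:4d}: mean success rate {score:.3f}")
```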
