So eventually I got around to analysing how the training mini-batch size affects a network that uses Batch Normalisation. There are several factors in play here:
- A larger batch size is good for normalisation – the more samples we normalise over, the closer the batch mean and variance are to the true statistics of the activations. In effect, a large mini-batch size should make these estimates vary less between mini-batches (see the sketch after this list).
- A smaller batch size yields more frequent (though noisier) stochastic gradient descent steps per epoch, which may increase learning speed and the final success rate.
- It is computationally cheaper to process large batches, because of the parallel nature of modern hardware (especially GPUs).
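
To illustrate the first point, here is a minimal NumPy sketch (the distribution parameters are made up for illustration): it draws "activations" from a fixed Gaussian and measures how much the per-batch mean – one of the statistics Batch Normalisation estimates – fluctuates at different mini-batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "activations": a fixed Gaussian with made-up parameters.
population = rng.normal(loc=2.0, scale=1.5, size=100_000)

for batch_size in (4, 16, 64, 256):
    # Split into mini-batches and compute the per-batch mean, i.e. the
    # statistic Batch Normalisation would estimate on each mini-batch.
    usable = (population.size // batch_size) * batch_size
    batch_means = population[:usable].reshape(-1, batch_size).mean(axis=1)
    print(f"batch size {batch_size:>3}: std of batch means = {batch_means.std():.4f}")
```

The spread of the batch means shrinks roughly with the square root of the batch size, which is exactly the first factor above.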
Supposedly, there might be an optimal mini-batch size for Batch Normalisation. To find it, I tested the same network with various mini-batch sizes, observed its performance, averaged the results over multiple runs, and plotted them.
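
A minimal sketch of that procedure might look like the following; `train_once` is a hypothetical placeholder for the actual training routine (not reproduced here), and the batch sizes and run count are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def train_once(batch_size: int, seed: int) -> float:
    """Hypothetical placeholder: train the batch-normalised network once
    with the given mini-batch size and return its final success rate."""
    return 0.0  # dummy value so the sketch runs; substitute real training

BATCH_SIZES = [4, 8, 16, 32, 64, 128, 256]  # illustrative values
RUNS_PER_SIZE = 5                           # illustrative value

means, stds = [], []
for batch_size in BATCH_SIZES:
    scores = [train_once(batch_size, seed) for seed in range(RUNS_PER_SIZE)]
    means.append(np.mean(scores))
    stds.append(np.std(scores))

# Plot the mean success rate per batch size, with error bars over the runs.
plt.errorbar(BATCH_SIZES, means, yerr=stds, marker="o")
plt.xscale("log", base=2)
plt.xlabel("mini-batch size")
plt.ylabel("success rate")
plt.show()
```

Averaging over several seeds per batch size matters here, since single runs differ enough to hide the trend.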