Once I fixed all my bugs in Batch Normalisation implementation and fine-tuned all parameters, I started getting reasonable results. In particular, it turned out that I needed to significantly (more than 10 times) increase weight decay ratio constant. I also had to modify learning rate scheduling so that it decays much faster, this makes sense, because Batch Normalisation is supposed to speed up learning. Eventually, the network:
3 channels -> 64 3x3 convolutions -> 3x3 maxpool -> BN -> ReLU -> 128 3x3 convolutions -> 2x2 maxpool -> BN -> ReLU -> 1024 to 1024 product -> BN -> ReLU -> 1024 to 512 product -> BN -> Sigmoid -> 512 to 10 product -> SoftMax
has achieved 79% success rate on the test set.
I was interested in the advantage of using BN. To investigate it, I created another network, which is an identical clone of the one described above, but no Batch Normalisations are performed at all. Comparing the results of these two networks should express the gain introduced by using BN.