What is the point of N in NCHW for CNNs

I am struggling to understand the logic behind the design that cuDNN uses when computing CNNs (convolutional neural networks). For a convolutional layer, cuDNN takes the data in the NCHW (or NHWC) tensor format and then performs a convolution with a KCRS filter tensor, producing an NKPQ tensor. Internally it performs a GEMM between the (RSC)×K filter matrix and an (NPQ)×(RSC) matrix built using im2col, resulting in an (NPQ)×K matrix which is then transposed to get an NKPQ tensor (or something similar for NHWC).

I don't see the advantage of this method over simply computing, N times, the convolution of a CHW tensor with a KCRS tensor, resulting in N instances of a KPQ tensor. One could run a single input through the entire neural network, store the partial weight updates in VRAM, and average them after N run-throughs, instead of sending an entire mini-batch of N inputs through the network at the same time. This would produce exactly the same weight updates in the same run time, but would require N times less VRAM for the activations and gradients.

I tried benchmarking the cuDNN 3×3 convolution on a GTX 1050 for a number of different values of H, W, C, and K, and the run time for a given value of N was always exactly N times the run time for N=1. So it seems to me that using any setting except N=1 just wastes memory while having exactly the same runtime speed as running N times with N=1. The only situation I found where a higher N would be useful is when H and W are really small, smaller than the usual GEMM tile (H*W < 128), but in practice that is never the case. I tried the benchmark for explicit GEMM, implicit GEMM, and Winograd, and the result is always the same.

I would be grateful if somebody more knowledgeable about NNs could explain the purpose of using N > 1 in NCHW convolution.
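For reference, the im2col-plus-GEMM formulation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not how cuDNN is implemented: the function name and variable names are my own, and it assumes stride 1 and no padding.

```python
import numpy as np

def conv_as_gemm(x, w):
    """Convolution via im2col + GEMM, stride 1, no padding.
    x: input in NCHW layout, shape (N, C, H, W)
    w: filters in KCRS layout, shape (K, C, R, S)
    Returns output in NKPQ layout, shape (N, K, P, Q).
    """
    N, C, H, W = x.shape
    K, _, R, S = w.shape
    P, Q = H - R + 1, W - S + 1

    # im2col: one row per (n, p, q) output position,
    # one column per (c, r, s) filter tap
    cols = np.empty((N * P * Q, C * R * S))
    row = 0
    for n in range(N):
        for p in range(P):
            for q in range(Q):
                cols[row] = x[n, :, p:p + R, q:q + S].ravel()
                row += 1

    # GEMM: (N*P*Q, CRS) @ (CRS, K) -> (N*P*Q, K)
    out = cols @ w.reshape(K, -1).T
    # reorder the (N*P*Q, K) result into NKPQ
    return out.reshape(N, P, Q, K).transpose(0, 3, 1, 2)
```

For example, `conv_as_gemm(x, w)` with `x` of shape `(2, 3, 5, 5)` and `w` of shape `(4, 3, 3, 3)` returns a tensor of shape `(2, 4, 3, 3)`.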

You’re right: if you run the convolution with batch size N = 1 and repeat it N times, you get the same result as using batch size N. And yes, it uses less memory.

So why do we still use batch size > 1?

  1. **Speed on big GPUs**
    GPUs are designed to work on a lot of data at once. With a single image (N=1), the work often cannot fill all of the GPU's streaming multiprocessors, so part of the chip sits idle; with N > 1 there is enough parallel work to keep it busy. A GTX 1050 is a small GPU, which is likely why your benchmark scaled perfectly linearly even at N=1; on larger GPUs the underutilization at N=1 is much more visible.

  2. **Training is faster with batches**
    When training, it's faster to process 16 images together than to run them one by one 16 times: there is less kernel-launch and framework overhead per image, and the gradients are averaged automatically.

  3. **Better use of GPU memory and cache**
    With a batch, the same filter weights are applied to many images, so they can be loaded once and reused from cache instead of being re-fetched from memory for every image.

  4. **It's the standard way in deep learning**
    Most deep learning models and frameworks are designed around batches, because it's efficient and simple.

But you’re not wrong! If your GPU has very little memory, you can use N=1 and accumulate the gradients manually. This is called gradient accumulation, and it works well when needed.
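The equivalence that gradient accumulation relies on can be checked directly. Here is a minimal NumPy sketch, using a hypothetical linear model with mean-squared-error loss: averaging N per-sample gradients gives exactly the mini-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 5
X = rng.standard_normal((N, D))   # mini-batch of N inputs
y = rng.standard_normal(N)        # targets
w = rng.standard_normal(D)        # weights of a linear model

def grad(xb, yb, w):
    """Gradient of 0.5 * mean((xb @ w - yb)**2) with respect to w."""
    return xb.T @ (xb @ w - yb) / len(yb)

# one backward pass over the full mini-batch
g_batch = grad(X, y, w)

# gradient accumulation: N passes with batch size 1, then average
g_accum = np.zeros(D)
for n in range(N):
    g_accum += grad(X[n:n + 1], y[n:n + 1], w)
g_accum /= N

assert np.allclose(g_batch, g_accum)
```

This is exactly the trade-off described above: the accumulating loop needs only one sample's activations in memory at a time, at the cost of running the forward/backward passes N times.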