What is the point of N in NCHW for CNNs

I am struggling to understand the logic behind the design that cuDNN uses when computing CNNs (convolutional neural networks). For a convolutional layer, cuDNN takes the data in the NCHW (or NHWC) tensor format and then performs a convolution with a KCRS filter tensor, producing an NKPQ tensor. Internally it performs a GEMM between the (RSC)×K filter matrix and an (NPQ)×(RSC) matrix built using im2col, resulting in an (NPQ)×K matrix which is then transposed to get an NKPQ tensor (or something similar for NHWC).

I don't see the advantage of this method over simply computing, N times, the convolution of a CHW tensor with a KCRS tensor, resulting in N instances of a KPQ tensor. One could run a single input through the entire neural network, store the partial weight updates in VRAM, and average them after N run-throughs, instead of sending an entire mini-batch of N inputs through the network at the same time. This would produce exactly the same weight updates in the same run time, but would require N times less VRAM for the activations and gradients.

I tried benchmarking the cuDNN 3×3 convolution on a GTX 1050 for a number of different values of H, W, C, and K, and the run time for a given value of N was always exactly N times the run time for N=1. So it seems to me that using any setting except N=1 just wastes memory while having exactly the same runtime speed as running N times with N=1. The only situation I found where a higher N would be useful is when H and W are really small, smaller than the usual GEMM tile (H*W < 128), but in practice that is never the case. I tried the benchmark for explicit GEMM, implicit GEMM, and Winograd, and the result is always the same.

I would be grateful if somebody more knowledgeable about NNs could explain the purpose of using N > 1 in NCHW convolution.
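For reference, the im2col-plus-GEMM formulation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not how cuDNN is implemented: the function name and variable names are my own, and it assumes stride 1 and no padding.

```python
import numpy as np

def conv_as_gemm(x, w):
    """Convolution via im2col + GEMM, stride 1, no padding.
    x: input in NCHW layout, shape (N, C, H, W)
    w: filters in KCRS layout, shape (K, C, R, S)
    Returns output in NKPQ layout, shape (N, K, P, Q).
    """
    N, C, H, W = x.shape
    K, _, R, S = w.shape
    P, Q = H - R + 1, W - S + 1

    # im2col: one row per (n, p, q) output position,
    # one column per (c, r, s) filter tap
    cols = np.empty((N * P * Q, C * R * S))
    row = 0
    for n in range(N):
        for p in range(P):
            for q in range(Q):
                cols[row] = x[n, :, p:p + R, q:q + S].ravel()
                row += 1

    # GEMM: (N*P*Q, CRS) @ (CRS, K) -> (N*P*Q, K)
    out = cols @ w.reshape(K, -1).T
    # reorder the (N*P*Q, K) result into NKPQ
    return out.reshape(N, P, Q, K).transpose(0, 3, 1, 2)
```

For example, `conv_as_gemm(x, w)` with `x` of shape `(2, 3, 5, 5)` and `w` of shape `(4, 3, 3, 3)` returns a tensor of shape `(2, 4, 3, 3)`.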

You’re right: if you run the convolution with batch size N = 1 and repeat it N times, you get the same result as using batch size N. And yes, it uses less memory.

So why do we still use batch size > 1?

  1. **Speed on big GPUs**
    GPUs are designed to work on a lot of data at once. With a single image (N=1), the work often cannot fill all of the GPU's streaming multiprocessors, so part of the chip sits idle; with N > 1 there is enough parallel work to keep it busy. A GTX 1050 is a small GPU, which is likely why your benchmark scaled perfectly linearly even at N=1; on larger GPUs the underutilization at N=1 is much more visible.

  2. **Training is faster with batches**
    When training, it's faster to process 16 images together than to run them one by one 16 times: there is less kernel-launch and framework overhead per image, and the gradients are averaged automatically.

  3. **Better use of GPU memory and cache**
    With a batch, the same filter weights are applied to many images, so they can be loaded once and reused from cache instead of being re-fetched from memory for every image.

  4. **It's the standard way in deep learning**
    Most deep learning models and frameworks are designed around batches, because it's efficient and simple.

But you’re not wrong! If your GPU has very little memory, you can use N=1 and accumulate the gradients manually. This is called gradient accumulation, and it works well when needed.
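The equivalence that gradient accumulation relies on can be checked directly. Here is a minimal NumPy sketch, using a hypothetical linear model with mean-squared-error loss: averaging N per-sample gradients gives exactly the mini-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 5
X = rng.standard_normal((N, D))   # mini-batch of N inputs
y = rng.standard_normal(N)        # targets
w = rng.standard_normal(D)        # weights of a linear model

def grad(xb, yb, w):
    """Gradient of 0.5 * mean((xb @ w - yb)**2) with respect to w."""
    return xb.T @ (xb @ w - yb) / len(yb)

# one backward pass over the full mini-batch
g_batch = grad(X, y, w)

# gradient accumulation: N passes with batch size 1, then average
g_accum = np.zeros(D)
for n in range(N):
    g_accum += grad(X[n:n + 1], y[n:n + 1], w)
g_accum /= N

assert np.allclose(g_batch, g_accum)
```

This is exactly the trade-off described above: the accumulating loop needs only one sample's activations in memory at a time, at the cost of running the forward/backward passes N times.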