CUDA motivation for multi-dimensional kernel execution

The division of work into blocks that CUDA imposes is logical, because it reflects the hardware (some number of execution threads within a single execution unit, all in the same “block”).

However, as I’m looking at implementing image processing algorithms, it’s not entirely clear why I should have 2D grids of blocks, each being a 2D grid of threads. Why wouldn’t 1D do? After all, the kernel usually just sees the image as a linear 1D array of pixels anyway and has to compute its global index with the usual row * width + column arithmetic.

One guess I have is spatial locality. We usually compute stuff for a pixel based on the pixels around it, so the 2D grid of threads makes sure that all adjacent pixels run within the same execution unit and can thus share data through shared memory, etc. Is this correct? Anything else I am missing? Maybe ease of programming somehow (although that’s hard to believe, since the code is computing a 1D offset anyway)?

Thanks in advance

Two reasons. Up to compute capability 2.x, the grid size limit in the x dimension was 65535 blocks. This is easy to exceed, and then one needs a 2D grid, which allows millions of blocks. In cc 3.x the x-dimension limit was raised to 2^31 - 1 (about two billion). The second reason is that for some problems you need to work with the indices of a multidimensional matrix.
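
For example, a sketch of the kind of launch I mean (the 4096x4096 size, the kernel name, and img are made up):

// One thread per pixel for a 4096 x 4096 image = 16,777,216 threads.
// With 256-thread blocks that is 65,536 blocks, which already exceeds
// the old 65535-block limit in x for a 1D grid on cc < 3.0 hardware.
dim3 block(16, 16);                         // 256 threads per block
dim3 grid((4096 + block.x - 1) / block.x,   // 256 blocks in x
          (4096 + block.y - 1) / block.y);  // 256 blocks in y
myKernel<<<grid, block>>>(img, 4096, 4096); // fine: only 256 blocks per dimension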

But is it not true to say that most image processing kernels receive the image as a single-dimensional array anyway (after all, this is how it lives in computer memory)? In other words, the kernel parameter is something like

uchar4* img

Then usually, if the kernel is executed in 2D space, the global indices are used to compute an offset into that 1D image anyway, so I don’t see how this makes coding easier.
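
Concretely, something like this (a sketch; invert is just a made-up example operation, and width/height are assumed to be passed in):

__global__ void invert(uchar4* img, int width, int height)
{
    // 2D launch, but the actual addressing is still 1D
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;  // the usual row * width + column
    uchar4 p = img[idx];
    img[idx] = make_uchar4(255 - p.x, 255 - p.y, 255 - p.z, p.w);
}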

I can’t come up with a really compelling motivation, other than that it probably seemed like a good idea when the transition from fixed-function hardware to GPGPU was happening. Much of the focus was on image processing and linear algebra, which need to manipulate 2D coordinates even if memory is fundamentally 1D. Maybe someone thought that exposing the tiled nature of the algorithm to the hardware would enable some future improved hardware implementation?

To my knowledge, there are no real performance benefits to using multidimensional grids in CUDA.

Consider an image processing algorithm. Some of these may be interested in lines, but most are interested in rectangular areas of the image (say, filters). If I make my threads 2D in a block, then all threads handling a certain rectangle end up in the same block and can share data through shared memory efficiently, right?

Whereas if I organized threads 1D in a block, then only threads handling a line (rather than a rectangular area) would be able to share data.

Is my thinking wrong here?

There are lots of situations where you need a block of threads to operate on a 2D/3D block of pixels. It could be that your data is stored in a CUDA array/surface, or you might be caching the data in shared memory. This typically requires 2D/3D indices to be computed at some point. What I’m not certain of is whether there is actually dedicated hardware for computing the thread indices, or whether using a 1D thread block and computing the 2D/3D indices inside the kernel amounts to the same thing.
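
For illustration, the two forms I have in mind (a sketch; the kernel names and the 16x16 tile size are made up):

#define TILE_W 16
#define TILE_H 16

// Variant A: 2D thread block, launched as dim3 block(TILE_W, TILE_H).
__global__ void tile2d(const float* data)
{
    int tx = threadIdx.x;           // tile indices straight from the built-ins
    int ty = threadIdx.y;
    // ... use (tx, ty) to address a shared-memory tile ...
}

// Variant B: 1D thread block, launched as dim3 block(TILE_W * TILE_H).
__global__ void tile1d(const float* data)
{
    int tx = threadIdx.x % TILE_W;  // same tile indices, computed by hand
    int ty = threadIdx.x / TILE_W;
    // ... identical from here on ...
}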

I’ve also come across situations where a single kernel might want to use two different sets of 2D/3D indices. For instance, with convolutions you sometimes need an apron, so you have to read more data into shared memory than you end up producing as output. So sometimes you start with one set of 2D/3D indices, compute a 1D index, and then compute a new set of 2D/3D indices.
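
A sketch of that pattern, assuming a made-up blur kernel with a 2-pixel apron and clamp-to-edge border handling:

#define TILE_W  16
#define TILE_H  16
#define RADIUS   2                      // filter radius (made up)
#define SMEM_W (TILE_W + 2 * RADIUS)    // tile plus apron on each side
#define SMEM_H (TILE_H + 2 * RADIUS)

__global__ void blur(const float* in, float* out, int width, int height)
{
    __shared__ float smem[SMEM_H][SMEM_W];

    // First set of 2D indices: one thread per output pixel.
    int outX = blockIdx.x * TILE_W + threadIdx.x;
    int outY = blockIdx.y * TILE_H + threadIdx.y;

    // Flatten to a 1D index within the block...
    int tid = threadIdx.y * TILE_W + threadIdx.x;

    // ...then derive a second set of 2D indices so TILE_W * TILE_H
    // threads cooperatively fill the larger SMEM_W * SMEM_H apron region.
    for (int i = tid; i < SMEM_W * SMEM_H; i += TILE_W * TILE_H) {
        int sx = i % SMEM_W;
        int sy = i / SMEM_W;
        int gx = (int)(blockIdx.x * TILE_W) + sx - RADIUS;
        int gy = (int)(blockIdx.y * TILE_H) + sy - RADIUS;
        gx = min(max(gx, 0), width - 1);   // clamp: replicate border pixels
        gy = min(max(gy, 0), height - 1);
        smem[sy][sx] = in[gy * width + gx];
    }
    __syncthreads();

    if (outX < width && outY < height) {
        // ... a real filter would read the RADIUS-neighborhood in smem;
        // here we just copy the center value through.
        out[outY * width + outX] = smem[threadIdx.y + RADIUS][threadIdx.x + RADIUS];
    }
}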

I think there are two different questions mixed together in this thread:

  1. Is there a performance benefit to using a 2D or 3D mapping of input or output data to threads? The answer is “absolutely” for all the reasons you and others have described. If the data or calculation has spatial locality, then so should the assignment of work to threads in a warp.

  2. Is there a performance benefit to using CUDA’s multidimensional grids to do this work assignment? In this case, I don’t think so, since you can do the index calculation trivially yourself at the top of the kernel (see the sketch after this list). This burns a few arithmetic instructions, but that should be negligible compared to the kernel launch overhead.
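
For instance, a fully 1D launch can recover the 2D coordinates itself (a sketch; the kernel name is made up):

__global__ void process(uchar4* img, int width, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // 1D global index
    if (idx >= width * height) return;

    int x = idx % width;  // the same 2D coordinates a 2D launch would
    int y = idx / width;  // have handed us in blockIdx/threadIdx

    // ... operate on pixel (x, y) == img[idx] ...
}

Modulo the old 65535-block limit mentioned earlier in the thread, this is equivalent to the 2D launch; the division and modulo are the only extra cost.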

This is why I think the multidimensional grids are intended as a programmer convenience rather than a way to improve performance. You do absolutely need to think about each warp’s memory access patterns, though.
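
To illustrate that last point with a made-up fragment: with row-major storage, which built-in index you map to which image axis decides whether a warp’s loads coalesce.

__global__ void access(const uchar4* img, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Coalesced: threadIdx.x varies fastest within a warp and maps to x,
    // the fastest-varying axis in memory, so the 32 threads of a warp
    // read 32 adjacent pixels in one (or a few) transactions.
    uchar4 good = img[y * width + x];

    // Strided: swap the mapping and consecutive threads of a warp read
    // pixels a whole row apart, splitting the access into many transactions.
    uchar4 bad = img[x * width + y];

    (void)good; (void)bad;  // sketch only; bounds checks omitted
}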