CUDA motivation for multi-dimensional kernel execution

The division of work into blocks that CUDA imposes is logical, because it reflects the hardware (some number of execution threads within a single execution unit, all in the same “block”).

However, as I’m looking at implementing image processing algorithms, it’s not entirely clear why I should have 2D grids of blocks, each being a 2D grid of threads. Why wouldn’t 1D do? After all, the kernel usually just sees the image as a linear 1D array of pixels anyway and has to compute its global index with the usual row * width + column arithmetic.

One guess I have is spatial locality. We usually compute stuff for a pixel based on the pixels around it, so the 2D grid of threads makes sure that all adjacent pixels run within the same execution unit and can thus share data through shared memory, etc. Is this correct? Anything else I am missing? Maybe ease of programming somehow (although that’s hard to believe, since the code is computing a 1D offset anyway)?

Thanks in advance

Two reasons. Up to compute capability 2.x, the grid size limit in the x dimension was 65535 blocks. This is easy to exceed, and then one needs a 2D grid, which allows millions of blocks. In cc 3.x the x-dimension limit was raised to 2^31 - 1 (about two billion). The second reason is that for some problems you need to work with the indices of a multidimensional matrix.
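
For example, a sketch of the kind of launch I mean (the 4096x4096 size, the kernel name, and img are made up):

// One thread per pixel for a 4096 x 4096 image = 16,777,216 threads.
// With 256-thread blocks that is 65,536 blocks, which already exceeds
// the old 65535-block limit in x for a 1D grid on cc < 3.0 hardware.
dim3 block(16, 16);                         // 256 threads per block
dim3 grid((4096 + block.x - 1) / block.x,   // 256 blocks in x
          (4096 + block.y - 1) / block.y);  // 256 blocks in y
myKernel<<<grid, block>>>(img, 4096, 4096); // fine: only 256 blocks per dimension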

But is it not true to say that most image processing kernels receive the image as a single-dimensional array anyway (after all, this is how it lives in computer memory)? In other words, the kernel parameter is something like

uchar4* img

Then usually, if the kernel is executed in 2D space, the global indices are used to compute an offset into that 1D image anyway, so I don’t see how this makes coding easier.
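
Concretely, something like this (a sketch; invert is just a made-up example operation, and width/height are assumed to be passed in):

__global__ void invert(uchar4* img, int width, int height)
{
    // 2D launch, but the actual addressing is still 1D
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;  // the usual row * width + column
    uchar4 p = img[idx];
    img[idx] = make_uchar4(255 - p.x, 255 - p.y, 255 - p.z, p.w);
}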

I can’t come up with a really compelling motivation, other than that it probably seemed like a good idea when the transition from fixed-function hardware to GPGPU was happening. Much of the focus was on image processing and linear algebra, which need to manipulate 2D coordinates even if memory is fundamentally 1D. Maybe someone thought that exposing the tiled nature of the algorithm to the hardware would enable some future improved hardware implementation?

To my knowledge, there are no real performance benefits to using multidimensional grids in CUDA.

Consider an image processing algorithm. Some of these may be interested in lines, but most are interested in rectangular areas of the image (say, filters). If I make my threads 2D in a block, then all threads handling a certain rectangle end up in the same block and can share data through shared memory efficiently, right?

Whereas if I organized threads 1D in a block, then only threads handling a line (rather than a rectangular area) would be able to share data.

Is my thinking wrong here?

There are lots of situations where you need a block of threads to operate on a 2D/3D block of pixels. It could be that your data is stored in a CUDA array/surface, or you might be caching the data in shared memory. This typically requires 2D/3D indices to be computed at some point. What I’m not certain of is whether there is actually dedicated hardware for computing the thread indices, or whether using a 1D thread block and computing the 2D/3D indices inside the kernel amounts to the same thing.
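
For illustration, the two forms I have in mind (a sketch; the kernel names and the 16x16 tile size are made up):

#define TILE_W 16
#define TILE_H 16

// Variant A: 2D thread block, launched as dim3 block(TILE_W, TILE_H).
__global__ void tile2d(const float* data)
{
    int tx = threadIdx.x;           // tile indices straight from the built-ins
    int ty = threadIdx.y;
    // ... use (tx, ty) to address a shared-memory tile ...
}

// Variant B: 1D thread block, launched as dim3 block(TILE_W * TILE_H).
__global__ void tile1d(const float* data)
{
    int tx = threadIdx.x % TILE_W;  // same tile indices, computed by hand
    int ty = threadIdx.x / TILE_W;
    // ... identical from here on ...
}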

I’ve also come across situations where a single kernel might want to use two different sets of 2D/3D indices. For instance, with convolutions you sometimes need an apron, so you have to read more data into shared memory than you end up producing as output. So sometimes you start with one set of 2D/3D indices, compute a 1D index, and then compute a new set of 2D/3D indices.
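
A sketch of that pattern, assuming a made-up blur kernel with a 2-pixel apron and clamp-to-edge border handling:

#define TILE_W  16
#define TILE_H  16
#define RADIUS   2                      // filter radius (made up)
#define SMEM_W (TILE_W + 2 * RADIUS)    // tile plus apron on each side
#define SMEM_H (TILE_H + 2 * RADIUS)

__global__ void blur(const float* in, float* out, int width, int height)
{
    __shared__ float smem[SMEM_H][SMEM_W];

    // First set of 2D indices: one thread per output pixel.
    int outX = blockIdx.x * TILE_W + threadIdx.x;
    int outY = blockIdx.y * TILE_H + threadIdx.y;

    // Flatten to a 1D index within the block...
    int tid = threadIdx.y * TILE_W + threadIdx.x;

    // ...then derive a second set of 2D indices so TILE_W * TILE_H
    // threads cooperatively fill the larger SMEM_W * SMEM_H apron region.
    for (int i = tid; i < SMEM_W * SMEM_H; i += TILE_W * TILE_H) {
        int sx = i % SMEM_W;
        int sy = i / SMEM_W;
        int gx = (int)(blockIdx.x * TILE_W) + sx - RADIUS;
        int gy = (int)(blockIdx.y * TILE_H) + sy - RADIUS;
        gx = min(max(gx, 0), width - 1);   // clamp: replicate border pixels
        gy = min(max(gy, 0), height - 1);
        smem[sy][sx] = in[gy * width + gx];
    }
    __syncthreads();

    if (outX < width && outY < height) {
        // ... a real filter would read the RADIUS-neighborhood in smem;
        // here we just copy the center value through.
        out[outY * width + outX] = smem[threadIdx.y + RADIUS][threadIdx.x + RADIUS];
    }
}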

I think there are two different questions mixed together in this thread:

  1. Is there a performance benefit to using a 2D or 3D mapping of input or output data to threads? The answer is “absolutely” for all the reasons you and others have described. If the data or calculation has spatial locality, then so should the assignment of work to threads in a warp.

  2. Is there a performance benefit to using CUDA’s multidimensional grids to do this work assignment? In this case, I don’t think so, since you can do the index calculation trivially yourself at the top of the kernel (see the sketch after this list). This burns a few arithmetic instructions, but that should be negligible compared to the kernel launch overhead.
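
For instance, a fully 1D launch can recover the 2D coordinates itself (a sketch; the kernel name is made up):

__global__ void process(uchar4* img, int width, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // 1D global index
    if (idx >= width * height) return;

    int x = idx % width;  // the same 2D coordinates a 2D launch would
    int y = idx / width;  // have handed us in blockIdx/threadIdx

    // ... operate on pixel (x, y) == img[idx] ...
}

Modulo the old 65535-block limit mentioned earlier in the thread, this is equivalent to the 2D launch; the division and modulo are the only extra cost.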

This is why I think the multidimensional grids are intended as a programmer convenience rather than a way to improve performance. You do absolutely need to think about each warp’s memory access patterns, though.
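
To illustrate that last point with a made-up fragment: with row-major storage, which built-in index you map to which image axis decides whether a warp’s loads coalesce.

__global__ void access(const uchar4* img, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Coalesced: threadIdx.x varies fastest within a warp and maps to x,
    // the fastest-varying axis in memory, so the 32 threads of a warp
    // read 32 adjacent pixels in one (or a few) transactions.
    uchar4 good = img[y * width + x];

    // Strided: swap the mapping and consecutive threads of a warp read
    // pixels a whole row apart, splitting the access into many transactions.
    uchar4 bad = img[x * width + y];

    (void)good; (void)bad;  // sketch only; bounds checks omitted
}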