Why and when to use dim3 grid instead of just a 1d blocks and grids?

I am trying to understand why there exists 3 dimensions for cuda blocks and grids and also when to use higher dimensions.

If I have an Image of 2d pixels and I flatten it to run on the kernel -

cudaMemcpy( dev_img, h_img, num_row * num_col * sizeof(float), cudaMemcpyHostToDevice)

I can run my kernel with
num_of_grids = (num_row * num_col + 512 - 1)/512
myKernel<<< num_of_grids, 512>>>(…)

So I can just modify my flattened array in my kernel with img[gid]

what is the advantage of doing
dim3 gridSize(num_of_grids, 1)
dim3 blockSize(16, 16)
myKernel<<< gridSize, blockSize>>>(…)

and modify the flattened array with
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int gid = num_col * y + x;
img[gid]

or any other 2d allocation rather than the previous method.

I don’t understand why we don’t just work with 1d array both in host and on the gpu since we flatten? when should we run kernel with 2d blocks and 2d grids rather than a 1d block and 1d grid?

Thank you

Hi obidarefolu,

first: For some manipulations 2D coordinates (vs. flattened 1D) is important. E.g. you want to average/smooth your image with a 3x3 filter. Then to find the upper and lower pixel, you have to go up or down by one row.

On the other hand, you can create your own n-dimensional coordinates from your 1D index within the kernel.

int idx = blockIdx.x * blockDim.x + threadIdx.x;

int z = idx / (lenx * leny);
int y = (idx - z * lenx * leny) / lenx;
int x = idx - z * lenx * leny - y * lenx;

Mostly being able to use dim3 for block and grid size is just convenience and a small bit of performance improvement as divisions can be slow.

Within kernels you also can have different code sections with different usage of the index:

section 1: Read 2D data into shared memory.
section 2: Do 5D calculations (including reading 2D data, reading more data in 5D, and dimension reduction by adding up into 3D shared memory).
section 3: Write back 3D data.

Those sections typically are separated by block-wide synchronization calls.

Your kernels can also use for loops to give you even more flexibility.

For highly optimized code, those freedoms help to have good memory access patterns, good usage of the caches and maximal occupancy of the compute units.

Some of my kernels have very complex expressions to create indices with bitmasks, shifts and combining threadIdx, blockIdx and loop variables. And the index would sometimes change, whether I access global memory or shared memory. A different kernel accessing the same global data may use a different indexing scheme.

So there is no single right or wrong way to index your data. There are many right, sub-optimal or wrong ways. First try to create a correct kernel with the usual addressing and then learn, how to refine and optimize.

BTW: In the end all multi-dimensional memory accesses are 1D.

1 Like

Have a look here:

https://www.cs.rochester.edu/~sree/fermi-tbs/fermi-tbs.html

Note the red serpentine path in one of the figures.

“Further, for 2-D grids, it [the Thread Block Scheduler] picks blocks in a pattern resembling well-known space-filling curves, perhaps to preserve 2-D locality.”

So a 2-D grid may use the cache more efficiently than a 1-D grid.

That’s Fermi. Pretty old. I don’t know how reliable it is for a modern architecture. I ran a slightly modified version of this on a H100 and I don’t see evidence of a serpentine curve.

$ cat t8.cu
#include <stdio.h>


__device__ unsigned long long globaltime(void)
{
    unsigned long long time;
    asm("mov.u64  %0, %%globaltimer;" : "=l"(time));
    return time;
}

__device__ unsigned int smid(void)
{
    unsigned int sm;
    asm("mov.u32  %0, %%smid;" : "=r"(sm));
    return sm;
}

__global__ void logkernel(void)
{
    unsigned long long t = globaltime();
    printf("Started block %2u,%2u on SM %2u at %llu.\n", blockIdx.x, blockIdx.y, smid(), t);
}


int main(void)
{
    logkernel<<<dim3(8,8),1>>>();
    cudaDeviceSynchronize();

    return 0;}
$ nvcc -o t8 t8.cu -arch=sm_90
$ ./t8
Started block  1, 7 on SM 14 at 1717272791063635040.
Started block  5, 7 on SM 76 at 1717272791063635040.
Started block  4, 3 on SM  6 at 1717272791063635072.
Started block  4, 7 on SM 62 at 1717272791063635040.
Started block  5, 3 on SM 22 at 1717272791063635072.
Started block  2, 7 on SM 30 at 1717272791063635072.
Started block  7, 3 on SM 54 at 1717272791063635072.
Started block  7, 7 on SM 104 at 1717272791063635040.
Started block  0, 4 on SM 68 at 1717272791063635072.
Started block  6, 7 on SM 90 at 1717272791063635072.
Started block  2, 4 on SM 96 at 1717272791063635040.
Started block  3, 4 on SM 110 at 1717272791063635072.
Started block  3, 7 on SM 46 at 1717272791063635072.
Started block  1, 4 on SM 82 at 1717272791063635072.
Started block  6, 3 on SM 38 at 1717272791063635072.
Started block  1, 6 on SM 100 at 1717272791063635104.
Started block  2, 5 on SM 98 at 1717272791063635104.
Started block  6, 5 on SM 58 at 1717272791063635104.
Started block  7, 4 on SM 56 at 1717272791063635104.
Started block  3, 5 on SM 10 at 1717272791063635104.
Started block  4, 4 on SM  8 at 1717272791063635104.
Started block  0, 5 on SM 70 at 1717272791063635104.
Started block  5, 4 on SM 24 at 1717272791063635104.
Started block  7, 5 on SM 72 at 1717272791063635104.
Started block  2, 2 on SM 92 at 1717272791063635040.
Started block  7, 1 on SM 50 at 1717272791063635040.
Started block  4, 1 on SM  2 at 1717272791063635040.
Started block  4, 5 on SM 26 at 1717272791063635104.
Started block  1, 5 on SM 84 at 1717272791063635104.
Started block  0, 6 on SM 86 at 1717272791063635104.
Started block  3, 2 on SM 106 at 1717272791063635040.
Started block  0, 2 on SM 64 at 1717272791063635072.
Started block  5, 1 on SM 18 at 1717272791063635072.
Started block  6, 4 on SM 40 at 1717272791063635104.
Started block  5, 5 on SM 42 at 1717272791063635104.
Started block  2, 3 on SM 94 at 1717272791063635040.
Started block  7, 2 on SM 52 at 1717272791063635072.
Started block  4, 2 on SM  4 at 1717272791063635072.
Started block  3, 3 on SM 108 at 1717272791063635072.
Started block  0, 3 on SM 66 at 1717272791063635072.
Started block  5, 2 on SM 20 at 1717272791063635072.
Started block  1, 2 on SM 78 at 1717272791063635072.
Started block  6, 1 on SM 34 at 1717272791063635072.
Started block  1, 3 on SM 80 at 1717272791063635072.
Started block  6, 2 on SM 36 at 1717272791063635072.
Started block  0, 0 on SM 128 at 1717272791063635104.
Started block  1, 0 on SM 129 at 1717272791063635104.
Started block  3, 1 on SM 48 at 1717272791063635104.
Started block  0, 1 on SM  0 at 1717272791063635104.
Started block  2, 0 on SM 130 at 1717272791063635104.
Started block  3, 0 on SM 131 at 1717272791063635104.
Started block  4, 0 on SM 124 at 1717272791063635104.
Started block  5, 0 on SM 125 at 1717272791063635104.
Started block  0, 7 on SM 102 at 1717272791063635200.
Started block  2, 6 on SM 12 at 1717272791063635232.
Started block  5, 6 on SM 60 at 1717272791063635200.
Started block  6, 6 on SM 74 at 1717272791063635104.
Started block  1, 1 on SM 16 at 1717272791063635104.
Started block  3, 6 on SM 28 at 1717272791063635104.
Started block  6, 0 on SM 126 at 1717272791063635104.
Started block  7, 0 on SM 127 at 1717272791063635104.
Started block  2, 1 on SM 32 at 1717272791063635104.
Started block  7, 6 on SM 88 at 1717272791063635104.
Started block  4, 6 on SM 44 at 1717272791063635104.

That deposit pattern looks like this (x numbered horizontal, y vertical):

| |0|1|2|3|4|5|6|7|
|0|2|2|2|2|2|2|2|2|
|1|2|2|2|2|0|1|1|0|
|2|1|1|0|0|1|1|1|1|
|3|1|1|0|1|1|1|1|1|
|4|1|1|0|1|2|2|2|2|
|5|2|2|2|2|2|2|2|2|
|6|2|2|4|2|2|3|2|2|
|7|3|0|1|1|0|0|1|0|

There might be a space filling curve, but its not obvious to me and I don’t see anything that looks serpentine. And if I were strictly going for cache locality, I don’t think I would choose that pattern (I think I would just choose a linear pattern in x rapidly varying, y slowly varying, as is suggested/observed here). Its not even obvious that cache utilization would be a primary concern for block deposit order. A CUDA programmer can guarantee fairly good cache utilization by design at the threadblock level, without any assistance from the machine.

I don’t know what the order constitutes, but if there were something I would be interested in, that I have little control over as a programmer, it would be the order in which DRAM banks are opened up and loaded with requests. I’m not saying that is what it represents.

Thanks for that link and for disabusing me of a notion I acquired some time ago.