Dimensions of a Block and a Grid


From what I have understood by playing around with CUDA, it seems that one can have 3-dimensional thread IDs, with the thread block defined as

dim3 threadBlock(8, 8, 8);

The grid of blocks for it can be defined as
dim3 kernelBlockGrid(1, 1, 1);

If I change the last dimension of kernelBlockGrid to anything other than 1, I get the following error:

cufft: ERROR: D:/Bld/rel/gpgpu/toolkit/r2.0/cufft/src/execute.cu, line 1038
cufft: ERROR: D:/Bld/rel/gpgpu/toolkit/r2.0/cufft/src/execute.cu, line 297
cufft: ERROR: D:/Bld/rel/gpgpu/toolkit/r2.0/cufft/src/cufft.cu, line 119

As far as I understand, when I have something like

myKernel<<<kernelBlockGrid, threadBlock>>> …

in my program, kernelBlockGrid defines the dimensions of the grid and threadBlock defines the layout of the threads in a block. If the maximum number of threads per block is 512, how does having maximum block dimensions of 512 x 512 x 64 help? A scenario in which this is useful might help me understand it a bit better.



According to the Programming Guide (Sec. 2.2.2, pg. 8 of the CUDA 2.0 manual), grids can only be 2D, not 3D. So you aren’t allowed to set the last component of kernelBlockGrid to anything but 1, despite the fact that it is a dim3 variable. (Maybe someday CUDA hardware will support 3D grid dimensions.)

Thanks for your help. I had found that the grid dimensions were 2D. I was wondering if there was any way to work around it. I am experimenting with setting Y-dim = Y-dim * Z-dim and then decoding Y-dim in the kernel to 'extract' a z-dim from it. I have threadIdx.z on the GPU, I just need some way to get a z-id for the grid. I am trying to do a 3D operation on a 3D matrix :| Please let me know if you have a better way of doing it. Thanks.
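The Y-dim * Z-dim packing described above can be sketched as plain index arithmetic. This is only an illustration: `packedGridY`, `unpackBlock` and `BlockCoord` are hypothetical names, and the code is written as host-side C++ so it can be checked without a GPU. In a kernel the same two lines would be applied to blockIdx.y. On hardware of this generation each grid dimension is limited to 65535, so the packed Y extent has to stay within that.

```cpp
#include <cassert>

// Hypothetical helper type: a logical 3D block coordinate.
struct BlockCoord { unsigned x, y, z; };

// Host side: launch the grid as (gridX, gridY * gridZ, 1).
inline unsigned packedGridY(unsigned gridY, unsigned gridZ) {
    return gridY * gridZ;
}

// Kernel side: recover the logical (y, z) from the packed blockIdx.y.
// gridY is the logical Y extent and must be passed to the kernel.
inline BlockCoord unpackBlock(unsigned blockIdxX, unsigned blockIdxY,
                              unsigned gridY) {
    assert(gridY > 0);
    BlockCoord c;
    c.x = blockIdxX;
    c.y = blockIdxY % gridY;  // position inside one z-slice
    c.z = blockIdxY / gridY;  // which z-slice this block belongs to
    return c;
}
```

For example, with a logical grid of 4 x 3 in (y, z), the packed blockIdx.y value 7 decodes to y = 3, z = 1.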

What I am still a bit confused about is the block dimensions. I understand that one can have 512 threads per block and that we specify the grid dimensions and the number of threads per block when invoking the kernel. Then what is the benefit of having max block dimensions of 512 x 512 x 64? :|


The product threadBlock.x * threadBlock.y * threadBlock.z needs to be <= 512, with the additional constraints that threadBlock.x <= 512, threadBlock.y <= 512 and threadBlock.z <= 64.

So you can have a block like (512,1,1) or (1,512,1) or (2,4,64).
A block of (1,1,512) is not valid, because its z dimension exceeds 64.
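The rule above can be written down as a small check. This is a sketch using the limits quoted in this thread (512, 512, 64 per dimension and 512 threads in total); `isValidBlock` and `checkThreadExamples` are hypothetical names, and on other hardware the limits should be read from the device properties rather than hard-coded.

```cpp
#include <cassert>

// Limits as quoted in this thread for CUDA 2.0 era hardware (assumption:
// newer devices have different limits).
inline bool isValidBlock(unsigned x, unsigned y, unsigned z) {
    const unsigned maxX = 512, maxY = 512, maxZ = 64, maxThreads = 512;
    if (x == 0 || y == 0 || z == 0) return false;          // empty block
    if (x > maxX || y > maxY || z > maxZ) return false;    // per-dimension cap
    return x * y * z <= maxThreads;                        // total thread cap
}

// Usage: the examples given in the thread.
inline void checkThreadExamples() {
    assert(isValidBlock(512, 1, 1));
    assert(isValidBlock(1, 512, 1));
    assert(isValidBlock(2, 4, 64));
    assert(!isValidBlock(1, 1, 512));  // z exceeds 64
}
```

So the per-dimension maxima only say how far a block may stretch along one axis; the 512-thread total still applies to every shape.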

Thanks. That helps a lot.

Any suggestion on how I could perform operations on a 3D array? I’m trying to cast it as a 2D problem but for some reason in the kernel, the x, y, z indices into the 3D texture are incorrect. Thanks.



You can try the following… it is not sexy, but it works

tx ranges over dimension_x * dimension_y

ty ranges over dimension_z

y = tx / dimension_x;   (integer division)

x = tx - y*dimension_x;

z = ty;

Now use x, y, z for your texture fetch from a linear array ([x + y*dimension_x…])

or, you could wait for someone to give you a less naive answer…
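A minimal sketch of the flattening suggested above, written as host-side C++ so it can be checked on the CPU. It assumes x is the fastest-varying index, so that tx = x + y*dimX, and that the volume is stored slice after slice; `Index3`, `decode` and `linearOffset` are hypothetical names, not anything from the SDK.

```cpp
#include <cassert>

// Hypothetical helper type: a 3D index into the volume.
struct Index3 { int x, y, z; };

// tx runs over dimX * dimY (x and y packed together), ty runs over dimZ.
inline Index3 decode(int tx, int ty, int dimX) {
    assert(dimX > 0);
    Index3 p;
    p.y = tx / dimX;          // integer division gives the row
    p.x = tx - p.y * dimX;    // remainder gives the column
    p.z = ty;                 // z is carried directly by ty
    return p;
}

// Offset into a linear array, assuming x varies fastest, then y, then z.
inline int linearOffset(Index3 p, int dimX, int dimY) {
    return p.x + p.y * dimX + p.z * dimX * dimY;
}
```

With dimX = 4 and dimY = 3, the pair (tx, ty) = (9, 2) decodes to (x, y, z) = (1, 2, 2). Note that the divisor is the x extent; dividing by the y extent gives the same result only when the two are equal.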

Thanks for that. I had something similar in ty (instead of tx as per your suggestion). I tried your suggestion too and it clearly iterates the x, y, z indices correctly.

My grid and thread block setup is as follows (just in case I'm doing something wrong):

dim3 threadBlockData(8 * 8, 8, 1);

// This is using the convolution2D demo.

dim3 dataBlockGrid(iDivUp(FFT_W , threadBlockData.x ), iDivUp(FFT_D, threadBlockData.y), 1);

padData<<<dataBlockGrid, threadBlockData>>> ..
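For readers without the SDK sample at hand: the grid size above relies on a round-up division. The sketch below assumes that is what iDivUp does; `ceilDiv` and `inBounds` are hypothetical stand-ins, not the SDK functions. Because the grid is rounded up, some threads in the last blocks fall outside the data, so a kernel launched this way typically needs a bounds check.

```cpp
#include <cassert>

// Round-up integer division (assumed behaviour of the sample's iDivUp).
inline int ceilDiv(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Guard a kernel would apply before touching element (x, y) of a
// width x height array, since the rounded-up grid can overshoot.
inline bool inBounds(int x, int y, int width, int height) {
    return x >= 0 && y >= 0 && x < width && y < height;
}
```

For instance, a width of 100 with 64 threads in x needs ceilDiv(100, 64) = 2 blocks, and the 28 surplus threads in the second block should skip their work.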

So I am actually trying to do a 3D convolution on a 3D array. I feel like I'm close, since some answers are correct, but most look like they're off by a bit. I am comparing the GPU output with that of the CPU convolution. I too would have thought that neat indexing with the 3D cufft would do the trick. Any suggestions? :| Thanks.



Hi NVIDIA,

I was testing different configurations in CUDA before settling on one. My program's output confirms that the number of threads per block must be less than or equal to 512, but some configurations that look valid do not work.

For example:

Config: By=1, Bx=1, Ty=512, Tx=1 : Not Valid

Config: By=1, Bx=1, Ty=32, Tx=1 : Not Valid

Config: By=1, Bx=1, Ty=1, Tx=1 : Not Valid

Config: By=1, Bx=1, Ty=1, Tx=2 : Blocks=1 Threads=2 Valid

Config: By=1, Bx=1, Ty=1, Tx=4 : Blocks=1 Threads=4 Valid

Config: By=1, Bx=1, Ty=1, Tx=8 : Blocks=1 Threads=8 Valid

Config: By=1, Bx=1, Ty=1, Tx=16 : Blocks=1 Threads=16 Valid

Config: By=1, Bx=1, Ty=1, Tx=32 : Blocks=1 Threads=32 Valid

Config: By=1, Bx=1, Ty=1, Tx=64 : Blocks=1 Threads=64 Valid

Config: By=1, Bx=1, Ty=1, Tx=128 : Blocks=1 Threads=128 Valid

Config: By=1, Bx=1, Ty=1, Tx=256 : Blocks=1 Threads=256 Valid

Config: By=1, Bx=1, Ty=1, Tx=512 : Blocks=1 Threads=512 Valid

(where Tz = 1 in all cases)

Either my test program is giving incorrect results or there is some other technical reason for this. (My test program file is attached.)

Kindly help me with your comments.


configTest.txt (4.3 KB)