Grid-Block-Thread Configuration

Hi guys!

I still get confused by the [grid, block, thread] configuration on the kernel (amazing!).
Max ‘x’ dimension values = [2^31-1, 1024, 1024] respectively for compute capability 3.0 (e.g. GeForce GTX 680).

It seems to me that a kernel launch needs 3 parameters, but only 2 are ever used, right?
I’m actually launching a kernel with this configuration:
<<<1250000, 1024>>>

How is this read? 1 grid with 1250000 blocks, each block with 1024 threads? Isn’t 1024 the maximum ‘x’ dimension for blocks?

What if I want 3 grids with 16 blocks, each block with 32 threads… what is the proper configuration for these parameters?

Thanks in advance for the noob question :)

Hello,

A kernel launch needs 2 dim3 arguments. dim3 is an integer structure with 3 components (x, y, z). If you pass just a number, the dimensions you do not mention are automatically set to 1. So in your case you launch a grid of (1250000,1,1) blocks, and each block has (1024,1,1) threads. The optional third argument is the number of bytes of dynamic shared memory per block; if it is not set, it is 0. There is only one grid per kernel launch, so you cannot ask for 3 grids in a single launch. For one grid with 16 blocks of 32 threads each you would launch with:
<<<16,32>>>
Or you can define 2 dim3 variables:

dim3 grid,threads;
grid.x=16;
grid.y=1;
grid.z=1;
threads.x=32;
threads.y=1;
threads.z=1;

launch with <<<grid,threads>>>
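
To make the dim3 version concrete, here is a minimal sketch of a complete program. The kernel name fillIndex and the buffer names are made up for illustration; the point is only to show how 16 blocks of 32 threads map to a global index.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread writes its own global index.
__global__ void fillIndex(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index in x
    if (i < n) out[i] = i;                          // guard against extra threads
}

int main()
{
    const int n = 16 * 32;   // one element per thread
    dim3 grid(16);           // (16,1,1): y and z default to 1
    dim3 threads(32);        // (32,1,1)

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    fillIndex<<<grid, threads>>>(d_out, n);

    int h_out[16 * 32];
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("grid=(%u,%u,%u) block=(%u,%u,%u) last index=%d\n",
           grid.x, grid.y, grid.z, threads.x, threads.y, threads.z,
           h_out[n - 1]);
    return 0;
}
```

Writing fillIndex<<<16,32>>>(d_out, n) should be equivalent here, since the plain integers are converted to dim3 with the other components set to 1.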

Thanks for taking the time, pasoleatis! (I also noticed you helped me with the “speed up” post :)

I realize my mistake now!

Looking at a picture (Figure 7 in the CUDA C Programming Guide v5.0), it looked like I could launch several grids of blocks…

Something like:
kernel<<<2,16,32>>>

That is, 2 grids of 16 blocks, each block with 32 threads.

But now, looking at a different (more specific) picture, I realize “Grid 1” and “Grid 2” in the picture belong to different kernel launches.

http://ixbtlabs.com/articles3/video/cuda-1-p5.html

Sorry for the super noob mistake (I personally blame Figure 7 of programming guide v5.0 :P).

Thanks again!

Hello,

I think kernel<<<2,16,32>>> launches 2 blocks of 16 threads each and allocates 32 bytes of dynamic shared memory per block.
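
As a rough illustration of what the third argument is for, here is a small sketch that uses dynamic shared memory. The kernel reverseInBlock is a made-up example, and note that it asks for 64 bytes (16 ints) rather than 32, because each of the 16 threads stores one int.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reverses each block's slice of the array,
// staging the values in dynamically sized shared memory.
__global__ void reverseInBlock(int *data)
{
    extern __shared__ int tile[];          // size comes from the 3rd launch argument
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];
    __syncthreads();                       // wait until the whole tile is loaded
    data[base + t] = tile[blockDim.x - 1 - t];
}

int main()
{
    const int blocks = 2, threads = 16;
    const int n = blocks * threads;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    size_t sharedBytes = threads * sizeof(int);   // bytes of shared memory per block
    reverseInBlock<<<blocks, threads, sharedBytes>>>(d);

    cudaError_t err = cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("first block now starts with %d, second with %d\n", h[0], h[threads]);
    return 0;
}
```

If the kernel does not declare an extern __shared__ array, the third argument can simply be left out.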

The image you mentioned is something like :

kernel1<<<16,32>>>

host code

kernel2<<<16,32>>>

This means that the kernel1 launch is sent to the GPU and control returns to the host right away, so kernel1 executes independently of the host code that follows. The host then runs its own code, and when that is finished the kernel2 launch is sent to the GPU. The kernels run concurrently with the host code, but kernel2 starts only after kernel1 has finished (both are in the same default stream).
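
A small sketch of that ordering, assuming both launches go to the default stream. The kernels addOne and timesTwo and the host loop are invented for the example; they are not from the programming guide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels used only to make the launch order visible.
__global__ void addOne(int *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1;
}

__global__ void timesTwo(int *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2;
}

int main()
{
    const int n = 16 * 32;
    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemset(d, 0, n * sizeof(int));

    addOne<<<16, 32>>>(d, n);        // launch returns without waiting for the GPU

    long hostSum = 0;                // host work that can overlap with addOne
    for (int k = 0; k < 1000; ++k) hostSum += k;

    timesTwo<<<16, 32>>>(d, n);      // same stream, so it runs after addOne

    cudaError_t err = cudaDeviceSynchronize();   // wait for both kernels
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int first = 0;
    cudaMemcpy(&first, d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    // (0 + 1) * 2 = 2 when the kernels execute in launch order.
    printf("host sum=%ld, device value=%d\n", hostSum, first);
    return 0;
}
```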

According to the programming guide, only 2 arguments (the grid and block dimensions) are required for each kernel launch: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#thread-hierarchy

To begin with, I suggest writing simple programs first and then building up to more complex code.