Question about grid/block/thread sizes

A very simple question, I believe:
The output of deviceQuery from the CUDA samples shows that I have these maximum sizes:

  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535

I know that I can "comfortably" work with this configuration:

  dim3 blocks(65535);
  dim3 threads(1024);
  kernel<<<blocks, threads>>>();

However, if I'm guessing correctly, I cannot work with something like this:

  dim3 blocks(65535,65535);
  dim3 threads(1024,1024);
  kernel<<<blocks, threads>>>();

That is because I have a maximum of 1024 threads per block, and here I'm actually requesting 1024 in each dimension (1024x1024 = 1,048,576 threads per block). Is this correct?

Thank you.

Hello,

You are right, 1024x1024 threads per block will not work. The total number of threads per block must not exceed 1024. The number of blocks you chose is fine. If you use dim3 threads(1024,1024), the kernel will not be executed. You could use, for example, dim3 threads(32,32).
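As a rough illustration of how dim3 threads(32,32) can still cover a 1024x1024 problem, here is a minimal host-side sketch. It assumes a 2D domain of 1024x1024 elements, which is my guess at what the question is after. Dim3, div_up and grid_for are hypothetical names made up for this example; Dim3 is only a stand-in for CUDA's dim3 so the snippet compiles without the CUDA toolkit.

```cpp
// Hypothetical stand-in for CUDA's dim3, so this compiles as plain C++.
struct Dim3 { unsigned x, y, z; };

// Ceiling division: number of blocks of size `block` needed to cover `n` elements.
unsigned div_up(unsigned n, unsigned block) {
    return (n + block - 1) / block;
}

// Grid shape needed to cover a width x height domain with the given block shape.
// Assumes a 2D problem, so the z dimension of the grid is left at 1.
Dim3 grid_for(unsigned width, unsigned height, Dim3 block) {
    return Dim3{div_up(width, block.x), div_up(height, block.y), 1};
}
```

With Dim3 threads{32, 32, 1}, grid_for(1024, 1024, threads) gives a 32 x 32 grid, so the launch still provides 1024x1024 threads in total while each block stays at 32*32 = 1024 threads. In a real .cu file you would use dim3 directly and pass both values to kernel<<<blocks, threads>>>().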

If you have dim3 threads(tx,ty,tz), the following rules apply: tx*ty*tz<=1024, tx<=1024, ty<=1024, tz<=64.
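These rules can be written down as a small host-side check. This is only a sketch: the limits are copied from the deviceQuery output in the question, and BlockDim, BlockLimits and block_is_valid are hypothetical names for this example, not part of the CUDA API. In real code the limits would more likely come from cudaGetDeviceProperties.

```cpp
#include <cstdint>

// Hypothetical stand-in for the block shape passed as dim3 threads(tx,ty,tz).
struct BlockDim { unsigned x, y, z; };

// Limits taken from the deviceQuery output quoted in the question (assumed values).
struct BlockLimits {
    unsigned max_threads_per_block = 1024;
    unsigned max_x = 1024;
    unsigned max_y = 1024;
    unsigned max_z = 64;
};

// Returns true only if every per-dimension limit AND the total-thread limit hold.
bool block_is_valid(BlockDim b, BlockLimits lim = BlockLimits{}) {
    if (b.x == 0 || b.y == 0 || b.z == 0) return false;
    if (b.x > lim.max_x || b.y > lim.max_y || b.z > lim.max_z) return false;
    // Use a 64-bit product so that large shapes cannot overflow.
    std::uint64_t total = std::uint64_t(b.x) * b.y * b.z;
    return total <= lim.max_threads_per_block;
}
```

For example, (1024,1,1), (32,32,1) and (4,4,64) pass, while (1024,1024,1) fails on the total and (1,1,128) fails on the z limit.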

Thanks pasoleatis, that is exactly what I was thinking. Too bad, it would really simplify my life if I could have that many threads, hehehe.

For the GTX 680, the maximum number of threads per multiprocessor is 2048. The maximum number of threads per block is still 1024.