Question about block and thread configuration

Hi.

I’m reading the CUDA guide, but I don’t think I fully understand the relationship between the software and the hardware.

Specifically:

# of grids, # of blocks, # of threads

vs.

# of H/W SMs & # of threads per SM * SMs per device

I’m wondering how to configure dimGrid and dimBlock for my application.
I want to use one thread per SP (in a __global__ function) so that no SP sits idle,
and I’ll use tid to access a specific location in global memory.
Could you give me any tips on how to configure this?

Which of these <<<A,B>>> configurations should I use?

case1
func<<<1,128>>> ();

case2
dim3 dimBlock(1,128);
func<<<1,dimBlock>>> ();

case3
func<<<16,8>>> ();

thanks.

Hi.

In general you should not rely too much on the hardware MP count, as it changes from one generation to the next and even among variants of the same chip, e.g. G80.

So for the mapping to the hardware, think of the grid as the complete algorithm you want to distribute.
Each block runs on one MP, so blocks cannot synchronize with each other or share on-chip memory while the kernel is running.
Inside the blocks you have the threads, which do the actual work.
Thread counts per block should be at least 32 and always a multiple of 32, so that every warp is full, since threads are scheduled in warps of 32.
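
To make the tid part concrete, here is a minimal sketch of the usual one-dimensional indexing, assuming one output element per thread. The names global_index, write_tid and fill_with_tid are made up for illustration and are not part of CUDA.

```cuda
#include <cuda_runtime.h>

// Illustrative helper (not part of CUDA): flat index of a thread in a 1D launch.
__host__ __device__ inline int global_index(int block, int blockSize, int thread)
{
    return block * blockSize + thread;
}

// Each thread writes its own global index to its own element of out[].
__global__ void write_tid(int *out, int n)
{
    int tid = global_index(blockIdx.x, blockDim.x, threadIdx.x);
    if (tid < n)              // guard in case the grid covers more than n elements
        out[tid] = tid;
}

// Illustrative host wrapper: launches blocks * threadsPerBlock threads and
// copies the result back. host_out must hold blocks * threadsPerBlock ints.
cudaError_t fill_with_tid(int *host_out, int blocks, int threadsPerBlock)
{
    int n = blocks * threadsPerBlock;
    int *dev_out = NULL;
    cudaError_t err = cudaMalloc((void **)&dev_out, n * sizeof(int));
    if (err != cudaSuccess) return err;

    write_tid<<<blocks, threadsPerBlock>>>(dev_out, n);
    err = cudaGetLastError();             // reports an invalid launch configuration
    if (err == cudaSuccess)
        err = cudaMemcpy(host_out, dev_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return err;
}
```

Calling fill_with_tid(buf, 4, 128) would launch 4 blocks of 128 threads (four full warps per block), and the thread with blockIdx.x = 1 and threadIdx.x = 5 would write to element 133.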

Indeed, 16 blocks on a full G80 would most probably occupy all 16 MPs; however, you are limited to 512 threads per block, so a single block cannot fill the 768 threads an MP can hold.
So multiple blocks should be scheduled per MP, and in my opinion the number of blocks should adjust automatically to the problem size and the hardware (see the new GT200 with 30 MPs).
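
One common way to let the block count follow the problem size is a rounded-up division, with the MP count queried at run time only for information. This is a sketch under assumptions: device 0 is used, and blocks_for and multiprocessor_count are hypothetical helper names.

```cuda
#include <cuda_runtime.h>

// Number of blocks needed to cover n elements with threadsPerBlock threads each
// (integer division rounded up).
static int blocks_for(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Number of multiprocessors on device 0, or -1 if the query fails.
static int multiprocessor_count(void)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return -1;
    return prop.multiProcessorCount;
}
```

A launch would then look like kernel<<<blocks_for(n, 128), 128>>>(...). For large n this gives many more blocks than MPs, and the hardware distributes them over whatever MPs the device has, so the same code runs unchanged on a smaller or larger chip. Because the last block may be only partly needed, the kernel should check its index against n.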

The kernel call takes at least two parameters; let's call them GRIDDING and BLOCKING:
kernel<<<GRIDDING,BLOCKING>>>

GRIDDING defines the shape of your grid of blocks. If you provide a single integer here, you get a one-dimensional grid of GRIDDING blocks.
On this hardware the grid is addressable in at most two dimensions, so only the first two entries of a dim3 variable are meaningful (devices of compute capability 2.0 and later also support a third grid dimension).
So dim3 dimGRID(256,256) would give you 256x256 blocks.

With threads it's the same, except that a block of threads can have three dimensions.
So dim3 dimBLOCK(32,2,2) would give you 32x2x2 = 128 threads per block.
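
As a rough illustration of a two-dimensional grid combined with a three-dimensional block, the sketch below flattens both into a single array index. The names dim_count, write_block_id and launch_2d_grid are invented for this example, and the grid shape 4x2 is an arbitrary choice.

```cuda
#include <cuda_runtime.h>

// Illustrative host helper: total number of entries described by a dim3.
inline unsigned int dim_count(dim3 d)
{
    return d.x * d.y * d.z;
}

// Every thread stores the flat index of its block at its own flat position.
__global__ void write_block_id(int *out)
{
    unsigned int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    unsigned int block  = blockIdx.y * gridDim.x + blockIdx.x;
    unsigned int thread = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x
                        + threadIdx.x;
    out[block * threadsPerBlock + thread] = (int)block;
}

// dev_out must point to device memory for
// dim_count(dimGrid) * dim_count(dimBlock) = 8 * 128 = 1024 ints.
cudaError_t launch_2d_grid(int *dev_out)
{
    dim3 dimGrid(4, 2);        // 4 x 2 = 8 blocks (unspecified z defaults to 1)
    dim3 dimBlock(32, 2, 2);   // 32 x 2 x 2 = 128 threads per block
    write_block_id<<<dimGrid, dimBlock>>>(dev_out);
    return cudaGetLastError();
}
```

Keeping x as the fastest-varying coordinate means that consecutive threadIdx.x values map to consecutive array elements.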

case1
func<<<1,128>>> ();
-> one block, 128 threads per block

case2
dim3 dimBlock(1,128);
func<<<1,dimBlock>>> ();
-> one block, 128 threads per block, arranged as 1x128: threadIdx.x is always 0 and threadIdx.y runs from 0 to 127

case3
func<<<16,8>>> ();
-> 16 blocks, 8 threads per block (128 threads in total, but each block fills only a quarter of a warp)

Hope that helps.

Johannes