question for block and thread control

yhgon · June 23, 2008, 1:50am

Hi.

I’m reading the CUDA guide, but I don’t think I fully understand the relationship between S/W and H/W.

Specifically:

# of grids, # of blocks, # of threads

vs.

# of H/W SMs, # of threads per SM * SMs per device

I’m wondering how to configure dimGrid and dimBlock for my application.
I want to run one thread per SP (in a __global__ function) so that no SP sits idle,
and I’ll use tid to access specific locations in global memory.
Could you give me any tips on how to configure this?

Which of these configurations of <<<A,B>>>() should I use?

case1
func<<<1,128>>> ();

case2
dim3 dimBlock(1,128);
func<<<1,dimBlock>>> ();

case3
func<<<16,8>>> ();

thanks.

JHHPC · June 23, 2008, 8:45am

Hi.

In general you should not rely too much on the hardware MP (multiprocessor) count, as it changes from one generation to the next and even among variants of the same chip, e.g. G80.

So for the mapping to the hardware, think of the grid as the complete algorithm you want to distribute.
Each block runs on a single MP, so blocks cannot synchronize or share on-chip data with each other while the kernel runs.
Inside the blocks you have the threads, which “do” the actual work.
The thread count per block should be at least 32 and a multiple of 32, because threads are scheduled in warps of 32 and a partly filled warp still takes up a full warp.

Indeed, 16 blocks on a full-blown G80 would most probably occupy all 16 MPs; however, you are limited to 512 threads per block, so one block per MP cannot use all 768 threads an MP can hold.
So multiple blocks should be scheduled per MP, and in my opinion the number of blocks should adjust automatically to the problem size and the hardware (see the new GT200 with 30 MPs).
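
A minimal sketch of sizing a launch from the problem size rather than from a hard-coded MP count. The kernel `scale`, the helpers `roundUpToWarp` and `blocksFor`, and the array length are hypothetical names chosen for illustration; the device query is there only to report what the hardware offers, not to drive the configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: round a block size up to a whole number of warps.
inline int roundUpToWarp(int threads, int warpSize = 32) {
    return ((threads + warpSize - 1) / warpSize) * warpSize;
}

// Hypothetical helper: smallest block count with blocks * threadsPerBlock >= n.
inline int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Each thread handles one element, addressed by its global thread id.
__global__ void scale(float *data, int n, float factor) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {  // the last block may be only partly used
        data[tid] *= factor;
    }
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device available\n");
        return 1;
    }
    printf("MPs: %d, max threads/block: %d, warp size: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerBlock, prop.warpSize);

    const int n = 100000;                             // assumed problem size
    int threads = roundUpToWarp(100, prop.warpSize);  // 100 becomes 128 with 32-wide warps
    int blocks  = blocksFor(n, threads);              // usually far more blocks than MPs

    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    scale<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();
    printf("launched %d blocks of %d threads: %s\n", blocks, threads,
           cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_data);
    return 0;
}
```

Because the block count follows from `n`, the same code gives the scheduler enough blocks to spread over 16 MPs or 30 MPs without any change.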

The kernel call takes at least two parameters; let’s call them GRIDDING and BLOCKING:
kernel<<<GRIDDING,BLOCKING>>>

GRIDDING defines the shape of your grid of blocks. If you provide a single integer here, you get a one-dimensional grid with GRIDDING blocks to run.
Since a grid is addressable in at most two dimensions, only the first two entries of an initialized dim3 variable are used.
So dim3 dimGRID(256,256,256) would give you 256x256 blocks.

With threads it’s the same, except that a block of threads can have three dimensions.
So dim3 dimBLOCK(32,2,2) would give you 32x2x2 = 128 threads to run per block.
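
As a sketch of how the two shapes combine, the following launches a 2D grid of 3D blocks and gives every thread a unique linear index. The helpers `flattenThread` and `flattenBlock`, the kernel `writeGlobalId`, and the 4x4 grid size are hypothetical and chosen only for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: linear index of a thread in its block (x fastest, then y, then z).
__host__ __device__ inline unsigned flattenThread(uint3 idx, dim3 dim) {
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}

// Hypothetical helper: linear index of a block in a 2D grid (x fastest, then y).
__host__ __device__ inline unsigned flattenBlock(uint3 idx, dim3 dim) {
    return idx.x + idx.y * dim.x;
}

// Every thread writes its own global linear index into its own slot.
__global__ void writeGlobalId(unsigned *out) {
    unsigned threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    unsigned gid = flattenBlock(blockIdx, gridDim) * threadsPerBlock
                 + flattenThread(threadIdx, blockDim);
    out[gid] = gid;
}

int main() {
    dim3 dimGrid(4, 4);       // 4x4 = 16 blocks
    dim3 dimBlock(32, 2, 2);  // 32x2x2 = 128 threads per block
    const unsigned total = dimGrid.x * dimGrid.y
                         * dimBlock.x * dimBlock.y * dimBlock.z;

    unsigned *d_out = NULL;
    if (cudaMalloc((void **)&d_out, total * sizeof(unsigned)) != cudaSuccess) {
        printf("cudaMalloc failed (no CUDA device?)\n");
        return 1;
    }
    writeGlobalId<<<dimGrid, dimBlock>>>(d_out);

    std::vector<unsigned> h_out(total, 0xFFFFFFFFu);  // sentinel marks unwritten slots
    cudaMemcpy(h_out.data(), d_out, total * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // Each slot should hold its own index if every thread got a unique id.
    unsigned mismatches = 0;
    for (unsigned i = 0; i < total; ++i) {
        if (h_out[i] != i) ++mismatches;
    }
    printf("%u threads launched, %u mismatches\n", total, mismatches);
    return mismatches == 0 ? 0 : 1;
}
```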

case1
func<<<1,128>>> ();
--> one block, 128 threads per block

case2
dim3 dimBlock(1,128);
func<<<1,dimBlock>>> ();
--> one block, 1x128 = 128 threads per block (laid out along y)

case3
func<<<16,8>>> ();
--> 16 blocks, 8 threads per block
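
All three cases launch 128 threads in total, but they differ in how a thread sees its own position. The sketch below records, for each case, the id a typical 1D kernel would compute from the x components alone. The kernel `recordXOnlyTid` and the helpers `totalThreads` and `distinctXOnlyTids` are hypothetical names used only for this illustration.

```cuda
#include <cstdio>
#include <set>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: total number of threads a configuration launches.
inline unsigned totalThreads(dim3 grid, dim3 block) {
    return grid.x * grid.y * grid.z * block.x * block.y * block.z;
}

// Each thread stores the id that an x-only formula would give it.
// The storage slot uses x and y so that every thread writes a different element.
__global__ void recordXOnlyTid(int *xOnlyTid) {
    int threadsPerBlock = blockDim.x * blockDim.y;
    int slot = blockIdx.x * threadsPerBlock + threadIdx.y * blockDim.x + threadIdx.x;
    xOnlyTid[slot] = blockIdx.x * blockDim.x + threadIdx.x;
}

// Launch one configuration and count how many distinct x-only ids it produced.
// Returns 0 if the device could not be used.
static unsigned long distinctXOnlyTids(dim3 grid, dim3 block) {
    unsigned n = totalThreads(grid, block);
    int *d_ids = NULL;
    if (cudaMalloc((void **)&d_ids, n * sizeof(int)) != cudaSuccess) return 0;
    recordXOnlyTid<<<grid, block>>>(d_ids);
    std::vector<int> h_ids(n, -1);
    cudaError_t err = cudaMemcpy(h_ids.data(), d_ids, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_ids);
    if (err != cudaSuccess) return 0;
    return (unsigned long)std::set<int>(h_ids.begin(), h_ids.end()).size();
}

int main() {
    dim3 dimBlock(1, 128);  // case 2: blockDim.x == 1, blockDim.y == 128

    printf("case1: %u threads, %lu distinct x-only ids\n",
           totalThreads(dim3(1), dim3(128)), distinctXOnlyTids(dim3(1), dim3(128)));
    printf("case2: %u threads, %lu distinct x-only ids\n",
           totalThreads(dim3(1), dimBlock), distinctXOnlyTids(dim3(1), dimBlock));
    printf("case3: %u threads, %lu distinct x-only ids\n",
           totalThreads(dim3(16), dim3(8)), distinctXOnlyTids(dim3(16), dim3(8)));
    return 0;
}
```

In case 2 every thread has threadIdx.x == 0, so a kernel that builds tid from x alone would give all 128 threads the same tid; it would have to use threadIdx.y instead. Cases 1 and 3 give 128 distinct values, but case 3 uses only 8 threads per block, well below a full warp of 32.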

Hope that helps.

Johannes
