CUDA - thread block confusion, concept clarity sought

I have a GeForce 8800 GS - 96 cores arranged in 12 multiprocessors.

Now I ran the CUDA SDK deviceQuery and it tells me that the maximum number of threads per block = 512,
and that the max dimensions of a block are 512 x 512 x 64.

But if the max dimensions of a block are 512 x 512 x 64, shouldn't the number of threads per block be that much rather than 512?

Where am I wrong? Help please.

from spec of compute capability 1.0, you have

  1. The maximum number of threads per block is 512

  2. The maximum sizes of the x-, y-, and z-dimension of a thread block are 512, 512, and 64, respectively

these two conditions (constraints) must be satisfied at the same time.

for example:

legal: 16 x 16 = 256, 4 x 4 x 8 = 128

illegal: 32 x 32 = 1024 (condition 1 fails), 8 x 8 x 15 = 960 (condition 1 fails)

the maximum number of threads per block comes from a hardware limitation:

the GPU must provide a synchronization operation for the threads of one block, so this number cannot be too large;

512 is O.K. since it needs a 9-level AND gate to implement the synchronization

Thank you LSChien. So the deviceQuery output is creating confusion when it says the max sizes of each dimension of a block are 512 * 512 * 64. It should have said 512 for dimension x, the same for y, and 64 for dimension z.

yes, it indeed means what you think:

The maximum sizes of the x-, y-, and z-dimension of a thread block are 512, 512, and 64, “respectively”

in fact, some conditions in the compute capability spec must be considered together;

I can show you another example

from spec of compute capability 1.0

  1. The maximum number of active blocks per multiprocessor is 8

  2. The maximum number of active warps per multiprocessor is 24

  3. The maximum number of active threads per multiprocessor is 768

if you want to ask me "what is the maximum number of active blocks in a multiprocessor", then

you must combine these three conditions. however, condition 2 is redundant, because from

condition 3 and 32 threads/warp, you have 768/32 = 24 warps

hence we only consider condition 1 and condition 3

example 1: suppose we choose the size of the thread block as 16 x 16 = 256, then

under condition 3, we have 768/256 = 3 blocks in a multiprocessor

under condition 1, 3 < 8, hence we have 3 blocks in a multiprocessor

example 2: suppose we choose the size of the thread block as 4 x 4 = 16, then

under condition 3, we have 768/16 = 48 blocks in a multiprocessor

under condition 1, 48 > 8, hence we only have 8 blocks in a multiprocessor

this means that we only have 8 * 16 = 128 active threads in a multiprocessor

Can I define a block size like dim3 dimBlock(20,20,1)? I have been trying this, and it does nothing. If I define dim3 dimBlock(20,12,1) it works fine, but from dim3 dimBlock(20,13,1) onward it stops working; it is as if the maximum number of threads per block were 256.

you can do max of max

square_array <<< dim3(4096,16,1), dim3(16,32,1) >>> (memoiregraphique1, N, in, ss);
square_array <<< dim3(1024,64,1), dim3(64,8,1) >>> (memoiregraphique1, N, in, ss);

= 4096*16 * 16*32 threads

4096*16 = 65536 = 1024*64 = 65536 max

16*32 = 64*8 = 512 max for me, perhaps 256 for you

When you submit the kernel you use a parameter dim3(tx,ty,tz) which specifies the number of threads in a block. The device query only says that tx<=512, ty<=512 and tz<=64; in addition to these 3 you must also have tx*ty*tz<=512.

So you have 4 conditions for the threads in a block:

tx<=512

ty<=512

tz<=64

and

tx*ty*tz<=512.