CUDA - thread block confusion, concept clarity sought

I have a GeForce 8800 GS - 96 cores arranged in 12 multiprocessors.

Now I ran the CUDA SDK deviceQuery and it tells me that the maximum number of threads per block = 512,
and that the max dimensions of a block are 512 x 512 x 64.

But if the max dimensions of a block are 512 x 512 x 64, shouldn't the number of threads per block be that much rather than 512?

Where am I wrong? Help please.

from spec of compute capability 1.0, you have

  1. The maximum number of threads per block is 512

  2. The maximum sizes of the x-, y-, and z-dimension of a thread block are 512, 512, and 64, respectively

these two conditions (constraints) must be satisfied at the same time.

for example:

legal: 16 x 16 = 256, 4 x 4 x 8 = 128

illegal: 32 x 32 = 1024 (condition 1 fails), 8 x 8 x 15 = 960 (condition 1 fails)

the maximum number of threads per block comes from a hardware limitation:

the GPU must provide a synchronization operation for the threads of one block, so this number cannot be too large;

512 is O.K. since it needs a 9-level AND gate to implement the synchronization

Thank you LSChien. So the deviceQuery output is creating confusion when it says the max sizes of each dimension of a block are 512 * 512 * 64. It should have said 512 for dimension x, the same for y, and 64 for dimension z.

yes, it indeed means what you think:

The maximum sizes of the x-, y-, and z-dimension of a thread block are 512, 512, and 64, “respectively”

in fact, some conditions in the compute capability spec must be considered together;

I can show you another example

from spec of compute capability 1.0

  1. The maximum number of active blocks per multiprocessor is 8

  2. The maximum number of active warps per multiprocessor is 24

  3. The maximum number of active threads per multiprocessor is 768

if you want to ask me "what is the maximum number of active blocks in a multiprocessor", then

you must combine these three conditions. however, condition 2 is redundant, because from

condition 3 and 32 threads/warp, you have 768/32 = 24 warps

hence we only consider condition 1 and condition 3

example 1: suppose we choose the size of the thread block as 16 x 16 = 256, then

under condition 3, we have 768/256 = 3 blocks in a multiprocessor

under condition 1, 3 < 8, hence we have 3 blocks in a multiprocessor

example 2: suppose we choose the size of the thread block as 4 x 4 = 16, then

under condition 3, we have 768/16 = 48 blocks in a multiprocessor

under condition 1, 48 > 8, hence we only have 8 blocks in a multiprocessor

this means that we only have 8 * 16 = 128 active threads in a multiprocessor

Can I define a block size like dim3 dimBlock(20,20,1)? I have been trying this, and it does nothing. If I define dim3 dimBlock(20,12,1) it works fine, but from dim3 dimBlock(20,13,1) onward it stops working; it is as if the maximum number of threads per block were 256.

you can do max of max

square_array <<< dim3(4096,16,1), dim3(16,32,1) >>> (memoiregraphique1, N, in, ss);
square_array <<< dim3(1024,64,1), dim3(64,8,1) >>> (memoiregraphique1, N, in, ss);

= 4096*16 * 16*32 threads

4096*16 = 65536 = 1024*64 = 65536 max

16*32 = 64*8 = 512 max for me, perhaps 256 for you

When you submit the kernel you use a parameter dim3(tx,ty,tz) which specifies the number of threads in a block. The device query only says that tx<=512, ty<=512 and tz<=64; in addition to these 3 you must also have tx*ty*tz<=512.

So you have 4 conditions for the threads in a block:

tx<=512

ty<=512

tz<=64

and

tx*ty*tz<=512.