Thread Number Limitation

Hello, everyone.

I’m very confused about the thread number limitation.
I’ve read the CUDA Programming Guide and many articles to understand the problem, but I still can’t.

I have an 8500 GT device, which is compatible with compute capability 1.1.

CUDA compute capability 1.1 specifies that
first, the maximum number of threads per block is 512;
second, the maximum block dimensions and maximum grid dimensions are 512 * 512 * 64 and 65535 * 65535 respectively.

When I read these restrictions, I understood that
the dimension limits are the ones that matter for programming,
and that the thread number limit only describes how many threads the device can process at once.
Hence, I wrote sample programs that ignore the thread number limit,
but as the programs grew larger, the kernels stopped working!!
I now think the kernels fail because I define block dimensions that exceed the maximum thread number.

So I tested some code and found that the block dimensions must not exceed the maximum thread number, i.e. blockDim.x * blockDim.y * blockDim.z must be at most 512.

I can’t quite trust this result (because I still don’t fully understand CUDA internals…).
It seems to contradict the second specification I mentioned above.

Can anyone clarify the relation between the maximum thread number and the maximum block dimension size?
Help me~~.

As far as I know

The maximum threads/block is 512

so you cannot use more than 512 threads/block (your kernel will not launch)

for example: (256, 2, 2) is not allowed, because 256 * 2 * 2 = 1024 threads.
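
To make the rule concrete, here is a minimal host-side sketch of the check, assuming only the compute capability 1.1 limits quoted in this thread. `Dim3`, `threadsPerBlock` and `blockShapeIsValid` are hypothetical names for illustration (`Dim3` stands in for CUDA's `dim3` so the snippet compiles without the toolkit); the real check is done by the runtime at launch time.

```cpp
// Hypothetical stand-in for CUDA's dim3, so this compiles as plain C++.
struct Dim3 { unsigned x, y, z; };

// Limits quoted in this thread for compute capability 1.1.
constexpr unsigned kMaxThreadsPerBlock = 512;
constexpr unsigned kMaxBlockDimX = 512;
constexpr unsigned kMaxBlockDimY = 512;
constexpr unsigned kMaxBlockDimZ = 64;

// Total number of threads in one block.
constexpr unsigned threadsPerBlock(Dim3 b) { return b.x * b.y * b.z; }

// A block shape must satisfy BOTH rules: each dimension is within its own
// maximum, and the product of the dimensions is at most 512.
constexpr bool blockShapeIsValid(Dim3 b) {
    return b.x >= 1 && b.y >= 1 && b.z >= 1
        && b.x <= kMaxBlockDimX && b.y <= kMaxBlockDimY && b.z <= kMaxBlockDimZ
        && threadsPerBlock(b) <= kMaxThreadsPerBlock;
}

// Compile-time checks of the shapes discussed here.
static_assert(blockShapeIsValid(Dim3{512, 1, 1}), "512 threads fit in one block");
static_assert(!blockShapeIsValid(Dim3{256, 2, 2}), "1024 threads exceed the 512 limit");
```

In a real program the same information is available from the device properties, and a launch with an invalid shape reports an error instead of running the kernel.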

The maximum number of blocks per grid is 65535 * 65535.

In summary, you can define

dim3 thread(512, 1,1);

dim3 block(65535, 65535, 1);

I think that your program will work.

(I tested with dim3 thread(128, 1, 1); and dim3 block(65535, 65535, 1); it worked OK)
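
As a rough sketch of how these two limits combine, the helper below picks a grid size for `n` elements given a block size, spilling into the grid's y dimension once x would exceed 65535. `Grid2D`, `divUp` and `gridFor` are hypothetical names for illustration, not part of the CUDA API, and the caller is assumed to check that the resulting `y` is also at most 65535.

```cpp
#include <cstdint>

// Per-dimension grid limit quoted in this thread for compute capability 1.1.
constexpr std::uint64_t kMaxGridDim = 65535;

// Upper bound on threads in one launch: 512 threads/block * 65535 * 65535 blocks.
constexpr std::uint64_t kMaxThreadsPerLaunch = 512 * kMaxGridDim * kMaxGridDim;

// Integer division rounded up.
constexpr std::uint64_t divUp(std::uint64_t a, std::uint64_t b) {
    return (a + b - 1) / b;
}

// Hypothetical stand-in for the grid argument of a kernel launch.
struct Grid2D { std::uint64_t x, y; };

// A grid that covers n elements with threadsPerBlock threads per block.
// If one row of blocks is not enough, extra rows are added in y.
// The grid may contain more threads than n, so the kernel must guard with
// something like `if (idx < n)`, where
// idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x.
constexpr Grid2D gridFor(std::uint64_t n, std::uint64_t threadsPerBlock) {
    return divUp(n, threadsPerBlock) <= kMaxGridDim
        ? Grid2D{divUp(n, threadsPerBlock), 1}
        : Grid2D{kMaxGridDim, divUp(divUp(n, threadsPerBlock), kMaxGridDim)};
}
```

For example, `gridFor(1000, 512)` gives a 2 * 1 grid, while one element more than 65535 * 512 needs a second row of blocks.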

Thanks for reply, Quoc Vinh.

OK, I see.

I had thought that 512 was the hardware limit and 512 * 512 * 64 was the logical limit…

So the maximum thread number is the number of threads that my kernel can launch at once.

But, I still have a question.

If the device can run at most 512 threads,

then what does the maximum block dimension size mean?

“deviceQuery” on my device gives the results below…

Those are the maximum values for each dimension separately.

So you can have (512,1,1)

(1,512,1)

and e.g. (4,2,64).

512 is the maximum number of threads per block, so the product of the three dimensions cannot exceed it.

In hardware terms, the relevant unit is the multiprocessor: on older cards each one can hold a maximum of 768 threads (for example, three blocks of 256 threads) at a time, while for GT200 it is 1024 threads.
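
The arithmetic behind "three blocks of 256 threads" can be sketched as below. This is a simplified estimate under stated assumptions: `residentBlocks` is a hypothetical helper, the default cap of 8 resident blocks per multiprocessor is my assumption for these older architectures, and register and shared-memory usage (which can lower the number further) is ignored.

```cpp
// Simplified estimate of how many blocks one multiprocessor can hold at a time.
// maxThreadsPerSM: 768 for the older cards mentioned above, 1024 for GT200.
// maxBlocksPerSM:  assumed cap on resident blocks per multiprocessor.
// Register and shared-memory limits are deliberately ignored here.
constexpr unsigned residentBlocks(unsigned threadsPerBlock,
                                  unsigned maxThreadsPerSM,
                                  unsigned maxBlocksPerSM = 8) {
    return (maxThreadsPerSM / threadsPerBlock) < maxBlocksPerSM
        ? (maxThreadsPerSM / threadsPerBlock)
        : maxBlocksPerSM;
}

// The example from the post: three blocks of 256 threads fill 768 thread slots.
static_assert(residentBlocks(256, 768) == 3, "768 / 256 = 3 blocks");
```

Under these assumptions, a 512-thread block leaves 256 of the 768 slots unused on the older cards, whereas two such blocks fit exactly on GT200.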