The maximum number of blocks and threads

What is the maximum number of blocks and threads in a grid? While using the GPU to calculate the product of two big matrices, I found that the results from the GPU and the CPU differ, so I guess the number of threads is restricted.

Yes, these numbers are restricted. Please check the Programming Guide for details.

But I cannot find it in “Programming_Guide_2.0beta2”. Can you tell me the numbers? My graphics card is a 9600 GT, and I am using CUDA 2.0.

Appendix A.1.1:

"The maximum sizes of the x-, y-, and z-dimension of a thread block are 512, 512,

and 64, respectively"

“The maximum size of each dimension of a grid of thread blocks is 65535”
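
To make those numbers concrete, here is a minimal sketch of a launch configuration that stays within the documented limits. The kernel and the sizes are illustrative assumptions, not anyone's actual code:

```cpp
#include <cuda_runtime.h>

__global__ void emptyKernel() { }   // placeholder kernel body

int main()
{
    // Thread block: at most 512 threads total on these devices,
    // with per-dimension maxima of 512 x 512 x 64.
    dim3 block(16, 16);       // 16 * 16 = 256 threads per block

    // Grid: each dimension may be at most 65535.
    dim3 grid(4096, 4096);    // both dimensions well under 65535

    emptyKernel<<<grid, block>>>();
    cudaThreadSynchronize();  // CUDA 2.0-era synchronization call
    return 0;
}
```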

Additional note: run the CUDA SDK example “deviceQuery”. On your card it should report a maximum of 512 threads per block, and only two usable dimensions in the grid (the z-dimension of a grid is fixed at 1 on these devices).
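
If you'd rather query the limits from code than read deviceQuery's output, here is a minimal sketch using the runtime API (device 0 assumed):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device: %s\n", prop.name);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims:        %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:         %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```

On a 9600 GT this should report 512 threads per block, block dimensions of 512 x 512 x 64, and grid dimensions of 65535 x 65535 x 1.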

Thanks, everybody!
But what I really want to know is the maximum number of threads per grid.

512 * (65535^2) = 2 198 956 147 200 threads in a single grid.

The biggest matrix I calculated was 4720 * 4720 = 22 278 400 elements, which is far smaller than both that number and even the one-dimensional limit of 512 * 65535 = 33 553 920.
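
For context, here is a minimal sketch of how a matrix that size might be covered by a 2D grid. The 16 x 16 block size and the element-wise kernel are assumptions for illustration, not the original poster's code:

```cpp
#include <cuda_runtime.h>

// Hypothetical element-wise kernel over an n x n matrix.
__global__ void touchMatrix(float *m, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n)
        m[row * n + col] *= 2.0f;
}

int main()
{
    const int n = 4720;
    float *d_m;
    cudaMalloc(&d_m, n * n * sizeof(float));

    dim3 block(16, 16);                      // 256 threads per block
    dim3 grid((n + block.x - 1) / block.x,   // 295 blocks in x
              (n + block.y - 1) / block.y);  // 295 blocks in y
    // 295 x 295 = 87 025 blocks total, with each grid dimension
    // nowhere near the 65535-per-dimension limit.
    touchMatrix<<<grid, block>>>(d_m, n);

    cudaFree(d_m);
    return 0;
}
```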

It would be interesting to launch a grid of that size with an empty kernel to see how much overhead it incurs.

I’ve just benchmarked launching the biggest possible empty kernel (65535^2 blocks, 512 threads per block), and it actually takes less time than launching a smaller kernel (256 blocks). WTH.

10,000 launches, with a cudaThreadSynchronize() after each:
big kernel - 209 ms (21 microseconds per launch)
small kernel - 300 ms (30 microseconds per launch)
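
For reference, a minimal sketch of this kind of benchmark (my reconstruction, not the actual code from the post; as the EDIT below shows, the original "big" launch had its arguments swapped, and a genuinely maxed-out grid should be expected to take far longer):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() { }

// Time `iters` launches of an empty kernel, synchronizing after each.
static void bench(const char *label, dim3 grid, dim3 block, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i) {
        emptyKernel<<<grid, block>>>();
        cudaThreadSynchronize();   // wait after each launch, as in the post
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%s: %.0f ms total (%.1f us per launch)\n",
           label, ms, 1000.0f * ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

int main()
{
    bench("big kernel",   dim3(65535, 65535), dim3(512), 10000);
    bench("small kernel", dim3(256),          dim3(512), 10000);
    return 0;
}
```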

Also, apparently I can launch empty kernels on grids bigger than 65535x65535. I must be doing something wrong?

EDIT: Duh! I swapped gridDim and blockDim in the launch parameters.
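
For anyone else hitting this: in the <<<...>>> launch syntax the grid dimensions come first and the block dimensions second. A minimal sketch of the pitfall (hypothetical empty kernel):

```cpp
#include <cuda_runtime.h>

__global__ void emptyKernel() { }

int main()
{
    dim3 grid(65535, 65535);   // blocks in the grid
    dim3 block(512);           // threads per block

    // Correct order: grid first, then block.
    emptyKernel<<<grid, block>>>();

    // The bug: arguments swapped. This requests a 512-block grid of
    // 65535 x 65535-thread blocks -- an invalid configuration, so the
    // launch fails immediately; without error checking it just looks
    // like a suspiciously fast "big" kernel.
    emptyKernel<<<block, grid>>>();
    cudaError_t err = cudaGetLastError();  // cudaErrorInvalidConfiguration expected
    (void)err;
    return 0;
}
```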

Now launching a maxed-out kernel results in either a timeout error or a bluescreen, so the overhead is surely larger than about 10 s. Perhaps a Linux machine with no watchdog timer could run such a kernel.