What is the maximum number of blocks and threads in a grid? While using the GPU to calculate the product of two big matrices, I found that the result differed between the GPU and the CPU, so I guess the number of threads is restricted.
Yes, these numbers are restricted. Please check the Programming Guide for details.
But I cannot find it in “Programming_Guide_2.0beta2”. Can you tell me the numbers? My graphics card is a 9600GT, and I'm using CUDA 2.0.
Appendix A.1.1:
“The maximum sizes of the x-, y-, and z-dimension of a thread block are 512, 512, and 64, respectively.”
“The maximum size of each dimension of a grid of thread blocks is 65535.”
Additional note: run the CUDA SDK example “deviceQuery” — it reports these limits for your card: a maximum of 512 threads per block, and only 2 dimensions in the grid.
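For reference, a minimal sketch of the query deviceQuery performs under the hood, using cudaGetDeviceProperties from the runtime API:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims:        %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:         %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```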
Thanks, everybody!
But what I really want to know is the maximum number of threads per grid.
512 * (65535^2) = 2 198 956 147 200 threads.
The biggest matrix I calculated was 4720 * 4720 = 22 278 400 elements, which is far smaller than 2 198 956 147 200, and even smaller than the 1D-grid limit of 512 * 65535 = 33 553 920. So the thread-count limit can't be what's causing my GPU/CPU mismatch.
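Just to illustrate how a matrix that size maps onto a grid, a minimal sketch assuming 16x16 thread blocks (the tile size and the kernel name matMulKernel are placeholders):

```cpp
#include <cuda_runtime.h>

// Placeholder stand-in for the real matrix-multiply kernel.
__global__ void matMulKernel() {}

int main()
{
    const int N    = 4720;  // matrix is N x N
    const int TILE = 16;    // 16*16 = 256 threads per block, under the 512 limit

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE,   // ceiling division: 295 blocks in x
              (N + TILE - 1) / TILE);  // and 295 in y -> 87 025 blocks, far below 65535^2

    matMulKernel<<<grid, block>>>();
    cudaThreadSynchronize();
    return 0;
}
```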
It would be interesting to launch a grid of that size with an empty kernel to see how much overhead it incurs.
I’ve just benchmarked launching the biggest possible empty kernel (65535^2 blocks of 512 threads each), and it actually takes less time than launching a smaller kernel (256 blocks). WTH.
10,000 launches, with a thread-sync after each:
big kernel - 209 ms (~21 microseconds per launch)
small kernel - 300 ms (30 microseconds per launch)
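For the curious, a minimal sketch of such a benchmark (the empty kernel body and the event-based timing are a reconstruction, not my exact code; cudaThreadSynchronize() was the sync call in the CUDA 2.0 era):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    dim3 grid(65535, 65535);  // biggest 2D grid on compute capability 1.x
    dim3 block(512);          // max threads per block

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < 10000; ++i) {
        emptyKernel<<<grid, block>>>();
        cudaThreadSynchronize();  // wait for each launch to finish
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("10000 launches: %.0f ms (%.1f us per launch)\n", ms, 1000.0f * ms / 10000);
    return 0;
}
```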
Also, apparently I can launch empty kernels bigger than 65535x65535. I must be doing something wrong?
EDIT: Duh! I swapped the grid and block dimensions in the launch parameters.
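For anyone else who hits this: in the <<<...>>> launch configuration the grid dimension comes first, then the block dimension. A minimal sketch of the mix-up (kernel name is a placeholder):

```cpp
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    dim3 grid(65535, 65535);  // blocks in the grid
    dim3 block(512);          // threads per block

    // Correct order: grid first, then block.
    emptyKernel<<<grid, block>>>();

    // My mistake was effectively the reverse, which asks for 512 blocks
    // of 65535 x 65535 threads each and exceeds the per-block limit:
    // emptyKernel<<<block, grid>>>();

    cudaThreadSynchronize();
    return 0;
}
```

The swapped launch fails with an invalid-configuration error, and without error checking it silently looks like it succeeded, which is presumably why grids “bigger” than 65535x65535 appeared to work.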
Now launching a maxed-out kernel results in either a timeout error or a bluescreen, so the overhead is surely larger than about 10 s. Perhaps a Linux machine with no watchdog timer could run such a kernel to completion.