Is there any restriction on the total number of threads that I can specify for a kernel launch?
The only restriction that I’m aware of is the number of threads per block; the number of blocks appears to be limited only by the bounds on the grid dimensions, i.e. roughly 2^16x2^16 (65535 blocks per dimension, as far as I can tell).
I’m currently experiencing the following problem:
For the same kernel and the same data, everything works fine with a grid size of 2^14x1, but the launch fails with 2^14x4: kernel invocation failure, load takes too long. The successful invocation runs for about 10 seconds, far longer than the 5-second limit, which I seem to have worked around by disabling the watchdog.
I’m running CUDA 1.0 with a GeForce 8800 on Windows XP.
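To make the numbers concrete, here is a minimal host-side sketch of the grid arithmetic (plain C++, no CUDA calls). It checks a grid against the per-dimension limit and plans how a tall grid could be issued as several shorter launches. The names `gridFits`, `totalBlocks`, `planSlices`, and `Slice` are hypothetical, and the 65535 limit is an assumption about this generation of hardware; none of this has been run against the actual kernel.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed per-dimension grid limit for this hardware generation.
constexpr unsigned kMaxGridDim = 65535;

// One planned launch: a band of grid rows starting at yOffset.
struct Slice {
    unsigned yOffset;  // first grid row covered by this launch
    unsigned rows;     // number of grid rows in this launch
};

// True if a gridX x gridY launch stays within the assumed limit.
bool gridFits(unsigned gridX, unsigned gridY) {
    return gridX >= 1 && gridY >= 1 &&
           gridX <= kMaxGridDim && gridY <= kMaxGridDim;
}

// Total number of blocks, computed in 64 bits to avoid overflow.
std::uint64_t totalBlocks(unsigned gridX, unsigned gridY) {
    return static_cast<std::uint64_t>(gridX) * gridY;
}

// Split gridY rows into launches of at most rowsPerLaunch rows each.
// For each slice the host would launch a gridX x slice.rows grid and
// pass slice.yOffset to the kernel, which would add it to blockIdx.y
// (that kernel change is hypothetical and not shown here).
std::vector<Slice> planSlices(unsigned gridY, unsigned rowsPerLaunch) {
    assert(rowsPerLaunch > 0);
    std::vector<Slice> slices;
    for (unsigned y = 0; y < gridY; y += rowsPerLaunch) {
        unsigned remaining = gridY - y;
        unsigned rows = remaining < rowsPerLaunch ? remaining : rowsPerLaunch;
        slices.push_back({y, rows});
    }
    return slices;
}
```

With these helpers, `gridFits(1u << 14, 4)` is true and `totalBlocks(1u << 14, 4)` is 65536, so the failing grid is well inside the dimension bounds. `planSlices(4, 1)` yields four launches of the size that already works. If run time scales linearly with the block count (an assumption), the single 2^14x4 launch would need roughly 40 seconds, which points at a time limit rather than a thread-count limit; checking `cudaGetLastError()` after each launch should show which one fails.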