Is there any restriction on the total number of threads that I can specify for a kernel?
The only restriction that I’m aware of is the number of threads per block; the number of blocks appears to be limited only by the bounds on the grid dimensions, i.e. 65535 x 65535 (just under 2^16x2^16).
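The actual limits can be read from the runtime rather than assumed. Below is a minimal sketch, assuming a toolkit recent enough to provide `cudaDeviceGetAttribute`; the `LaunchLimits` struct and both function names are made up for illustration and are not part of CUDA.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical container for the limits of interest (not a CUDA type).
struct LaunchLimits {
    int maxThreadsPerBlock;
    int maxGridX, maxGridY, maxGridZ;
    int kernelTimeout;  // non-zero if the device enforces a run-time limit on kernels
};

// Fill *out with the limits reported for the given device.
cudaError_t queryLaunchLimits(int device, LaunchLimits *out)
{
    const cudaDeviceAttr attrs[] = {
        cudaDevAttrMaxThreadsPerBlock,
        cudaDevAttrMaxGridDimX, cudaDevAttrMaxGridDimY, cudaDevAttrMaxGridDimZ,
        cudaDevAttrKernelExecTimeout
    };
    int *fields[] = {
        &out->maxThreadsPerBlock,
        &out->maxGridX, &out->maxGridY, &out->maxGridZ,
        &out->kernelTimeout
    };
    for (int i = 0; i < 5; ++i) {
        cudaError_t err = cudaDeviceGetAttribute(fields[i], attrs[i], device);
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}

// Print the limits; returns false if the query failed.
bool printLaunchLimits(int device)
{
    LaunchLimits l;
    cudaError_t err = queryLaunchLimits(device, &l);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    std::printf("max threads per block: %d\n", l.maxThreadsPerBlock);
    std::printf("max grid dims: %d x %d x %d\n", l.maxGridX, l.maxGridY, l.maxGridZ);
    std::printf("kernel run-time limit: %s\n", l.kernelTimeout ? "yes" : "no");
    return true;
}
```

Calling `printLaunchLimits(0)` from `main` reports whether device 0 has a kernel run-time limit at all, which is worth checking before blaming the grid size.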
I’m currently experiencing the following problem:
For the same kernel and the same data, everything seems to work fine with a grid size of 2^14x1, but the launch fails with 2^14x4: kernel invocation failure - load takes too long. The successful invocation runs for about 10 seconds, far longer than the 5-second limit, which I seemed to have worked around by disabling the watchdog.
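One way to keep each launch short is to cut the grid into slices along y and launch them one after another. This is only a sketch under assumptions: `workKernel` is a hypothetical stand-in for the real kernel, `launchInSlices` and `blockYOffset` are invented names, and it assumes the blocks are independent, so the kernel only needs the y offset of its slice to reconstruct the full block index.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: each thread writes its global index.
// blockYOffset is the y index of the first block row covered by this launch.
__global__ void workKernel(unsigned int *out, unsigned int blockYOffset)
{
    unsigned int by    = blockIdx.y + blockYOffset;   // block row in the full grid
    unsigned int block = by * gridDim.x + blockIdx.x; // linear block index
    unsigned int i     = block * blockDim.x + threadIdx.x;
    out[i] = i;
}

// Run a gridX x gridY grid as several launches of at most rowsPerLaunch rows.
cudaError_t launchInSlices(unsigned int *dOut, unsigned int gridX, unsigned int gridY,
                           unsigned int threadsPerBlock, unsigned int rowsPerLaunch)
{
    if (rowsPerLaunch == 0) return cudaErrorInvalidValue;

    for (unsigned int y0 = 0; y0 < gridY; y0 += rowsPerLaunch) {
        unsigned int left = gridY - y0;
        unsigned int rows = left < rowsPerLaunch ? left : rowsPerLaunch;

        dim3 grid(gridX, rows);
        workKernel<<<grid, threadsPerBlock>>>(dOut, y0);

        cudaError_t err = cudaGetLastError();   // catches an invalid launch configuration
        if (err != cudaSuccess) return err;

        err = cudaDeviceSynchronize();          // finish this slice before submitting the next
        if (err != cudaSuccess) return err;     // a timed-out slice is reported here
    }
    return cudaSuccess;
}
```

With a 2^14x4 grid and `rowsPerLaunch = 1`, this issues four launches of the 2^14x1 shape that already works. Whether each slice stays under the time limit still depends on the kernel.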
It seems I’m bumping into the same 5-second problem here after all, since the 10 seconds quoted above included memory copy time. The kernel itself takes about 7 seconds, and the launch sometimes works and sometimes doesn’t.
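To separate kernel time from copy time, the kernel can be bracketed with CUDA events so that only the launch itself is measured. A minimal sketch follows; `scaleKernel` and `timeKernelMs` are hypothetical names, and the trivial kernel merely stands in for the real one.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: simple element-wise update.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Measure only the kernel's execution time, in milliseconds.
// Host/device copies are expected to happen outside this function.
cudaError_t timeKernelMs(float *dData, int n, float *ms)
{
    cudaEvent_t start, stop;
    cudaError_t err = cudaEventCreate(&start);
    if (err != cudaSuccess) return err;
    err = cudaEventCreate(&stop);
    if (err != cudaSuccess) { cudaEventDestroy(start); return err; }

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;

    cudaEventRecord(start);                        // timestamp before the kernel
    scaleKernel<<<blocks, threads>>>(dData, n);
    err = cudaGetLastError();                      // launch configuration errors
    cudaEventRecord(stop);                         // timestamp after the kernel

    if (err == cudaSuccess) err = cudaEventSynchronize(stop);  // wait for the kernel to finish
    if (err == cudaSuccess) err = cudaEventElapsedTime(ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err;
}
```

Because both events are recorded on the same stream as the kernel, the elapsed time excludes any `cudaMemcpy` calls made before or after the function.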