Hey, I am trying to run a kernel with 34 blocks and 1771 threads per block, but I get the error cudaErrorInvalidConfiguration (9). 1771 is too many threads. I don’t care if they run concurrently, but I use threadIdx.x as an array index, so I need it to go up to 1771 and in some cases a lot higher. Is there a way to do this? If not, how should I approach the problem?
Rewrite your indexing so it doesn’t exceed the maximum number of threads per block (1024).
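Use grid-stride loops: each thread processes several elements, so the launch configuration no longer has to match the array size.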
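To make the grid-stride idea concrete, here is a minimal sketch. It assumes the data is one flat float array of n elements that just needs to be scaled. The names scale_grid_stride and count_mismatches, and the 34 x 256 launch, are made up for illustration and are not from the original kernel.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread starts at its global index and advances by the total number of
// launched threads, so every element below n is covered for any launch size.
__global__ void scale_grid_stride(float *data, int n, float factor)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Hypothetical host helper: runs the kernel on n ones and returns how many
// elements are not equal to factor afterwards (0 = success, -1 = CUDA error).
int count_mismatches(int n)
{
    const float factor = 2.0f;
    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // 256 threads per block stays under the 1024 limit; there are far fewer
    // threads than elements, and the loop in the kernel makes up the difference.
    scale_grid_stride<<<34, 256>>>(dev, n, factor);
    cudaError_t launch_err = cudaGetLastError();      // invalid configuration shows up here
    cudaError_t sync_err = cudaDeviceSynchronize();   // errors during execution show up here

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (launch_err != cudaSuccess || sync_err != cudaSuccess) return -1;

    int bad = 0;
    for (float v : host) {
        if (v != factor) ++bad;
    }
    return bad;
}
```

Calling count_mismatches(34 * 1771) from main should return 0 on a working device. The same kernel handles larger arrays without touching the launch configuration, which is the main reason to prefer this pattern when the element count can grow well past 1024.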
If you reduce the threads per block you could maybe use something like this:
uint32_t index = (blockDim.x * blockIdx.x + threadIdx.x) % 1771;
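Note that the modulo alone makes every 1771 consecutive global indices land on the same 0..1770 range. If the original kernel used blockIdx.x as a row (34 of them) and threadIdx.x as a column (1771 of them), which is only my guess at the layout, you can recover both from the flat index and guard against the surplus threads in the last block. A rough sketch, with fill_row_col and count_row_col_mismatches as made-up names:

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Recovers (row, col) from the flat global index and writes a value that
// encodes both, so the host can check the mapping.
__global__ void fill_row_col(uint32_t *out, uint32_t rows, uint32_t cols)
{
    uint32_t idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx >= rows * cols) return;   // surplus threads in the last block do nothing
    uint32_t row = idx / cols;        // plays the role blockIdx.x had before (assumed)
    uint32_t col = idx % cols;        // plays the role threadIdx.x had before (assumed)
    out[row * cols + col] = row * 10000u + col;
}

// Hypothetical host helper: returns the number of wrong elements
// (0 = success, -1 = CUDA error).
int count_row_col_mismatches(uint32_t rows, uint32_t cols)
{
    const uint32_t total = rows * cols;
    const uint32_t threads = 256;                           // well under the 1024 limit
    const uint32_t blocks = (total + threads - 1) / threads; // round up to cover everything

    uint32_t *dev = nullptr;
    if (cudaMalloc(&dev, total * sizeof(uint32_t)) != cudaSuccess) return -1;
    fill_row_col<<<blocks, threads>>>(dev, rows, cols);
    cudaError_t launch_err = cudaGetLastError();
    cudaError_t sync_err = cudaDeviceSynchronize();

    std::vector<uint32_t> host(total);
    cudaMemcpy(host.data(), dev, total * sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (launch_err != cudaSuccess || sync_err != cudaSuccess) return -1;

    int bad = 0;
    for (uint32_t i = 0; i < total; ++i) {
        if (host[i] != (i / cols) * 10000u + (i % cols)) ++bad;
    }
    return bad;
}
```

With rows = 34 and cols = 1771 this launches 236 blocks of 256 threads, and only the last block has idle threads. Without the bounds check those extra threads would wrap around through the modulo and write to elements that other threads already own.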