Calling a kernel with more threads than the per-block thread limit

I want to call a kernel for every element of an array, but the array length can exceed the per-block thread limit. So I used this formula to calculate the number of threads and blocks:

numThreads = chunkCount > 1024 ? 1024 : chunkCount;
numBlocks = (chunkCount + numThreads - 1) / numThreads;

Is this right? What if I want to execute the kernel for every pair of elements? How do I calculate the indices of the elements in that case?

Keep it simple:
always use a fixed block size, e.g. 1024 (or 256) threads, and put an if condition into the kernel so that threads whose global index is at or beyond the number of elements do nothing.
For the blocks,

  • either launch enough blocks to cover all elements (the number of blocks can be very high regardless of the hardware: up to 2^31 - 1 in the x dimension on compute capability 3.0 and newer), or
  • use a grid-stride loop (a for loop within the kernel in which each thread processes several elements; it takes over when the number of blocks, which ideally would be set to a multiple of the number of SMs, is too low to give every element its own thread).
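
Both options can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the kernel names, the `scale` operation, the block size of 256 and the factor of 4 blocks per SM are all assumptions chosen for the example, and error checking is omitted. It also assumes that `n` is small enough for the index arithmetic to fit in an `int`.

```cuda
#include <cuda_runtime.h>

// Option 1: one thread per element. The grid is rounded up to whole blocks,
// so threads whose index falls past the end of the array must do nothing.
__global__ void scaleGuarded(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        data[i] *= factor;
    }
}

// Option 2: grid-stride loop. Each thread handles i, i + stride, i + 2 * stride, ...
// so the kernel is correct for any grid size, including one with fewer threads than n.
__global__ void scaleGridStride(float *data, int n, float factor)
{
    int stride = gridDim.x * blockDim.x;  // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Hypothetical host helper for option 1: enough blocks to cover all n elements.
void launchGuarded(float *d_data, int n, float factor)
{
    const int threads = 256;                         // fixed block size (assumption)
    const int blocks = (n + threads - 1) / threads;  // ceil(n / threads)
    if (blocks > 0) {
        scaleGuarded<<<blocks, threads>>>(d_data, n, factor);
    }
}

// Hypothetical host helper for option 2: grid size tied to the number of SMs.
void launchGridStride(float *d_data, int n, float factor)
{
    int device = 0, smCount = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
    const int threads = 256;
    const int blocks = 4 * smCount;  // a multiple of the SM count; the 4 is arbitrary
    scaleGridStride<<<blocks, threads>>>(d_data, n, factor);
}
```

Note that the block count is the element count divided by the block size, rounded up. The grid-stride version does not depend on that division at all, which is why it tolerates any grid size.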

If you want to process pairs of elements, either

  • use one thread per first element and have a for loop in the kernel over the index of the second element, or
  • use a multi-dimensional block and grid size with dim3, where one dimension (e.g. x) gives the index of the first element of the pair and the other dimension (e.g. y) gives the index of the second element, or
  • use a pairing function such as the Cantor pairing function to convert between a one-dimensional index and a two-dimensional index and back. This can even be a simplified mapping in which the thread just returns if both indices are the same. E.g. use N² as the total size and i/N and i%N as the two indices.
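
The second and third options could look roughly like the sketch below. It is an illustration under stated assumptions rather than a definitive implementation: the pairwise-difference operation, the kernel and helper names, the 16x16 block shape and the cap of 65535 blocks are all made up for the example, `n` is assumed to be greater than 0, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// 2D launch: x indexes the first element of the pair, y the second.
__global__ void pairDiff2D(const float *data, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // first element of the pair
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // second element of the pair
    if (i < n && j < n) {
        out[(size_t)i * n + j] = data[i] - data[j];
    }
}

// 1D launch over n * n linear indices, decoded with / and %.
// Threads skip the pairs in which both indices are the same.
__global__ void pairDiffLinear(const float *data, float *out, int n)
{
    size_t total = (size_t)n * n;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t k = (size_t)blockIdx.x * blockDim.x + threadIdx.x; k < total; k += stride) {
        int i = (int)(k / n);  // first element of the pair
        int j = (int)(k % n);  // second element of the pair
        if (i == j) continue;  // self-pair: nothing to do
        out[k] = data[i] - data[j];
    }
}

// Hypothetical host helper for the 2D variant.
// gridDim.y is limited to 65535, so this shape covers n up to 65535 * 16.
void launchPairDiff2D(const float *d_data, float *d_out, int n)
{
    dim3 threads(16, 16);  // 256 threads per block
    dim3 blocks((n + threads.x - 1) / threads.x,
                (n + threads.y - 1) / threads.y);
    pairDiff2D<<<blocks, threads>>>(d_data, d_out, n);
}

// Hypothetical host helper for the linear variant.
void launchPairDiffLinear(const float *d_data, float *d_out, int n)
{
    const int threads = 256;
    size_t total = (size_t)n * n;
    size_t needed = (total + threads - 1) / threads;
    // Arbitrary cap; the grid-stride loop covers whatever the grid does not.
    int blocks = (int)(needed < 65535 ? needed : 65535);
    pairDiffLinear<<<blocks, threads>>>(d_data, d_out, n);
}
```

Because the linear variant never writes the entries with i == j, `out` should be initialized beforehand (e.g. with cudaMemset) if those entries are read later.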

Thank you! This is very helpful. Btw, is this how Thrust operates?

Hi @GulgDev,
I am not sure how Thrust does it internally. Probably in a quite similar way. Perhaps somebody who knows can chime in.

