I want to call a kernel for every element of an array, but the array length can exceed the thread limit. So, I used this formula to calculate the number of blocks:
Make it simple:
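Always use a fixed block size, e.g. 1024 (or 256) threads, and put an if condition into the kernel that checks whether the global thread index is greater than or equal to the number of elements; if it is, the thread returns without doing any work.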
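For the blocks,
either launch enough blocks to cover every element (the x dimension of the grid allows up to 2^31 - 1 blocks on compute capability 3.0 and later, regardless of how many blocks the hardware can run at once), or
use a grid-stride loop: a for loop inside the kernel in which each thread steps through the array by the total number of threads in the grid. It covers the case where the number of blocks, ideally set to a multiple of the number of SMs, is too small to give every element its own thread.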
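Both options can be sketched as follows. This is a minimal illustration, not the asker's code: the element-wise operation (doubling each float), the names `blocks_for`, `scale_guarded`, `scale_grid_stride`, and `double_on_device`, and the choice of 4 blocks per SM are all assumptions made for the example.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Ceiling division: the smallest block count with blocks * threads >= n.
__host__ __device__ inline unsigned int blocks_for(size_t n, unsigned int threads) {
    return (unsigned int)((n + threads - 1) / threads);
}

// Option 1: one thread per element; surplus threads in the last block exit early.
__global__ void scale_guarded(float *data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // bounds check
    data[i] *= 2.0f;     // hypothetical per-element work
}

// Option 2: grid-stride loop; correct for any grid size, including a small one.
__global__ void scale_grid_stride(float *data, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;  // total threads in the grid
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= 2.0f;
}

// Hypothetical host helper: copies the array to the device, doubles it, copies it back.
cudaError_t double_on_device(float *host, size_t n, bool use_grid_stride) {
    if (n == 0) return cudaSuccess;  // a launch with zero blocks is invalid
    const unsigned int threads = 256;
    float *dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, n * sizeof(float));
    if (err != cudaSuccess) return err;
    err = cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        if (use_grid_stride) {
            int device = 0, sms = 0;
            cudaGetDevice(&device);
            cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, device);
            // Assumption: 4 blocks per SM; tune for the actual kernel.
            scale_grid_stride<<<4 * sms, threads>>>(dev, n);
        } else {
            scale_guarded<<<blocks_for(n, threads), threads>>>(dev, n);
        }
        err = cudaGetLastError();  // reports launch configuration errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err;
}
```

With 256 threads per block, 1000 elements need 4 blocks, and the last 24 threads of the fourth block fail the bounds check and do nothing. The grid-stride version gives the same result for any block count, so the launch configuration can be tuned without touching the kernel.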
If you want to process pairs of elements, either
use one thread per first element and loop over the second element index inside the kernel, or
use a multi-dimensional block and grid size with dim3, where one dimension (e.g. x) indexes the first element of the pair and the other dimension (e.g. y) indexes the second element, or
use a pairing function such as the Cantor pairing function to convert between a one-dimensional index and a pair of indices. A simpler mapping also works: use N² as the size, take i/N and i%N as the two indices, and return early if both indices are the same.
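The dim3 variant and the simple i/N, i%N mapping can be sketched like this. The per-pair operation (a product written to an N x N output), the `Pair` struct, and the names `unflatten`, `pairs_2d`, `pairs_flat`, and `launch_pairs_2d` are assumptions for illustration only.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

struct Pair { unsigned int first; unsigned int second; };

// Maps a flat index k in [0, n*n) to the pair (k / n, k % n).
__host__ __device__ inline Pair unflatten(unsigned long long k, unsigned int n) {
    return Pair{ (unsigned int)(k / n), (unsigned int)(k % n) };
}

// 2D variant: x indexes the first element of the pair, y the second.
__global__ void pairs_2d(const float *in, float *out, unsigned int n) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= n || j >= n || i == j) return;  // out of range, or an element paired with itself
    out[(size_t)i * n + j] = in[i] * in[j];  // hypothetical per-pair work
}

// Flat variant: a 1D grid-stride loop over n*n indices, decoded with / and %.
__global__ void pairs_flat(const float *in, float *out, unsigned int n) {
    unsigned long long total = (unsigned long long)n * n;
    unsigned long long stride = (unsigned long long)gridDim.x * blockDim.x;
    for (unsigned long long k = (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
         k < total; k += stride) {
        Pair p = unflatten(k, n);
        if (p.first == p.second) continue;   // skip identical indices
        out[k] = in[p.first] * in[p.second];
    }
}

// Hypothetical launcher for the 2D variant; d_in and d_out are device pointers.
void launch_pairs_2d(const float *d_in, float *d_out, unsigned int n) {
    dim3 block(16, 16);  // 256 threads per block
    // The y dimension of a grid is limited to 65535 blocks, which bounds n here.
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    pairs_2d<<<grid, block>>>(d_in, d_out, n);
}
```

Both kernels write the pair (i, j) to position i * n + j and leave the diagonal untouched. If the pair (i, j) is equivalent to (j, i) for the actual problem, the check could become `i >= j` to halve the work, at the cost of idle threads.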