I have a short question: I have sometimes seen code that checks whether the calculated thread id is smaller than the intended number of threads. Something like this:
int id = blockIdx.x * blockDim.x + threadIdx.x;
if(id >= threadCount) return;
I read somewhere that this is necessary in case more threads are started than were actually requested, for performance reasons or because of the underlying hardware structure (e.g. I guess the fact that a warp always has 32 threads could be a reason).
However, I have never seen this condition occur. Was there some change in compute capability that made the check unnecessary? Is it predictable when this will occur?
There is a very simple explanation. Often more threads than necessary are launched because the number of required threads is not divisible by the block size. For example, you could have an array of 300 elements. If you use blocks of size 128, you will need 3 blocks to be able to use 1 thread per element. However, those three blocks will contain 384 threads.
Threads 300 to 383 must not access the array because the access would be out of bounds.
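To make the 300-element example concrete, here is a minimal sketch. The kernel `scale` and the helper `blocksFor` are hypothetical names made up for illustration; only the rounding-up of the block count and the guard itself come from the explanation above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements,
// i.e. n / blockSize rounded up.
static int blocksFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical kernel: doubles each element of an n-element array.
__global__ void scale(float *data, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;   // with n == 300, threads 300..383 exit here
    data[id] *= 2.0f;
}

int main()
{
    const int n = 300;
    const int blockSize = 128;
    const int gridSize = blocksFor(n, blockSize);   // 3 blocks

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    scale<<<gridSize, blockSize>>>(d_data, n);
    cudaDeviceSynchronize();

    // prints "launched 384 threads for 300 elements"
    printf("launched %d threads for %d elements\n", gridSize * blockSize, n);

    cudaFree(d_data);
    return 0;
}
```

Without the `if (id >= n) return;` line, the last 84 threads of the third block would write past the end of the allocation. If the problem size happens to be an exact multiple of the block size, the guard never triggers, which would explain not seeing the condition occur.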
I often do it the other way around: I wrap the kernel body in a check that the id is less than my problem size, to avoid out-of-bounds array reads and writes. The check you are seeing in that code is probably just another way of expressing the problem size.
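Sometimes people also launch a number of threads that is a multiple of the warp size and of the number of SMs, to get better GPU occupancy and make the most of coalesced memory access. In that case the number of launched threads usually differs from the problem size, so the bounds check is needed as well.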
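As a rough sketch of sizing the launch from the hardware rather than from the data: the example below uses a grid-stride loop, which is not mentioned in this thread but is a common way to make any grid size valid. The kernel name `scaleStrided`, the helper `allEqual`, and the choice of 2 blocks per SM are assumptions for illustration, not tuned values.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with a grid-stride loop: each thread handles
// elements id, id + stride, id + 2 * stride, ... while id < n.
// The loop condition plays the role of the bounds check.
__global__ void scaleStrided(float *data, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int id = blockIdx.x * blockDim.x + threadIdx.x; id < n; id += stride)
        data[id] *= 2.0f;
}

// Hypothetical host helper: true if all n values equal expected.
static bool allEqual(const float *v, int n, float expected)
{
    for (int i = 0; i < n; ++i)
        if (v[i] != expected) return false;
    return true;
}

int main()
{
    const int n = 300;
    const int blockSize = 128;   // a multiple of the warp size (32)

    int numSMs = 0;
    if (cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0)
            != cudaSuccess || numSMs < 1) {
        printf("could not query the device\n");
        return 1;
    }
    const int gridSize = 2 * numSMs;   // assumed: 2 blocks per SM

    float h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleStrided<<<gridSize, blockSize>>>(d_data, n);

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    // Every element should have been doubled exactly once.
    return allEqual(h_data, n, 2.0f) ? 0 : 1;
}
```

Here the grid can be larger or smaller than the array: surplus threads fail the `id < n` test immediately, and if there are fewer threads than elements each thread simply loops more than once.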