What if the number of threads is not divisible by the block size?

Hi, I am wondering what the best way is to handle the situation where the number of threads is not divisible by the block size. I know I can do the calculations for the extra threads on the CPU, but I want to know if I can do everything on the GPU. I was able to run all the threads on the GPU, and to avoid errors I put my whole kernel body inside an if statement so that it is executed only if the index is less than the number of threads.

__global__ void kernel(...)
{
    int index = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;

    // Surplus threads in the last block skip the body
    if (index < numBodies) {
        // (kernel function body)
    }
}

Is there any problem with this solution?


It’s fine. The performance cost is negligible: every thread that has work to do evaluates the condition as true, so the only warp that can diverge is the last one, where the surplus threads simply skip the body.

So yes, divide numBodies by the block size, round the result up if the remainder is non-zero, and you have your grid size. Then do this check in the kernel.
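For anyone reading later, here is a minimal sketch of that recipe: round the grid size up on the host, then guard the kernel body with the bounds check. The names gridSizeFor, scaleKernel and launchScale are made up for illustration, and the kernel body (scaling an array) is only a stand-in for the real work.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n threads.
// Integer ceiling division, so a non-zero remainder adds one extra block.
inline int gridSizeFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical kernel: the bounds check makes the surplus threads
// of the last block do nothing.
__global__ void scaleKernel(float *data, int numBodies, float factor)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < numBodies) {
        data[index] *= factor;
    }
}

// Hypothetical host wrapper: launches enough blocks to cover numBodies
// elements of a device array and waits for the kernel to finish.
inline cudaError_t launchScale(float *d_data, int numBodies, int blockSize, float factor)
{
    int gridSize = gridSizeFor(numBodies, blockSize);
    scaleKernel<<<gridSize, blockSize>>>(d_data, numBodies, factor);
    return cudaDeviceSynchronize();
}
```

With numBodies = 1000 and a block size of 256 this launches 4 blocks (1024 threads), and the last 24 threads fail the check and exit without touching memory.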

This seems to be a good solution!
In a problem like yours I needed 3 float values, so I used the otherwise empty fourth field of a float4 and filled it with a marker value. Then I used an if statement to check that value when reading the data.
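One possible reading of this float4 approach, as a rough sketch: pad the array up to a multiple of the block size and use the w field as a validity flag. The flag convention (1.0f for real data, 0.0f for padding) and the names isValid, paddedCount and shiftX are assumptions made for illustration, not details given in the post.

```cuda
#include <cuda_runtime.h>

// Assumed convention: w != 0 marks a real element, w == 0 marks padding.
__host__ __device__ inline bool isValid(float4 p)
{
    return p.w != 0.0f;
}

// Hypothetical helper: array length rounded up to a multiple of blockSize,
// so every launched thread has an element it can safely read.
inline int paddedCount(int n, int blockSize)
{
    return ((n + blockSize - 1) / blockSize) * blockSize;
}

// Hypothetical kernel: reads its element unconditionally (the array is
// padded) and only updates it when the flag says the data is real.
__global__ void shiftX(float4 *pos, float dx)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    float4 p = pos[index];
    if (isValid(p)) {
        p.x += dx;
        pos[index] = p;
    }
}
```

Compared with the index check, this costs some extra memory for the padding and requires the host to write the flags, but the kernel no longer needs to be passed the element count.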


Thanks! :thumbup: