Hi, what is the best way to handle the case where the number of threads I need is not divisible by the block size? I know I can do the calculations for the leftover elements on the CPU, but I would like to do everything on the GPU. I was able to run all the threads on the GPU; to avoid errors, I put the entire kernel body inside an if statement so that it executes only if the index is less than the number of threads needed:
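__global__ void kernel(...)
{
    // Global thread index (__mul24 is the 24-bit integer multiply intrinsic)
    int index = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    // Threads whose index falls past the end skip the work
    if (index < numBodies) {
        (kernel function)
        ...
    }
}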
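For context, here is a minimal sketch of what the launch side might look like with this approach, with the block count rounded up so the last block is only partly used. The names scaleKernel, divUp and launchScale, the float data and the example block size of 256 are all hypothetical stand-ins, not my actual code:

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-body work: doubles each element.
__global__ void scaleKernel(float *data, int numBodies)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < numBodies) {      // surplus threads in the last block do nothing
        data[index] *= 2.0f;
    }
}

// Integer division rounded up: the number of blocks needed to cover n threads.
static int divUp(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical host helper: copies hostData to the GPU, runs the kernel,
// and copies the result back. Returns the first CUDA error, or cudaSuccess.
static cudaError_t launchScale(float *hostData, int numBodies, int blockSize)
{
    float *devData = nullptr;
    size_t bytes = numBodies * sizeof(float);
    cudaError_t err = cudaMalloc((void **)&devData, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // e.g. numBodies = 1000, blockSize = 256 -> 4 blocks (1024 threads, 24 idle)
        scaleKernel<<<divUp(numBodies, blockSize), blockSize>>>(devData, numBodies);
        err = cudaGetLastError();  // reports launch configuration errors
    }
    if (err == cudaSuccess) {
        // This copy waits for the kernel to finish before reading the results
        err = cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(devData);
    return err;
}
```

With 1000 elements and 256 threads per block, this would launch 4 blocks, and the 24 extra threads in the last block fail the index check and return without touching memory.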
Is there any problem with this solution?
Thanks