I have a program that takes the data, organizes it into blocks of 32 threads each, and does some computation on it.
However, the problem is that I never know how much data I will have. Therefore, I cannot simply assume that it will divide evenly into a whole number of blocks.
Suppose I have 1000 points to process. With 32 threads per block, I will have 31 full blocks and 8 points left over. How do I process these remaining points? If I simply launch 32 blocks, the 24 extra threads in the last block will access the wrong memory and write to places they are not supposed to. If I put in a conditional statement
if(bx*bdim + tx >= N) return;
then it will slow down the entire program. I’m sure there is some smart “CUDA” way of solving this problem, but I couldn’t find it.
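For concreteness, the guarded approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the kernel name process, the helper blocksFor, the doubling operation, and the figure of 1000 points (31 full blocks of 32 plus 8 left over) are placeholders, and I am assuming bx, bdim and tx stand for blockIdx.x, blockDim.x and threadIdx.x. Error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: doubling each element stands in for the real computation.
__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i >= n) return;                             // threads past the end of the data do nothing
    data[i] *= 2.0f;
}

// Number of blocks needed to cover n points, rounding up.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int n = 1000;              // assumed example size
    const int threadsPerBlock = 32;
    const int blocks = blocksFor(n, threadsPerBlock);  // 31 full blocks plus 1 partial block

    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    process<<<blocks, threadsPerBlock>>>(dev, n);

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("blocks = %d, first = %f, last = %f\n", blocks, host[0], host[n - 1]);
    return 0;
}
```

Here the grid size is rounded up so that every point gets a thread, and only the threads in the final, partially filled block take the early return.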
I would really appreciate any help.