Variable number of calls: last block performance question

Hello all,

I have a kernel that takes the data, organizes it into blocks of 32 threads each, and does some computation on it.

However, the problem is that I never know in advance how much data I will have. Therefore, I cannot simply assume that it will divide evenly into a whole number of blocks.

Suppose I have

N=1000

points to process. With 32 threads per block, I will have 31 full blocks and 8 points left over. How do I process these remaining points? If I simply launch 32 blocks, the 24 extra threads in the last block will read the wrong memory and write to places they are not supposed to. If I add a conditional statement

if (bx*bdim + tx >= N)
    return;

then it will slow down the entire program. I’m sure there is some smart “CUDA” way of solving this kind of problem, but I couldn’t find it.

I would really appreciate any help.

The slowdown from the bounds check is negligible if the kernel executes more than just a few instructions, so the conditional is indeed the recommended solution.
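For what it's worth, here is a minimal sketch of how the pieces usually fit together: round the block count up on the host, and let the surplus threads return early in the kernel. The names scale_kernel and scale_on_device, and the "double every value" computation, are made up for illustration and are not from the original post. With 32 threads per block each block is exactly one warp, so only the single warp in the last block has a mix of working and exited threads.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each of the n input values.
__global__ void scale_kernel(const float* in, float* out, int n)
{
    // Global index, the same quantity as bx*bdim + tx in the question.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Surplus threads in the last block exit before touching memory.
    if (i >= n)
        return;

    out[i] = 2.0f * in[i];
}

// Hypothetical host helper: copies the data over, launches enough blocks
// to cover every point, and copies the result back.
std::vector<float> scale_on_device(const std::vector<float>& host_in)
{
    const int n = static_cast<int>(host_in.size());
    std::vector<float> host_out(n);
    if (n == 0)
        return host_out;

    const int threads = 32;
    // Round up: N = 1000 gives (1000 + 31) / 32 = 32 blocks.
    const int blocks = (n + threads - 1) / threads;

    const size_t bytes = n * sizeof(float);
    float* dev_in = nullptr;
    float* dev_out = nullptr;
    cudaMalloc(&dev_in, bytes);
    cudaMalloc(&dev_out, bytes);
    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    scale_kernel<<<blocks, threads>>>(dev_in, dev_out, n);

    // Minimal launch check; real code should check every CUDA call.
    cudaError_t err = cudaGetLastError();
    assert(err == cudaSuccess);

    // cudaMemcpy waits for the kernel on the default stream to finish.
    cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
    return host_out;
}
```

Calling scale_on_device with a vector of 1000 values launches 32 blocks, and the last 24 threads of block 31 simply return without reading or writing anything.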
