Hello all,

I have code that takes the data, organizes it into blocks of 32 threads each, and does some computation on it.

However, the problem is that I never know in advance how much data I will have. Therefore, I cannot simply assume that it will divide evenly into a whole number of blocks.

Suppose I have

```
N=1000
```

points to process. With 32 threads per block, I will have 31 full blocks and 8 points left over. How do I process these remaining points? If I simply launch 32 blocks, the 24 extra threads in the last block will access the wrong memory and write to places they are not supposed to. If I put in a conditional statement

```
if (bx*bdim + tx >= N)
    return;
```

then it will slow down the entire program. I’m sure there is some smart “CUDA” way of solving this kind of problem, but I couldn’t find it.
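To make the question concrete, here is a minimal sketch of the guarded version I am describing. The names `processPoints` and `blocksFor` and the doubling operation are placeholders I made up for this post, not my real code. The block count is rounded up so that the last block covers the 8 leftover points, and the conditional keeps the extra threads from touching memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n points, rounded up.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Placeholder computation (doubles each value); stands in for the real work.
__global__ void processPoints(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;  // threads past the end of the data do nothing
    data[i] *= 2.0f;
}

int main() {
    const int N = 1000;
    const int threads = 32;
    const int blocks = blocksFor(N, threads);  // 32 blocks, i.e. 1024 threads for 1000 points

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, N * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, N * sizeof(float));

    processPoints<<<blocks, threads>>>(d_data, N);
    cudaError_t err = cudaDeviceSynchronize();  // wait for the kernel and collect any error

    cudaFree(d_data);
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

With `N=1000` this launches 32 blocks, and only the last block has threads that hit the early `return`.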

I would really appreciate any help.