Can this be parallelized?

But according to the lines:

int BlockOff = bIDx << 8; // 256 threads per block, each handles one element

int address0 = tIDx + BlockOff;

When the block ID is 0, I can access elements 0-255, and when the block ID is 1, I can access elements 256-511, can't I?

if blockID is 1, BlockOff=0, tIDx is still 0~255,

so you are still accessing 0~255, because tIDx + BlockOff is still in the range of 0~255

Why???

int BlockOff=bIDx<<8

Doesn't that mean each block jumps ahead by 256 elements? Shifting bIDx left by 8 bits is the same as bIDx*256…

Why would I be accessing elements 0~255 for the first block, then 1~256 for the second block?
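
For what it's worth, the index arithmetic can be checked on the host without a GPU. Below is a minimal sketch in plain C++. It assumes bIDx and tIDx stand for blockIdx.x and threadIdx.x, with 256 threads per block. The function names element_address and touch_counts are made up for illustration and are not from the original kernel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side stand-in for the kernel's index computation.
// Assumption: bIDx plays the role of blockIdx.x, tIDx of threadIdx.x,
// and each block has exactly 256 threads.
int element_address(int bIDx, int tIDx) {
    assert(tIDx >= 0 && tIDx < 256);
    int BlockOff = bIDx << 8;   // shift left by 8 bits == bIDx * 256
    return tIDx + BlockOff;
}

// Emulates a launch of numBlocks blocks and counts how many times
// each element index is produced by the arithmetic above.
std::vector<int> touch_counts(int numBlocks) {
    std::vector<int> counts(numBlocks * 256, 0);
    for (int b = 0; b < numBlocks; ++b) {
        for (int t = 0; t < 256; ++t) {
            ++counts[element_address(b, t)];
        }
    }
    return counts;
}
```

With this arithmetic, block 0 covers indices 0 to 255 and block 1 covers 256 to 511, so every element ends up with a count of exactly 1 and no two blocks overlap.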

Sorry, I misunderstood it…

Now I can do the intersection of 16M and 1M in 4.5 ms, thanks guys :)