In the example [1], the loop
for (int i = index; i < n; i += stride)
would have thread #x process the following indices (assuming a stride of 256):
t0: 0, 256, 512, ...
t1: 1, 257, 513, ...
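To make the pattern above concrete, here is a minimal host-side sketch in plain C++ (no GPU involved) that lists the values of i a single thread would visit. The helper name `visitedIndices` is hypothetical and not part of CUDA, and the values n = 600 and stride = 256 used below are assumptions chosen only for illustration.

```cpp
#include <cassert>
#include <vector>

// Host-side sketch only: returns the values of i that one thread would visit in
//   for (int i = index; i < n; i += stride)
// `visitedIndices` is a hypothetical helper, not a CUDA function.
std::vector<int> visitedIndices(int index, int stride, int n) {
    assert(stride > 0);  // a non-positive stride could loop forever
    std::vector<int> visited;
    for (int i = index; i < n; i += stride) {
        visited.push_back(i);
    }
    return visited;
}
```

With these assumed values, `visitedIndices(0, 256, 600)` gives 0, 256, 512 and `visitedIndices(1, 256, 600)` gives 1, 257, 513.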
What if the code starts thread number 256? Wouldn't that overwrite the calculation already done by thread #0?
[1]: An Even Easier Introduction to CUDA | NVIDIA Technical Blog