This line is a race condition: you have multiple threads simultaneously reading and updating returnIndex.
if (isValid) {
ret[++returnIndex] = idx % 7;
}
Solutions are many… it’s a common problem.
If valid values are rare, a global atomic increment will work nicely.
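A minimal sketch of the global atomic increment idea, under stated assumptions: the validity test (idx % 5 == 0), the kernel name collectValid, the counter returnCount and the host wrapper runCollectValid are all made up for illustration and are not from the original code. Global-memory atomics need compute capability 1.1 or higher. Note that atomicAdd returns the old value, so slots start at 0 here rather than at 1 as with ++returnIndex, and the order of values in ret is not deterministic.

```cuda
#include <cuda_runtime.h>

// Each valid thread reserves a unique slot with one atomic increment.
__global__ void collectValid(int *ret, int *returnCount, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    bool isValid = (idx < n) && (idx % 5 == 0);  // hypothetical stand-in test
    if (isValid) {
        // atomicAdd returns the value before the increment,
        // so no two threads ever receive the same slot.
        int slot = atomicAdd(returnCount, 1);
        ret[slot] = idx % 7;
    }
}

// Hypothetical host wrapper: fills h_ret (capacity n) and returns the count.
int runCollectValid(int *h_ret, int n)
{
    int *d_ret, *d_count;
    cudaMalloc(&d_ret, n * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemset(d_count, 0, sizeof(int));

    int threads = 128;
    collectValid<<<(n + threads - 1) / threads, threads>>>(d_ret, d_count, n);

    int count = 0;
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_ret, d_ret, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_ret);
    cudaFree(d_count);
    return count;
}
```

Every valid thread hits the same counter, so this serializes under contention. That is why it suits the rare-value case.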
But if you have many successful values to write, it can be better to write ALL values, including failures, into predetermined slots, then run a compaction kernel to strip out the unused values. This approach has many variants too, ranging from custom-coded ones to simple but generic library calls (like CUDPP’s).
No, each thread does not have its own returnIndex. You declared it as a shared variable. This is correct, though, since per-thread indexing would be meaningless.
If you do write every value, then it’s easy, since every thread knows exactly where it will write. It also means perfect coalescing, even on compute capability 1.0 hardware.
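Here is a rough sketch of the write-everything-then-compact approach. It uses Thrust's thrust::copy_if for the compaction step instead of CUDPP, purely because it is short to show. The sentinel EMPTY_SLOT, the validity test (idx % 5 == 0), the kernel writeAll and the wrapper runWriteAllThenCompact are assumptions for illustration, not part of the original code.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <vector>

// Sentinel marking a failed slot; idx % 7 is never negative for idx >= 0.
#define EMPTY_SLOT (-1)

// Every thread writes to its own slot, so there is no race and
// neighbouring threads write neighbouring addresses.
__global__ void writeAll(int *slots, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        bool isValid = (idx % 5 == 0);  // hypothetical stand-in test
        slots[idx] = isValid ? idx % 7 : EMPTY_SLOT;
    }
}

// Predicate for the compaction pass: keep only slots that were filled.
struct IsFilled {
    __host__ __device__ bool operator()(int v) const { return v != EMPTY_SLOT; }
};

// Hypothetical host wrapper: returns the compacted values in thread order.
std::vector<int> runWriteAllThenCompact(int n)
{
    thrust::device_vector<int> slots(n);
    thrust::device_vector<int> packed(n);

    int threads = 128;
    writeAll<<<(n + threads - 1) / threads, threads>>>(
        thrust::raw_pointer_cast(slots.data()), n);

    // copy_if is stable: surviving values keep their original order.
    thrust::device_vector<int>::iterator end =
        thrust::copy_if(slots.begin(), slots.end(), packed.begin(), IsFilled());

    std::vector<int> result(end - packed.begin());
    thrust::copy(packed.begin(), end, result.begin());
    return result;
}
```

Unlike the atomic version, the output order here is deterministic, at the cost of an extra pass over the data.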
No, although that is one line of C, it compiles down to three instructions:
1. Read returnIndex from shared memory into a register. (Since you declared returnIndex as volatile, the compiler is not allowed to reuse a value of returnIndex from a previous read.)
2. Increment that register by 1.
3. Write that register back to the returnIndex location in shared memory.
Two threads can both execute step 1 before either reaches step 3, in which case they get the same index and one of the increments is lost.
Just for future reference, compute capability 1.2 devices and higher (GTX 200 cards and some others) have shared memory atomic operations which can do this in a thread-safe way.
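For illustration, a sketch of what the shared memory atomic version might look like. It assumes a single block with n no larger than the block size, plus a made-up validity test (idx % 5 == 0). The names collectValidShared and runCollectValidShared are hypothetical. As in any atomicAdd-based version, slots start at 0 because atomicAdd returns the old value.

```cuda
#include <cuda_runtime.h>

__global__ void collectValidShared(int *ret, int *count, int n)
{
    __shared__ int returnIndex;
    if (threadIdx.x == 0) returnIndex = 0;
    __syncthreads();  // make sure the counter is initialized before use

    int idx = threadIdx.x;
    bool isValid = (idx < n) && (idx % 5 == 0);  // hypothetical stand-in test
    if (isValid) {
        // The read, increment and write happen as one indivisible operation.
        int slot = atomicAdd(&returnIndex, 1);
        ret[slot] = idx % 7;
    }

    __syncthreads();  // wait until every thread has reserved its slot
    if (threadIdx.x == 0) *count = returnIndex;
}

// Hypothetical host wrapper: single block, assumes n <= 128.
int runCollectValidShared(int *h_ret, int n)
{
    const int threads = 128;
    int *d_ret, *d_count;
    cudaMalloc(&d_ret, threads * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));

    collectValidShared<<<1, threads>>>(d_ret, d_count, n);

    int count = 0;
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_ret, d_ret, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_ret);
    cudaFree(d_count);
    return count;
}
```

With more than one block, each block would have its own returnIndex, so you would still need some way to combine the per-block results (for example, one global atomicAdd per block to reserve a range).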