I am a bit unsure about the use of local variables inside a kernel. Does the code below look OK, or could the local variables conflict between threads, e.g. through a race condition? I would appreciate your advice, even if it is just “it is OK”.
There’s really nothing wrong with an if statement like the one you have, which cuts off threads that extend past the array boundary. (This is quite common when your data size is not divisible by a good block size.) You obviously don’t want a huge number of unused threads, but a warp or two of idle threads isn’t going to kill your performance, especially since this code is memory-bandwidth-bound.
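Since the original kernel is not shown here, the following is only a minimal sketch of the pattern being discussed, not the poster's code. The kernel name `saxpy`, the helper `run_saxpy`, and the block size of 256 are all assumptions made for illustration. It shows the boundary check and also why ordinary local variables are safe: variables declared inside a kernel without `__shared__` are private to each thread, so `i` and `tmp` below cannot race.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: y[i] = a * x[i] + y[i].
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    // i and tmp are automatic variables: every thread gets its own
    // private copy, so no other thread can read or overwrite them.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end of the array simply do nothing.
    if (i < n) {
        float tmp = a * x[i];
        y[i] = tmp + y[i];  // each thread writes only its own element
    }
}

// Hypothetical host helper: fills x with x0 and y with y0, runs the
// kernel, and returns the resulting y. Error checking is omitted.
std::vector<float> run_saxpy(int n, float a, float x0, float y0)
{
    assert(n > 0);
    std::vector<float> hx(n, x0), hy(n, y0);
    const size_t bytes = n * sizeof(float);

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    // Round the grid size up so every element is covered; the last
    // block may contain threads with i >= n, which the if filters out.
    const int block = 256;  // assumed block size
    const int grid = (n + block - 1) / block;
    saxpy<<<grid, block>>>(n, a, dx, dy);

    // This blocking copy waits for the kernel to finish first.
    cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return hy;
}
```

With `n = 1000` and a block size of 256, the launch uses 4 blocks (1024 threads), so only 24 threads fall through the if without doing work.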
Also, you don’t need a __syncthreads() at the end of this code. It is a barrier for the threads of a single block, and is only needed to avoid shared memory race conditions, where threads read shared memory locations after other threads in the same block have written to them. (And that’s basically it. It protects you from no other race conditions, and in particular not from races between threads in different blocks!)
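For contrast, here is a hedged sketch of a case where __syncthreads() is needed. It is not from the original post: the kernel `reverse_block`, the helper `run_reverse`, and the constant `BLOCK` are hypothetical names. Each thread writes one shared memory slot and then reads a slot written by a different thread of the same block, so the barrier must sit between the write and the read.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 128;  // assumed block size for this sketch

// Hypothetical kernel: reverses the first n elements of d in place,
// using a single block and requiring n <= BLOCK.
__global__ void reverse_block(int *d, int n)
{
    __shared__ int s[BLOCK];  // one copy shared by the whole block
    int t = threadIdx.x;      // private to each thread

    if (t < n) s[t] = d[t];   // each thread writes its own slot

    // Barrier: all writes to s above are complete before any thread
    // reads below. It is placed outside the if so that every thread
    // of the block reaches it.
    __syncthreads();

    if (t < n) d[t] = s[n - 1 - t];  // reads a slot another thread wrote
}

// Hypothetical host helper: reverses 0, 1, ..., n-1 on the device.
// Error checking is omitted.
std::vector<int> run_reverse(int n)
{
    assert(n > 0 && n <= BLOCK);
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = nullptr;
    const size_t bytes = n * sizeof(int);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    reverse_block<<<1, BLOCK>>>(d, n);  // a single block

    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

A kernel in which every thread touches only its own private variables and its own global memory element has no such write-then-read dependency between threads, so it needs no barrier.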