In the following example:
I used that code in my program and it works fine. However, I think it’s incorrect: the threads within the same warp write to shared memory to exchange data, but there’s no __syncthreads() call and the shared memory isn’t declared volatile.
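I found out the hard way that ‘volatile __shared__’ is required even within the same warp. Without volatile, the compiler can keep a shared-memory value in a register, so one thread may not see what another thread wrote. I didn’t suspect this because I assumed the example code was correct.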
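To make the point concrete, here is a minimal sketch of a single-warp sum that exchanges data through shared memory without __syncthreads(). It is not the example code referred to above; the kernel name warpSum, the 32-element size, and the test data are all made up for illustration. It also assumes CUDA 9 or later, because it calls __syncwarp(): on Volta and newer GPUs the threads of a warp are no longer guaranteed to run in lockstep, so volatile on its own is not enough there.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: sums 32 ints using exactly one warp.
__global__ void warpSum(const int *in, int *out)
{
    // volatile makes every access go to shared memory, so the compiler
    // cannot cache s[lane + offset] in a register between steps.
    __shared__ volatile int s[32];

    unsigned lane = threadIdx.x;   // assumes blockDim.x == 32
    s[lane] = in[lane];
    __syncwarp();                  // assumption: CUDA 9+, needed on Volta and later

    for (int offset = 16; offset > 0; offset >>= 1) {
        if (lane < offset)
            s[lane] = s[lane] + s[lane + offset];
        __syncwarp();              // all 32 lanes reach this, outside the if
    }

    if (lane == 0)
        *out = s[0];
}

int main()
{
    int h_in[32], h_out = 0, expected = 0;
    for (int i = 0; i < 32; ++i) {
        h_in[i] = i + 1;           // illustrative test data
        expected += h_in[i];
    }

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    warpSum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    printf("sum = %d (expected %d)\n", h_out, expected);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out == expected ? 0 : 1;
}
```

If you remove volatile and the __syncwarp() calls, the kernel may still produce the right answer on some compiler and GPU combinations, which is probably why code like this appears to work fine until it doesn't.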