Are threads of a warp really synchronous?

Hi all,
There are 2 kernels in the attachment. They’re from a presentation by an Nvidia guy. Both do the same work: they sum up the values in an array. The second kernel unrolls the loop for the last warp. Since the threads of a warp work synchronously, the Nvidia guys suggested that we don’t need any __syncthreads() there. The second kernel, however, does not sum correctly. I fixed the code by adding __syncthreads(); after each statement in the unrolled part, and then it worked correctly. I also came up with another solution without __syncthreads();. Astonishingly, it works. Now I’m quite confused:

Do the threads of a warp really work synchronously? If so, how come the original code doesn’t work?
Without __syncthreads();, how can my solution work?

kernel2.cu (2.53 KB)

The original code was correct on the old G80 architecture, but it isn’t anymore on Fermi, because the compiler does more aggressive optimizations. If shared-memory stores and reads aren’t separated by __syncthreads(), the compiler may decide to hold intermediate values in registers instead of writing them back to shared memory, so the other threads of the warp never see the updated partial sums. The standard fix that still avoids the synchronization is to mark the shared array as volatile. (For more details see the Fermi Compatibility Guide, chapter 1.3.3)
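
For illustration, here is a minimal sketch of the volatile fix. It is not the code from kernel2.cu: the names warpReduce, reduceSum, sumOnDevice and kBlockSize are made up for this example, the block size of 128 is an assumption, and error checking is left out. It also assumes the threads of a warp execute in lockstep, as on the GPUs discussed in this thread; on GPUs with independent thread scheduling (Volta and later) volatile alone is not guaranteed to be enough and the steps should additionally be separated with __syncwarp().

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed block size for this sketch: a power of two, at least 64.
constexpr unsigned int kBlockSize = 128;

// Unrolled reduction for the last warp. Because sdata is volatile, every
// statement performs a real shared-memory load and store, so the compiler
// cannot keep the partial sums in registers between the steps.
__device__ void warpReduce(volatile int* sdata, unsigned int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// Each block sums up to kBlockSize inputs and writes one partial sum.
// Must be launched with blockDim.x == kBlockSize.
__global__ void reduceSum(const int* in, int* out, unsigned int n) {
    __shared__ int sdata[kBlockSize];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0;   // pad the tail with zeros
    __syncthreads();

    // Steps that span several warps still need the block-wide barrier.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Last warp: no __syncthreads(), relies on volatile (see assumptions above).
    if (tid < 32) warpReduce(sdata, tid);
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: runs the kernel and adds the per-block
// partial sums on the CPU.
int sumOnDevice(const std::vector<int>& h) {
    unsigned int n = static_cast<unsigned int>(h.size());
    if (n == 0) return 0;
    unsigned int blocks = (n + kBlockSize - 1) / kBlockSize;

    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, blocks * sizeof(int));
    cudaMemcpy(dIn, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    reduceSum<<<blocks, kBlockSize>>>(dIn, dOut, n);

    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    int total = 0;
    for (int p : partial) total += p;
    return total;
}
```

The only change compared with a plain unrolled version is the volatile qualifier on the pointer passed to warpReduce; the shared array itself can stay non-volatile, so the loop above it is not slowed down.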

Thanks a lot for the help, MarkusM.