There are two kernels in the attachment. They’re from a presentation by someone at NVIDIA. Both do the same work: they sum up the values in an array. The second kernel unrolls the loop for the last warp. Since the threads of a warp execute synchronously, the presentation suggests that we don’t need any __syncthreads() in the unrolled part. However, the second kernel does not sum correctly. I fixed the code by adding __syncthreads() after each statement in the unrolled part, and then it worked correctly. I also came up with another solution without __syncthreads(), and astonishingly it works too. Now I’m quite confused:
Do the threads of a warp really work synchronously? If so, how come the original code doesn’t work?
How can my solution work without __syncthreads()?
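In case the attachment is not visible, here is a minimal sketch of the variant with a barrier after each unrolled step. It is not the code from the presentation and not the exact contents of kernel2.cu. The names reduceSum and sumOnGpu, the int element type, and the block size of 256 are my own assumptions for illustration. I have also put each unrolled step under its own tid guard, so that every __syncthreads() sits outside any divergent branch and is reached by all threads of the block.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel name. Assumes blockDim.x is a power of two and >= 64.
__global__ void reduceSum(const int *in, int *out, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread; pad with 0 past the end of the array.
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Looped tree reduction while more than one warp is still active.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Unrolled steps for the last warp, with a barrier after each step.
    // The barriers are outside the if statements, so all threads reach them.
    if (tid < 32) sdata[tid] += sdata[tid + 32];
    __syncthreads();
    if (tid < 16) sdata[tid] += sdata[tid + 16];
    __syncthreads();
    if (tid < 8) sdata[tid] += sdata[tid + 8];
    __syncthreads();
    if (tid < 4) sdata[tid] += sdata[tid + 4];
    __syncthreads();
    if (tid < 2) sdata[tid] += sdata[tid + 2];
    __syncthreads();
    if (tid < 1) sdata[tid] += sdata[tid + 1];

    // Thread 0 wrote sdata[0] itself, so it can read it without a barrier.
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: reduces each block on the GPU, then adds the
// per-block partial sums on the CPU.
int sumOnGpu(const std::vector<int> &h_in)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    reduceSum<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    std::vector<int> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int total = 0;
    for (int v : h_out) total += v;
    return total;
}

int main()
{
    // 1000 elements, deliberately not a multiple of the block size.
    std::vector<int> h_in(1000);
    int cpu = 0;
    for (size_t k = 0; k < h_in.size(); ++k) {
        h_in[k] = static_cast<int>(k % 7);
        cpu += h_in[k];
    }
    int gpu = sumOnGpu(h_in);
    printf("gpu=%d cpu=%d\n", gpu, cpu);
    return gpu == cpu ? 0 : 1;
}
```

The sketch should build with something like nvcc on the file, and it compares the GPU result against a plain CPU sum. The barriers in the last six steps are exactly what the unrolled version from the presentation is supposed to make unnecessary, which is the part I don’t understand.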
kernel2.cu (2.53 KB)