One of the optimization strategies described in the reduction example completely unrolls the loop for the last 32 threads (i.e. the last warp), since the threads of a warp execute in SIMD lockstep. However, that example uses a 1D block of threads.
Question: for a 2D block of threads, which threads are in the same warp? If tidx = threadIdx.x and tidy = threadIdx.y, which (tidx, tidy) pairs fall in the same warp and therefore don’t need the __syncthreads()?
Thanks for the pointer. That was the missing link. If I combine that section with section 3.1 (A Set of SIMT Multiprocessors with On-Chip Shared Memory), then my answer is complete.
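For anyone landing here later, this is a minimal sketch of the mapping as I understand it. It assumes the thread ID rule from the CUDA Programming Guide: in a 2D block of size (Dx, Dy), the thread at index (x, y) has linear ID x + y * Dx, and warps are built from consecutive linear IDs starting at 0. The helper names (linear_tid, warp_of, lane_of) and the HOST_DEVICE macro are my own, not part of any CUDA API, and the warp size of 32 is hard-coded as an assumption.

```cpp
// Compiles as plain C++ or with nvcc; under nvcc the helpers can be called from kernels.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Assumption: 32 threads per warp, as in the question.
constexpr int kWarpSize = 32;

// Linear thread ID inside a 2D block of width dim_x (i.e. blockDim.x).
// threadIdx.x varies fastest, so the ID is x + y * dim_x.
HOST_DEVICE inline int linear_tid(int tidx, int tidy, int dim_x) {
    return tidx + tidy * dim_x;
}

// Index of the warp (within the block) that a thread belongs to.
HOST_DEVICE inline int warp_of(int tidx, int tidy, int dim_x) {
    return linear_tid(tidx, tidy, dim_x) / kWarpSize;
}

// Position of the thread inside its warp (0..31).
HOST_DEVICE inline int lane_of(int tidx, int tidy, int dim_x) {
    return linear_tid(tidx, tidy, dim_x) % kWarpSize;
}
```

Inside a kernel this would be called as warp_of(threadIdx.x, threadIdx.y, blockDim.x). Two threads are in the same warp exactly when warp_of returns the same value for both. For a 16x16 block, rows tidy = 0 and 1 form warp 0, rows 2 and 3 form warp 1, and so on; for an 8x8 block, rows 0 to 3 form warp 0. Note that on newer GPUs with independent thread scheduling, being in the same warp is no longer enough to skip synchronization safely; __syncwarp() is the usual replacement there.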