In the documentation (the CUDA C Programming Guide), as well as in the GPU Computing SDK reduction sample, it is noted that during a reduction no __syncthreads() is
needed once only the last 32 threads (a single warp) remain.
On a C1060 (compiled with -arch=sm_13), however, the reduction that computes a minimum runs ~15% faster
if __syncthreads() IS included in the macro CHECK, as in:
if (threadIdx.x < 64) {
    if (s[threadIdx.x] > s[threadIdx.x + 64])
        s[threadIdx.x] = s[threadIdx.x + 64];
}
__syncthreads();
if (threadIdx.x < 32) {
    CHECK(32)
    CHECK(16)
    CHECK(8)
    CHECK(4)
    CHECK(2)
    CHECK(1)
}
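The CHECK macro itself is not shown above; presumably it performs one step of the warp-level min-reduction on the volatile shared array s. A hypothetical reconstruction (not the poster's actual code) of the faster variant, with the extra __syncthreads(), might look like:

```cuda
// Hypothetical reconstruction of the CHECK macro: one step of a
// min-reduction in the volatile shared-memory array s. The trailing
// __syncthreads() is the addition that, counterintuitively, makes the
// kernel faster on sm_13 here, even though within a single warp it is
// not needed for correctness.
#define CHECK(n)                                \
    if (s[threadIdx.x] > s[threadIdx.x + (n)])  \
        s[threadIdx.x] = s[threadIdx.x + (n)];  \
    __syncthreads();
```

Note that calling __syncthreads() inside the `if (threadIdx.x < 32)` branch is divergent and formally outside the specification, although (as stated above) the code gives correct results either way.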
Does anybody happen to know why?
Notes: The array s was declared volatile.
The code gives correct results either way.
Replacing the comparison with fmin() runs a tiny bit faster, but the same effect holds: adding __syncthreads() to CHECK makes the kernel faster.
The kernel is run with 128 threads per block.
Do both kernels use the same number of registers (check by adding -Xptxas=-v to the nvcc invocation)? After staring at a lot of disassembled sm_13 code recently, I could imagine that the version with __syncthreads() might use fewer registers and thus achieve higher occupancy.
You can have a look yourself by running the executable through cuobjdump -sass for both versions. If you find it difficult to interpret the results, post them here.
You might also try to replace the unnecessary __syncthreads() with __threadfence_block() and see what happens. Finally, you might find completely different results by using the new LLVM-based compiler in CUDA 4.1 with nvcc -nvvm.
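The suggested substitution would amount to something like the following (again a hypothetical sketch of the macro, not the poster's actual code):

```cuda
// Hypothetical variant of the CHECK macro using __threadfence_block()
// instead of __syncthreads(): it only ensures the write to shared memory
// is visible to other threads in the block, without acting as an
// execution barrier, so it is typically cheaper than a full barrier.
#define CHECK(n)                                \
    if (s[threadIdx.x] > s[threadIdx.x + (n)])  \
        s[threadIdx.x] = s[threadIdx.x + (n)];  \
    __threadfence_block();
```

Unlike __syncthreads(), __threadfence_block() is safe to call in divergent code, so using it inside the `if (threadIdx.x < 32)` branch raises no correctness concerns.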