min reduction: how to compute the minimum the fastest way

In the documentation (CUDA C Programming Guide), as well as in the GPU Computing SDK, it is noted that in a reduction, no __syncthreads() is
needed once one is left with the last 32 threads.

On a C1060 (compiled with -arch=sm_13), however, the reduction to compute the minimum runs ~15% faster
if __syncthreads() IS included in the macro CHECK, like so:

#define CHECK(diff) \
    if (s[threadIdx.x] > s[threadIdx.x + diff]) \
        s[threadIdx.x] = s[threadIdx.x + diff]; \
    __syncthreads();

            if (threadIdx.x < 64) {
                    if (s[threadIdx.x] > s[threadIdx.x + 64])
                            s[threadIdx.x] = s[threadIdx.x + 64];
            }
            __syncthreads();
            if (threadIdx.x < 32) {
                    CHECK(32)
                    CHECK(16)
                    CHECK(8)
                    CHECK(4)
                    CHECK(2)
                    CHECK(1)
            }

Does anybody happen to know why?

Notes: The array s was declared volatile.
The code gives correct results either way.
Replacing the comparison with fmin runs a tiny bit faster, but the same holds: adding __syncthreads() to CHECK makes the kernel faster.
The kernel is run with 128 threads.
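
For context, here is a minimal self-contained sketch of the setup (my own illustration with made-up names such as minReduce and N_THREADS, not the attached t3.cu; one block of 128 threads reducing 128 floats):

#include <cstdio>
#include <cuda_runtime.h>

#define N_THREADS 128

// Same macro as above, with the extra __syncthreads().
#define CHECK(diff) \
    if (s[threadIdx.x] > s[threadIdx.x + diff]) \
        s[threadIdx.x] = s[threadIdx.x + diff]; \
    __syncthreads();

__global__ void minReduce(const float *in, float *out)
{
    // volatile, as noted above, so the last-warp accesses
    // are not cached in registers
    __shared__ volatile float s[N_THREADS];

    s[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    if (threadIdx.x < 64) {
        if (s[threadIdx.x] > s[threadIdx.x + 64])
            s[threadIdx.x] = s[threadIdx.x + 64];
    }
    __syncthreads();
    if (threadIdx.x < 32) {    // last warp, as in the post above
        CHECK(32)
        CHECK(16)
        CHECK(8)
        CHECK(4)
        CHECK(2)
        CHECK(1)
    }
    if (threadIdx.x == 0)
        out[0] = s[0];
}

int main(void)
{
    float h_in[N_THREADS], h_out;
    for (int i = 0; i < N_THREADS; ++i)
        h_in[i] = (float)(N_THREADS - i);    // minimum is 1.0f

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, sizeof(h_in));
    cudaMalloc((void **)&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    minReduce<<<1, N_THREADS>>>(d_in, d_out);

    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("min = %f\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

For the fmin variant mentioned above, the macro body would become s[threadIdx.x] = fminf(s[threadIdx.x], s[threadIdx.x + diff]); (fminf being the single-precision device function).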

–OS

Do both kernels use the same number of registers (check by adding -Xptxas=-v to the nvcc invocation)? After staring at a lot of disassembled sm_13 code recently, I could imagine that the version with __syncthreads() might use fewer registers and thus achieve higher occupancy.
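
For example:

nvcc -arch=sm_13 -Xptxas=-v -o t3 t3.cu

ptxas then reports the register and shared memory usage of each kernel.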

You can have a look yourself by running the executable through cuobjdump -sass for both versions. If you find it difficult to interpret the results, post them here.

You might also try replacing the unnecessary __syncthreads() with __threadfence_block() and see what happens. Finally, you might find completely different results by using the new LLVM-based compiler in CUDA 4.1 with nvcc -nvvm.
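
That is, a sketch of your macro with the barrier replaced by a block-level memory fence (which only orders shared memory accesses and, unlike __syncthreads(), is also safe inside divergent code such as the threadIdx.x < 32 branch):

#define CHECK(diff) \
    if (s[threadIdx.x] > s[threadIdx.x + diff]) \
        s[threadIdx.x] = s[threadIdx.x + diff]; \
    __threadfence_block();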

Thank you for your suggestions.

The PTX looks identical with respect to register usage; there are just extra bar.sync 0 instructions.

I also tried timing with cudaprof, and the results are the same.

Upon more testing, I found that the “advantage” of calling the extra __syncthreads() disappears if the reduction is done multiple times (>13) in a loop.
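
By “in a loop” I mean repeating the whole reduction inside the kernel, roughly like this (a sketch of the idea, not the exact code from t3.cu; REPS is an illustrative name):

#define REPS 16    // with REPS > 13 both versions time the same

for (int r = 0; r < REPS; ++r) {
    s[threadIdx.x] = in[threadIdx.x];    // reload the input
    __syncthreads();
    if (threadIdx.x < 64) {
        if (s[threadIdx.x] > s[threadIdx.x + 64])
            s[threadIdx.x] = s[threadIdx.x + 64];
    }
    __syncthreads();
    if (threadIdx.x < 32) {
        CHECK(32) CHECK(16) CHECK(8) CHECK(4) CHECK(2) CHECK(1)
    }
    __syncthreads();    // keep the repetitions independent
}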

The compiler installed on our system does not support the -nvvm option, so I could not try that.

Here’s the source for anyone to try for themselves: kernel4 does not call __syncthreads(), whereas kernel5 does.

–OS

t3.cu (10 KB)