Performance of atomic operations vs branching of threads


In the following kernel, OPTION_A and OPTION_B produce the same result. However, OPTION_A involves a branch on which the threads diverge, though only on rare occasions. OPTION_B does not involve any branch. In OPTION_A, the atomic operation is executed rarely; in OPTION_B, the atomic operation is executed by all threads, always.

Question: which option, A or B, is better for performance?

In a __global__ kernel:


           // some computation
           // x will be either 0 or 1
           // most of the time x will be 0; very rarely is x 1

#ifdef OPTION_A
           if (x == 1)
               old = atomicAdd(addr, 1);
#endif

#ifdef OPTION_B
           old = atomicAdd(addr, x);
#endif

A should be a lot better. The cost of evaluating (x == 1) is negligible against the cost of many atomic additions: in OPTION_B every thread issues an atomicAdd even when x is 0, and those operations on the same address serialize. You could also time it yourself, though.

If you are compiling for Fermi and it is likely that more than one thread per warp has x == 1, this might be even better:

        // one bit per lane with x != 0, popcounted to a per-warp total
        int count = __popc(__ballot(x));

        // lane 0 of each warp issues a single atomicAdd of the warp's count
        if (((((threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x) + threadIdx.x) & 31) == 0 && count != 0)
            atomicAdd(addr, count);
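The arithmetic behind the ballot/popcount trick can be checked on the host side. Below is a minimal Python simulation (the function name and the flag layout are my own, for illustration): each warp of 32 threads builds a ballot bitmask of the lanes whose flag is set, popcounts it, and lane 0 alone performs a single addition of the count, so the number of "atomic" operations drops from one per flagged thread to at most one per warp.

```python
def warp_aggregate(xs, warp_size=32):
    """Simulate per-warp aggregation of a 0/1 flag.

    Returns (total, atomic_ops): the accumulated sum (stands in for
    *addr) and how many atomicAdds were issued.
    """
    total = 0
    atomic_ops = 0
    for w in range(0, len(xs), warp_size):
        warp = xs[w:w + warp_size]
        # __ballot(x): one bit per lane with x != 0
        ballot = sum(1 << lane for lane, x in enumerate(warp) if x)
        # __popc(ballot): number of set bits
        count = bin(ballot).count("1")
        if count != 0:        # lane 0 alone does the atomic
            total += count
            atomic_ops += 1
    return total, atomic_ops

# 64 threads (2 warps), three flags set: the sum is still 3,
# but only 2 additions are issued instead of 3.
flags = [0] * 64
flags[3] = flags[40] = flags[41] = 1
print(warp_aggregate(flags))   # (3, 2)
```

On newer architectures the same idea is spelled `__ballot_sync(mask, x)`, but the counting logic is identical.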