Hello,
In the following kernel, OPTION_A and OPTION_B both leave the same final value at addr. OPTION_A uses a branch, so the threads of a warp diverge, but only on the rare occasions when x is 1, and the atomic operation is executed rarely. OPTION_B has no branch, but every thread always executes the atomic operation.
Question: which option, A or B, is better for performance?
Inside a __global__ kernel:
{
    // some computation
    // x will be either 0 or 1
    // most of the time, x will be 0
    // very rarely is x 1
#ifdef OPTION_A
    if( x == 1 )
    {
        old = atomicAdd( addr, 1 );
    }
#endif
#ifdef OPTION_B
    old = atomicAdd( addr, x );
#endif
}
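To make the comparison concrete, here is a minimal timing sketch that could be used to measure both options. It is only an illustration under stated assumptions: compute_x, kernel_branch, kernel_always, run_option, the problem size n and the value of stride are all hypothetical stand-ins for the real computation, which is not shown above. The outcome will likely depend on the GPU, the compiler version and how rare x == 1 really is.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for "some computation": x is 1 for one thread in
// every `stride` threads and 0 for all the others.
__device__ int compute_x(int tid, int stride)
{
    return (tid % stride == 0) ? 1 : 0;
}

// OPTION_A: only the threads with x == 1 issue the atomic.
__global__ void kernel_branch(int *addr, int n, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int x = compute_x(tid, stride);
    if (x == 1)
        atomicAdd(addr, 1);
}

// OPTION_B: every thread issues the atomic, adding 0 most of the time.
__global__ void kernel_always(int *addr, int n, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    atomicAdd(addr, compute_x(tid, stride));
}

// Launches the chosen kernel, stores the elapsed time in *ms and returns
// the final counter value. Error checking is omitted for brevity.
static void launch(bool useBranch, int *d_counter, int n, int stride)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    if (useBranch)
        kernel_branch<<<blocks, threads>>>(d_counter, n, stride);
    else
        kernel_always<<<blocks, threads>>>(d_counter, n, stride);
}

int run_option(bool useBranch, int n, int stride, float *ms)
{
    int *d_counter = nullptr;
    cudaMalloc(&d_counter, sizeof(int));

    // Untimed warm-up launch so one-time startup cost is not measured.
    cudaMemset(d_counter, 0, sizeof(int));
    launch(useBranch, d_counter, n, stride);
    cudaDeviceSynchronize();

    // Reset the counter, then time a single launch with CUDA events.
    cudaMemset(d_counter, 0, sizeof(int));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launch(useBranch, d_counter, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms, start, stop);

    int h_counter = 0;
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_counter);
    return h_counter;
}

int main()
{
    const int n = 1 << 22;    // assumed problem size
    const int stride = 4096;  // assumed rarity: one thread in 4096 has x == 1
    float msA = 0.0f, msB = 0.0f;
    int countA = run_option(true, n, stride, &msA);
    int countB = run_option(false, n, stride, &msB);
    printf("OPTION_A: count = %d, time = %.3f ms\n", countA, msA);
    printf("OPTION_B: count = %d, time = %.3f ms\n", countB, msB);
    return 0;
}
```

Both runs should report the same count, which confirms the two options are equivalent as far as the value at addr goes. A single timed launch is noisy, so averaging over several launches and trying a few values of stride would give a more trustworthy picture.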