Switch statements and performance without thread divergence


I am relatively new to CUDA programming and cannot work out a performance issue I have stumbled on related to branching. My code looks roughly like this:

template< int mask >
__device__ static void some_giant_func(/* some args */) {
    // Lots of code
}

__device__ void func(int control_variable, /* some args */) {

    switch (control_variable) {

    case mask1:
        some_giant_func< mask1 >(…);
        break;
    case mask2:
        some_giant_func< mask2 >(…);
        break;
    }
}


The important thing to note is that control_variable is the same for every thread, so there is no warp divergence. However, merely commenting out the branches that are never taken in a given simulation improves performance by 40%. This effect is not seen in the (identical) CPU version of this code compiled with Clang.

I also notice similar, though less extreme, behavior with if-else statements: all threads take one branch, but commenting out the irrelevant branch still speeds up the simulation.

nvprof indicates that the number of active warps improves as I comment out the other branches, while l2_local_load_bytes decreases and l2_hit_ratio increases.

I was wondering if someone has encountered a similar situation before and has an explanation? I have found several topics on branches leading to divergence, but in this case all threads take the same branch.

So far I have been able to work around this issue with templated kernels (templating the kernel and avoiding the switch statement within the kernel code), but I would like to understand what the cause of this behavior is.