Switch statements and performance without thread divergence


I am relatively new to CUDA programming and cannot work out a performance issue I have stumbled on related to branching. My code looks roughly like this:

template< int mask >
__device__ static void some_giant_func(/* some args */) {
    // Lots of code
}

__device__ void func(int control_variable, /* some args */) {

    switch (control_variable) {

    case mask1:
        some_giant_func< mask1 >(…);
        break;
    case mask2:
        some_giant_func< mask2 >(…);
        break;
    }
}


The important thing to note is that control_variable is the same for every thread, so there is no warp divergence. However, merely commenting out the branches that are never taken in a given simulation improves performance by 40%. This effect is not seen in the (identical) CPU version of this code compiled with Clang.

I also notice similar, though less extreme, behavior with if-else statements: all threads take one branch, but commenting out the irrelevant branch still speeds up the simulation.

nvprof indicates that the number of active warps improves as I comment out the other branches, while l2_local_load_bytes decreases and l2_hit_ratio increases.

I was wondering if someone has encountered a similar situation before and has an explanation? I have found several topics on branches leading to divergence, but in this case all threads take the same branch.

So far I have been able to work around this issue with templated kernels (templating the kernel and avoiding the switch statement within the kernel code), but I would like to understand what the cause of this behavior is.