Reducing branches causes longer duration

I ran into a strange performance result while writing a kernel. Part of my kernel looks like this:

    if (coalesced_roots && cur_io_group == io_group) {
       
    } else {
        u32 r2 = __brev((lid << 1) + 1) >> (32 - (deg << 1));

        u64 pos = twiddle * (r2 >> deg);
        t1 = mont256::Element::load(roots + pos * WORDS);
    }

where ‘coalesced_roots’ is an input variable that is set to false. However, when I completely removed the if condition, performance got worse. I profiled the kernel with ncu, and it turned out that compute throughput in the new kernel actually increased, but total SM active cycles also increased by 12%, which accounts for the longer duration. The only other noticeable difference in the report is that the new kernel contains ISETP and P2R instructions, while the old one has none. So I'm really curious what could cause this and how I should resolve it.

Without code that one can compile, so that the generated SASS can be examined, it is impossible to tell what might be going on. You could prepare a minimal reproducer that others can compile, run, and profile.
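As a starting point, a reproducer could look roughly like the sketch below. It is only an illustration built around the posted snippet: the `Element` struct is a hypothetical stand-in for `mont256::Element`, `WORDS = 8` is an assumed word count, and the launch configuration and the values passed for `deg` and `twiddle` are made up so that the index stays within a small `roots` buffer. The template parameter `KEEP_BRANCH` selects between the version with the runtime condition and the version without it, so both end up in one binary.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

typedef uint32_t u32;
typedef uint64_t u64;

constexpr u32 WORDS = 8;  // assumption: 8 x 32-bit words per 256-bit element

// Hypothetical stand-in for mont256::Element; the real type is not in the post.
struct Element {
    u32 w[WORDS];
    __device__ static Element load(const u32* p) {
        Element e;
        for (u32 i = 0; i < WORDS; ++i) e.w[i] = p[i];
        return e;
    }
};

// KEEP_BRANCH = true keeps the runtime condition; false compiles it away.
template <bool KEEP_BRANCH>
__global__ void probe(const u32* roots, u32* out, u32 deg, u64 twiddle,
                      bool coalesced_roots, u32 cur_io_group, u32 io_group) {
    u32 lid = threadIdx.x;
    Element t1 = {};
    if (KEEP_BRANCH && coalesced_roots && cur_io_group == io_group) {
        // intentionally empty, as in the posted snippet
    } else {
        u32 r2 = __brev((lid << 1) + 1) >> (32 - (deg << 1));
        u64 pos = twiddle * (r2 >> deg);
        t1 = Element::load(roots + pos * WORDS);
    }
    // Store the result so the compiler cannot discard the load.
    u32* dst = out + (blockIdx.x * blockDim.x + lid) * WORDS;
    for (u32 i = 0; i < WORDS; ++i) dst[i] = t1.w[i];
}

// Average time per launch in milliseconds, measured with CUDA events.
template <bool KEEP_BRANCH>
float time_ms(const u32* roots, u32* out, int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    probe<KEEP_BRANCH><<<blocks, threads>>>(roots, out, 8, 1, false, 0, 0);  // warm-up
    cudaEventRecord(start);
    for (int i = 0; i < 100; ++i)
        probe<KEEP_BRANCH><<<blocks, threads>>>(roots, out, 8, 1, false, 0, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / 100.0f;
}

int main() {
    const int blocks = 4096, threads = 256;  // made-up launch configuration
    u32 *roots, *out;
    // With deg = 8 and twiddle = 1, pos stays below 256.
    cudaMalloc(&roots, 256 * WORDS * sizeof(u32));
    cudaMalloc(&out, (size_t)blocks * threads * WORDS * sizeof(u32));
    cudaMemset(roots, 1, 256 * WORDS * sizeof(u32));
    printf("with branch:    %.4f ms\n", time_ms<true>(roots, out, blocks, threads));
    printf("without branch: %.4f ms\n", time_ms<false>(roots, out, blocks, threads));
    cudaFree(roots);
    cudaFree(out);
    return cudaGetLastError() == cudaSuccess ? 0 : 1;
}
```

Built with nvcc for the architecture you are targeting, the two instantiations can then be compared side by side in the output of `cuobjdump -sass` and profiled with ncu. A stand-in this small may well not show the slowdown, so the surrounding code of the real kernel would likely have to be added back piece by piece until the ISETP and P2R instructions reappear.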