I am seeing strange performance behavior in a kernel I am writing. Part of the kernel looks like this:
    if (coalesced_roots && cur_io_group == io_group) {
    } else {
        u32 r2 = __brev((lid << 1) + 1) >> (32 - (deg << 1));
        u64 pos = twiddle * (r2 >> deg);
        t1 = mont256::Element::load(roots + pos * WORDS);
    }
where ‘coalesced_roots’ is an input variable that is set to false, so the else branch is always taken. However, when I removed the if condition entirely, performance got worse. I profiled the kernel with ncu, and it turned out that compute throughput in the new kernel actually increased, while total SM active cycles increased by 12%, which accounts for the longer duration. The only noticeable difference in the report is that the new kernel contains additional ISETP and P2R instructions, while the old one contains none. What could cause this, and how should I resolve it?
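For context, here is a minimal sketch of what the index computation in the else branch does, in case it matters for the analysis. It assumes `lid`, `deg`, and `twiddle` are plain integers as in the snippet above. The helpers `brev32` and `root_index` are hypothetical names used only for illustration: `brev32` is a portable stand-in for the `__brev` intrinsic so the sketch also compiles on the host, and the real kernel calls `__brev` directly.

```cpp
#include <cstdint>

using u32 = std::uint32_t;
using u64 = std::uint64_t;

// Allow the same code to compile with nvcc (host + device) or a plain C++ compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical portable stand-in for __brev: reverses the 32 bits of x.
HD constexpr u32 brev32(u32 x) {
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);   // swap adjacent bits
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);   // swap bit pairs
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);   // swap nibbles
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);   // swap bytes
    return (x >> 16) | (x << 16);                              // swap half-words
}

// Hypothetical helper mirroring the else branch: bit-reverse (2*lid + 1)
// within 2*deg bits, keep the top deg bits, and scale by twiddle.
// Assumes 1 <= deg <= 16 so the shift amount stays in [0, 30].
HD constexpr u64 root_index(u32 lid, u32 deg, u64 twiddle) {
    u32 r2 = brev32((lid << 1) + 1) >> (32 - (deg << 1));
    return twiddle * (r2 >> deg);
}

// Compile-time checks for deg = 2: 2*lid + 1 is reversed within 4 bits.
static_assert(brev32(1u) == 0x80000000u, "bit 0 moves to bit 31");
static_assert(root_index(0, 2, 3) == 6, "0001 -> 1000, top 2 bits = 2");
static_assert(root_index(1, 2, 3) == 9, "0011 -> 1100, top 2 bits = 3");
```

The computation depends only on `lid`, `deg`, and `twiddle`, none of which involve `coalesced_roots` or `io_group`, so I would not expect removing the condition to change the address that is loaded. Note that `deg = 0` would make the shift amount 32, which is undefined behavior for a 32-bit value.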