About if-else between warps

Hello, NV experts,
Performance can be very poor when an if-else diverges within a warp, like this:

val = global[lane_idx];
if(val >= 16){
    ......
    ......
}else{
    ......
    ......
}

Now the if-else does not diverge within a warp; the two paths are split between warps, like this:

int warp_idx = threadIdx.x / 32;
if (warp_idx < 2) {
    ......
    // function for warps 0 and 1
    ......
} else {
    ......
    // function for warps 2 and 3
    ......
}

I am not sure what effect the code above has on performance.
In my application I did not see any slowdown; the only change I noticed is that the code size became larger.
I am not sure how it behaves on other CUDA architectures (mine is compute capability 8.6, Ampere).
So, how should I evaluate the code above?
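
One way to check whether a branch condition is uniform per warp is to ballot it across the warp before branching. The following is only a minimal sketch, assuming one block of 128 threads (4 full warps); the kernel and helper names (`ballot_kernel`, `count_mixed_warps`) are hypothetical and not part of the original code. A ballot of all zeros or all ones means every lane of that warp takes the same path; anything else means the warp would diverge.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each warp records the ballot of two predicates.
__global__ void ballot_kernel(unsigned *warp_level, unsigned *lane_level)
{
    int lane_idx = threadIdx.x % 32;
    int warp_idx = threadIdx.x / 32;

    bool warp_pred = (warp_idx < 2);    // same value for every lane of a warp
    bool lane_pred = (lane_idx >= 16);  // differs between lanes of one warp

    // All 32 lanes participate; bit k of the result is lane k's predicate.
    unsigned b_warp = __ballot_sync(0xffffffffu, warp_pred);
    unsigned b_lane = __ballot_sync(0xffffffffu, lane_pred);

    if (lane_idx == 0) {
        warp_level[warp_idx] = b_warp;
        lane_level[warp_idx] = b_lane;
    }
}

// Returns how many of the 4 warps have a mixed ballot (neither all 0 nor
// all 1) for the chosen predicate, or -1 on a CUDA error.
int count_mixed_warps(bool use_lane_predicate)
{
    const int kWarps = 4;
    unsigned *d_warp = nullptr, *d_lane = nullptr;
    unsigned h_warp[kWarps], h_lane[kWarps];

    if (cudaMalloc(&d_warp, sizeof(h_warp)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_lane, sizeof(h_lane)) != cudaSuccess) {
        cudaFree(d_warp);
        return -1;
    }

    ballot_kernel<<<1, kWarps * 32>>>(d_warp, d_lane);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_warp, d_warp, sizeof(h_warp), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_lane, d_lane, sizeof(h_lane), cudaMemcpyDeviceToHost);
    cudaFree(d_warp);
    cudaFree(d_lane);
    if (err != cudaSuccess) return -1;

    int mixed = 0;
    for (int w = 0; w < kWarps; ++w) {
        unsigned b = use_lane_predicate ? h_lane[w] : h_warp[w];
        if (b != 0u && b != 0xffffffffu) ++mixed;
    }
    return mixed;
}

int main()
{
    printf("warp-level predicate: %d mixed warps\n", count_mixed_warps(false));
    printf("lane-level predicate: %d mixed warps\n", count_mixed_warps(true));
    return 0;
}
```

With this launch configuration the warp-level predicate should report no mixed warps, while the lane-level predicate splits every warp in half. This only shows whether a condition is warp-uniform; it says nothing about how much a divergent branch actually costs in a given application.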

A data-driven if-then-else does not necessarily cause a performance problem. The compiler may apply if-conversion (replacing the branch with predicated or select instructions), or, if the branch is retained, cases of actual branch divergence may be rare.
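
To illustrate what if-conversion means, here is a rough sketch with two hypothetical kernels (`with_branch`, `with_select`) that compute the same result: one written with an explicit if-else, one written as a select. For short bodies like these the compiler may generate branch-free code for both, but whether it actually does depends on the compiler version, target architecture, and optimization level. Inspecting the generated machine code, for example with `cuobjdump -sass`, is the way to confirm it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Natural form: a data-dependent if-else with short bodies.
__global__ void with_branch(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int val = in[i];
    int r;
    if (val >= 16) {
        r = val * 2;
    } else {
        r = val + 1;
    }
    out[i] = r;
}

// Hand-written select form: roughly what if-conversion aims for.
__global__ void with_select(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int val = in[i];
    out[i] = (val >= 16) ? val * 2 : val + 1;
}

// Runs both kernels on the same input and returns the number of elements
// where their outputs differ, or -1 on a CUDA error.
int count_mismatches(int n)
{
    std::vector<int> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = i % 32;  // values on both sides of 16

    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    cudaError_t err = cudaMalloc(&d_in, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_b, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        with_branch<<<blocks, threads>>>(d_in, d_a, n);
        with_select<<<blocks, threads>>>(d_in, d_b, n);
        err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);  // cudaFree(nullptr) is a no-op, so this is safe on error
    cudaFree(d_a);
    cudaFree(d_b);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_a[i] != h_b[i]) ++mismatches;
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", count_mismatches(1000));
    return 0;
}
```

The point of the sketch is only that the two forms are functionally equivalent, which is what allows the compiler to pick the cheaper one; it is not a claim about which instructions a particular toolchain emits.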

My usual recommendation is to write CUDA code in a natural fashion, and start worrying about branch divergence only when the CUDA profiler indicates it is a non-trivial detractor from application-level performance.

Thank you.