Using nvvp, I have identified two lines in my kernel that gave me high divergence.
The two lines are actually pretty simple. In the first one, I want to find the location of the minimum value in a float3 vector. I used
where htime is a float pointer pointing to 3 numbers in register space.
The “?:” operation in the second line caused divergence 90% of the time.
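One possible branch-free formulation of the minimum search is sketched below. It is a minimal sketch rather than the kernel's actual code: `min_index3` and the `HD` macro are hypothetical names introduced here, and the three values are passed as scalars so that they can stay in registers. Simple ternaries like these are usually compiled to select instructions rather than branches, but that is up to the compiler and worth checking in the generated SASS.

```cpp
// HD expands to the CUDA qualifiers under nvcc and to nothing on a plain
// host compiler, so the same helper can be unit-tested on the CPU.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: index (0, 1 or 2) of the smallest of three floats.
// Ties resolve to the lowest index.
HD inline int min_index3(float a, float b, float c) {
    int   idx  = (b < a) ? 1 : 0;   // which of a, b is smaller
    float best = (b < a) ? b : a;   // the smaller value itself
    idx = (c < best) ? 2 : idx;     // does c beat the current minimum?
    return idx;
}
```

If htime holds the three values, the call inside the kernel could look like `*minloc = min_index3(htime[0], htime[1], htime[2]);`. Indexing with compile-time constants like this typically lets the compiler keep the array in registers.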
The second one is slightly more complicated: I call a custom math function (a replacement for nextafter) inside another “?:” operator:
(*minloc==0) ? (htime=mcx_nextafterf(__float2int_rn(htime), (v->x > 0.f)-(v->x < 0.f))) : ((*minloc==1) ? (htime=mcx_nextafterf(__float2int_rn(htime), (v->y > 0.f)-(v->y < 0.f))) : (htime=mcx_nextafterf(__float2int_rn(htime), (v->z > 0.f)-(v->z < 0.f))) );
This again gave me 90% divergence.
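Since all three branches of the expression above make the same `mcx_nextafterf` call and differ only in which component of `v` supplies the sign, one option is to select the component first and make the call once. The sketch below assumes that reading of the code. `Vec3` is a stand-in for `float3` so that the snippet compiles without CUDA headers, and `sign_of_component` and `HD` are hypothetical names introduced here.

```cpp
// HD expands to the CUDA qualifiers under nvcc and to nothing on a plain
// host compiler, so the same helper can be unit-tested on the CPU.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Stand-in for CUDA's float3 (assumption: only x, y, z are needed).
struct Vec3 { float x, y, z; };

// Hypothetical helper: sign (-1, 0 or +1) of the component of v chosen by
// minloc (0 -> x, 1 -> y, anything else -> z). The component is picked with
// scalar selects instead of array indexing, so nothing has to be spilled
// to local memory just to be indexed.
HD inline int sign_of_component(const Vec3& v, int minloc) {
    float c = (minloc == 0) ? v.x : ((minloc == 1) ? v.y : v.z);
    return (c > 0.f) - (c < 0.f);
}

// Intended use in the kernel (mcx_nextafterf is the poster's own function
// and is not defined here):
//   htime = mcx_nextafterf(__float2int_rn(htime), sign_of_component(*v, *minloc));
```

With this shape the expensive call appears once on a path that every thread takes, and only the cheap component selection depends on `minloc`. Whether that removes the reported divergence would still need to be confirmed in the profiler.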
In addition, the short function (23 lines total) that contains both of these cases is a hotspot of my code, taking about 1/10 of the run-time. According to PC sampling profiling, 64% of the function's latency is due to execution dependency and 23% is due to instruction fetch. I suspect those were also caused by the two ?: operators above.
My questions are:
1. Is there a way to optimize the above code to avoid the divergence? I tried using minloc as an index to avoid the second ?:, but that puts my htime array in local memory (instead of registers).
2. Even if I can find a way to avoid divergence in the above cases, do you think it is likely to have a major impact on execution efficiency? The expressions involved are fairly short.
Happy to hear what you think about this.