Using nvvp, I have identified two lines in my kernel that gave me high divergence.
Both lines are fairly simple. In the first, I want to find the index of the minimum value in a float3 vector. I used
minval=fminf(fminf(htime[0],htime[1]),htime[2]);
(*minloc)=(minval==htime[0]?0:(minval==htime[1]?1:2));
where htime is a float pointer to 3 numbers held in registers.
The “?:” operation in the second line caused divergence 90% of the time.
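For reference, one branch-free way to express the same index search is sketched below. It is only an illustration, not something I have profiled: the function name argmin3 and the HD macro are made up for this example, and it assumes none of the three values is NaN. It derives the index from comparison results with arithmetic only, and it uses constant indices so the array can stay in registers. Ties resolve to the lowest index, as in my original expression.

```cpp
#include <math.h>

// Lets the same sketch compile as CUDA device code or as plain host C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: index (0, 1 or 2) of the smallest of three floats.
// Ties go to the lowest index; NaN inputs are not handled.
HD inline int argmin3(const float h[3]) {
    int loc = (h[1] < h[0]);            // 1 if h[1] is strictly smaller, else 0
    const float m = fminf(h[0], h[1]);  // running minimum of the first two
    loc += (h[2] < m) * (2 - loc);      // jump to 2 only if h[2] is strictly smaller
    return loc;
}
```

Whether this actually beats the nested ?: depends on what the compiler generates for each form, so the SASS would need to be checked.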
The second case is slightly more complicated: I call a custom math function (a replacement for nextafter) inside another “?:” operator:
(*minloc==0) ?
(htime[0]=mcx_nextafterf(__float2int_rn(htime[0]), (v->x > 0.f)-(v->x < 0.f))) :
((*minloc==1) ?
(htime[1]=mcx_nextafterf(__float2int_rn(htime[1]), (v->y > 0.f)-(v->y < 0.f))) :
(htime[2]=mcx_nextafterf(__float2int_rn(htime[2]), (v->z > 0.f)-(v->z < 0.f))) );
This again gave me 90% divergence.
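A possible restructuring, again only as an unprofiled sketch, is to hoist the function call out of the three branches: select the input and the direction first, call the function once, then write the result back with selects on constant indices. In the code below, step_toward is a hypothetical stand-in for mcx_nextafterf (built on the standard nextafterf, so it is not equivalent to my custom function), rintf stands in for __float2int_rn, vdir[3] stands in for v->x, v->y, v->z, and advance_component and HD are names invented for this example.

```cpp
#include <math.h>

// Lets the same sketch compile as CUDA device code or as plain host C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical stand-in for mcx_nextafterf: moves a by one representable
// float in the direction dir (-1, 0 or +1). Assumes |a| is small enough
// that a + dir differs from a whenever dir is nonzero.
HD inline float step_toward(float a, int dir) {
    return nextafterf(a, a + (float)dir);
}

// Advance only the component selected by loc (0 or 1; anything else means 2).
HD inline void advance_component(float ht[3], const float vdir[3], int loc) {
    const bool is0 = (loc == 0), is1 = (loc == 1);
    // Pick the inputs with selects; only constant indices touch the arrays.
    const float h = is0 ? ht[0] : (is1 ? ht[1] : ht[2]);
    const float d = is0 ? vdir[0] : (is1 ? vdir[1] : vdir[2]);
    const int sgn = (d > 0.f) - (d < 0.f);
    // Single call site, executed identically by every thread.
    const float r = step_toward(rintf(h), sgn);
    // Write back: only the selected component changes.
    ht[0] = is0 ? r : ht[0];
    ht[1] = is1 ? r : ht[1];
    ht[2] = (is0 || is1) ? ht[2] : r;
}
```

The idea is that only cheap selects depend on minloc, while the comparatively expensive call runs once for all threads. It costs a few extra moves per thread, so it may or may not be a net win.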
In addition, the short function (23 lines total) that contains both of these cases is a hotspot of my code, taking about 1/10 of the run-time. According to PC sampling, 64% of its stall latency is due to execution dependency and 23% to instruction fetch. I suspect these are also caused by the two ?: operators above.
My questions are:
- Is there a way to optimize the above code to avoid the divergence? I tried using minloc as an index to avoid the second ?:, but that puts my htime array in local memory (instead of registers).
- Even if I can find a way to avoid divergence in the above cases, is it likely to make a major impact on execution efficiency? The expressions involved are rather short.
Happy to hear what you think about this.