I am trying to optimize my code (I’m relatively new to CUDA) and I have found that I have warp states in the “stall wait” state.
“On average, each warp of this kernel spends 13.6 cycles being stalled on a fixed latency execution dependency. This represents about 71.9% of the total average of 18.9 cycles between issuing two instructions.”
Looking into the Source Counters, I find that these occur in the lines I am calculating a value using float math functions such as expf(), sqrtf(), and logf(). Looking further, it seems that most of the stall states are happening. In the assembly, the stalls seem to be concentrated on “LDL” instructions, which gives “Load within Local Memory Window” from the pointer help message.
For the line: return A*(8.0f)(1.0f/(sqrtf(2.0fM_PIF)sigma))expf(-(E - E0)(E - E0)/(2.0fsigmasigma)) + aE + b; there are about twelve "LDL"s at the end that contribute the most to the warp stall state.
The GPU throughput is less than 50 percent, so can these LDL’s be optimized somehow or are they just part of calculating these math functions?