Warp stalls are concentrated on "LDL" instructions

I am trying to optimize my code (I’m relatively new to CUDA), and I have found that my warps spend most of their stalled time in the “stall wait” state.

“On average, each warp of this kernel spends 13.6 cycles being stalled on a fixed latency execution dependency. This represents about 71.9% of the total average of 18.9 cycles between issuing two instructions.”

Looking into the Source Counters, I find that these stalls occur on the lines where I calculate a value using float math functions such as expf(), sqrtf(), and logf(). Looking further, in the assembly, most of the stalls seem to be concentrated on “LDL” instructions, which the hover help message describes as “Load within Local Memory Window”.

For the line

    return A*(8.0f)*(1.0f/(sqrtf(2.0f*M_PIF)*sigma))*expf(-(E - E0)*(E - E0)/(2.0f*sigma*sigma)) + a*E + b;

there are about twelve "LDL"s at the end that contribute the most to the warp stall state.

The GPU throughput is less than 50 percent, so can these LDL’s be optimized somehow or are they just part of calculating these math functions?

An average “stall wait” per instruction of 13.6 cycles and a high number of LDL are indicators that you are profiling a debug kernel. Please check that you are building with optimizations enabled.
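As a rough way to check this, the sketch below wraps an expression of the same shape as the one in the question in a small kernel, with build commands in the header comment. The file name, function names, and parameter types are assumptions made for illustration, and M_PIF is assumed to be the poster's own float constant for pi. With nvcc, -G produces a device debug build with optimizations turned off, which typically keeps local variables in local memory and so shows up as LDL/STL instructions in the SASS; an optimized build would usually keep them in registers.

```cuda
// peak.cu (hypothetical file name; every identifier below is illustrative)
//
// Compile the same source twice and compare the SASS:
//   nvcc -G -c peak.cu -o peak_debug.o               // device debug, optimizations off
//   nvcc -O3 -lineinfo -c peak.cu -o peak_release.o  // optimized, keeps source-line mapping
//   cuobjdump -sass peak_debug.o   | grep LDL
//   cuobjdump -sass peak_release.o | grep LDL
#include <cassert>
#include <cmath>

// Assumption: M_PIF is a user-defined float constant for pi.
#ifndef M_PIF
#define M_PIF 3.14159265358979f
#endif

// Gaussian-shaped peak plus a linear term, mirroring the expression in the question.
// All parameters are assumed to be float.
__host__ __device__ inline float peak_model(float E, float A, float E0,
                                            float sigma, float a, float b)
{
    float norm = 1.0f / (sqrtf(2.0f * M_PIF) * sigma);  // 1 / (sqrt(2*pi) * sigma)
    float d = E - E0;                                   // distance from the peak centre
    return A * 8.0f * norm * expf(-d * d / (2.0f * sigma * sigma)) + a * E + b;
}

// One thread per element; in an optimized build the temporaries above
// would normally live in registers rather than local memory.
__global__ void eval_peak(const float* E, float* out, int n,
                          float A, float E0, float sigma, float a, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = peak_model(E[i], A, E0, sigma, a, b);
    }
}

// Host-side sanity check: at E == E0 with no linear term the value is 8*A / (sqrt(2*pi)*sigma).
inline void peak_model_self_check()
{
    float at_peak = peak_model(5.0f, 1.0f, 5.0f, 1.0f, 0.0f, 0.0f);
    assert(fabsf(at_peak - 8.0f / sqrtf(2.0f * M_PIF)) < 1e-4f);
}
```

If the debug object shows many LDL lines for the kernel and the optimized one shows few or none, the stalls reported by the profiler most likely come from the build configuration rather than from the math functions themselves.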

If you are already building with optimizations enabled, then you will likely need to post an Nsight Compute report and/or a minimal reproducible example for someone to be able to provide actionable advice.

I was profiling a debug kernel. I compiled with optimizations and the code executed much faster (about 10x), and the LDL instructions were no longer the main source of “stall wait”. Thank you!
