Warp stalls are concentrated on "LDL" instructions

burnt · April 17, 2023, 6:42pm

I am trying to optimize my code (I’m relatively new to CUDA) and I have found that my warps are often in the “stall wait” state.

“On average, each warp of this kernel spends 13.6 cycles being stalled on a fixed latency execution dependency. This represents about 71.9% of the total average of 18.9 cycles between issuing two instructions.”

Looking into the Source Counters, I find that these stalls occur on the lines where I calculate a value using float math functions such as expf(), sqrtf(), and logf(). Looking further, in the assembly, most of the stalls seem to be concentrated on “LDL” instructions, which the pointer help message describes as “Load within Local Memory Window”.

For the line: return A*(8.0f)*(1.0f/(sqrtf(2.0f*M_PIF)*sigma))*expf(-(E - E0)*(E - E0)/(2.0f*sigma*sigma)) + a*E + b; there are about twelve "LDL"s at the end that contribute the most to the warp stall state.

The GPU throughput is less than 50 percent, so can these LDLs be optimized somehow, or are they just part of calculating these math functions?

Greg · April 17, 2023, 9:15pm

An average “stall wait” per instruction of 13.6 cycles and a high number of LDL are indicators that you are profiling a debug kernel. Please check that you are building with optimizations enabled.
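As a rough illustration of the difference, here is a minimal sketch built around the expression from the question. The file name, kernel name, function name, parameter values, and the M_PIF definition are all assumptions made for the example and are not taken from the original code. Compiling with -G turns off device optimizations and tends to keep local variables in local memory, which shows up as LDL/STL in the SASS; without -G, nvcc optimizes device code by default.

```cuda
// gauss.cu: illustrative sketch only; all names here are hypothetical.
//
// Debug build (device optimizations disabled, expect many LDL/STL):
//   nvcc -G -o gauss_dbg gauss.cu
// Optimized build that keeps source line correlation for Nsight Compute:
//   nvcc -lineinfo -o gauss_opt gauss.cu
// Compare the generated SASS:
//   cuobjdump -sass gauss_dbg | grep LDL
//   cuobjdump -sass gauss_opt | grep LDL
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#ifndef M_PIF
#define M_PIF 3.14159265358979f  // assumed definition; not shown in the thread
#endif

// Gaussian peak on a linear background, following the expression in the question.
__host__ __device__ inline float peak(float E, float A, float E0,
                                      float sigma, float a, float b)
{
    float d = E - E0;
    return A * 8.0f * (1.0f / (sqrtf(2.0f * M_PIF) * sigma))
             * expf(-d * d / (2.0f * sigma * sigma)) + a * E + b;
}

// One thread evaluates one energy value.
__global__ void evalPeak(const float* E, float* out, int n, float A,
                         float E0, float sigma, float a, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = peak(E[i], A, E0, sigma, a, b);
    }
}

int main()
{
    const int n = 1 << 20;
    float* E = nullptr;
    float* out = nullptr;
    if (cudaMallocManaged(&E, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) {
        printf("allocation failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        E[i] = 0.001f * i;  // arbitrary test input
    }

    evalPeak<<<(n + 255) / 256, 256>>>(E, out, n, 1.0f, 500.0f, 10.0f, 0.01f, 0.5f);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("out[0] = %f\n", out[0]);
    cudaFree(E);
    cudaFree(out);
    return 0;
}
```

Profiling both binaries with Nsight Compute should make it clear whether the LDL-heavy stall pattern is tied to the debug build rather than to the math functions themselves.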

If you are building with optimizations enabled, then you will likely need to post an Nsight Compute report and/or a minimal reproducible example for someone to provide actionable advice.

burnt · April 27, 2023, 8:10pm

I was profiling a debug kernel. I compiled with optimizations and the code executed much faster (about 10x), and the ‘LDL’ instructions were no longer the main source of stall wait. Thank you!
