Optimize CUDA kernel with low eligible warps and Stall Long Scoreboard

paoxiaode · July 11, 2023, 8:37am

Hi all, I am trying to optimize a CUDA kernel with Nsight Compute. The profile report shows that the kernel is memory bound, with low eligible warps and a high Stall Long Scoreboard:
Roofline model: (screenshot)

Low eligible warps: (screenshot)

Stall Long Scoreboard: (screenshot)

So I checked the source code and found that the main contributor to the warp stalls is code like this:

loop on j:
{
    int cid = col_ind[j];
    result = scalar * weight[cid * hf + hfid];
}

The warp stall happens on the access to weight in global memory. Each element weight[cid * hf + hfid] is accessed only once.

How can I reduce this warp stall? Is there anything like a prefetch for global memory?
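
To make the question concrete, here is a minimal sketch of one form of software prefetching that might apply to a loop like this: load several indices and several weight values into registers before consuming any of them, so each thread has multiple independent global loads in flight. This is not the actual kernel. The names row_ptr, out, gather_rows_batched, and run_gather, the one-block-per-row layout, and the += accumulation are all assumptions made for illustration; only col_ind, weight, hf, hfid, cid, and scalar come from the snippet above. The compiler may already unroll the loop in a similar way, so any benefit would need to be confirmed in Nsight Compute.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one block row per matrix row, one thread per feature (hfid).
// row_ptr, out, and the += accumulation are assumptions, not from the post.
__global__ void gather_rows_batched(const int* __restrict__ row_ptr,
                                    const int* __restrict__ col_ind,
                                    const float* __restrict__ weight,
                                    float* __restrict__ out,
                                    float scalar, int hf)
{
    const int row  = blockIdx.x;
    const int hfid = blockIdx.y * blockDim.x + threadIdx.x;
    if (hfid >= hf) return;

    const int begin = row_ptr[row];
    const int end   = row_ptr[row + 1];
    float acc = 0.0f;

    int j = begin;
    // Batched loop: 4 independent index loads, then 4 independent gathers,
    // and only then the arithmetic that depends on them.
    for (; j + 4 <= end; j += 4) {
        const int c0 = col_ind[j + 0];
        const int c1 = col_ind[j + 1];
        const int c2 = col_ind[j + 2];
        const int c3 = col_ind[j + 3];
        const float w0 = __ldg(&weight[(size_t)c0 * hf + hfid]);
        const float w1 = __ldg(&weight[(size_t)c1 * hf + hfid]);
        const float w2 = __ldg(&weight[(size_t)c2 * hf + hfid]);
        const float w3 = __ldg(&weight[(size_t)c3 * hf + hfid]);
        acc += scalar * w0;
        acc += scalar * w1;
        acc += scalar * w2;
        acc += scalar * w3;
    }
    // Tail loop: same access pattern as the original snippet.
    for (; j < end; ++j) {
        const int cid = col_ind[j];
        acc += scalar * __ldg(&weight[(size_t)cid * hf + hfid]);
    }
    out[(size_t)row * hf + hfid] = acc;
}

// Hypothetical host helper for trying the kernel on small inputs.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_gather(const std::vector<int>& row_ptr,
                              const std::vector<int>& col_ind,
                              const std::vector<float>& weight,
                              float scalar, int hf)
{
    const int rows = static_cast<int>(row_ptr.size()) - 1;
    std::vector<float> out((size_t)rows * hf);
    int *d_rp = nullptr, *d_ci = nullptr;
    float *d_w = nullptr, *d_out = nullptr;
    cudaMalloc(&d_rp, row_ptr.size() * sizeof(int));
    cudaMalloc(&d_ci, col_ind.size() * sizeof(int));
    cudaMalloc(&d_w, weight.size() * sizeof(float));
    cudaMalloc(&d_out, out.size() * sizeof(float));
    cudaMemcpy(d_rp, row_ptr.data(), row_ptr.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ci, col_ind.data(), col_ind.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_w, weight.data(), weight.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    dim3 grid(rows, (hf + threads - 1) / threads);
    gather_rows_batched<<<grid, threads>>>(d_rp, d_ci, d_w, d_out, scalar, hf);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_rp); cudaFree(d_ci); cudaFree(d_w); cudaFree(d_out);
    return out;
}
```

The batch size of 4 is arbitrary. A larger batch keeps more loads in flight but uses more registers, which can lower occupancy, so it would have to be tuned against the eligible-warps metric.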

Any advice would be appreciated!
