I get a large warp stall (~160) when returning from a __device__
function, according to Nsight Compute.
RET.ABS.NODEC R20 0x0
What is the reason? How can I avoid this?
Depending on the GPU, it might be that the warp needs to return “together”. In that case, try a newer GPU, or make the work done by each thread in the warp approximately the same, so that the threads all finish together.
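To make the "approximately the same work per thread" advice concrete, here is a rough host-side sketch in C++. It assumes a deliberately simplified model in which a warp stays busy until its slowest lane finishes; real hardware scheduling is more involved. The names `kWarpSize` and `lane_utilization` are made up for illustration and are not part of CUDA or Nsight Compute.

```cpp
#include <algorithm>
#include <array>
#include <numeric>

constexpr int kWarpSize = 32;  // lanes per warp on current NVIDIA GPUs

// Simplified model (an assumption, not a hardware description):
// the warp is occupied for as many iterations as its slowest lane needs,
// so utilization = (sum of per-lane iterations) / (32 * longest lane).
// Returns a value in (0, 1]; 1.0 means every lane does the same amount of work.
double lane_utilization(const std::array<int, kWarpSize>& iterations_per_lane) {
    const int longest = *std::max_element(iterations_per_lane.begin(),
                                          iterations_per_lane.end());
    if (longest <= 0) {
        return 1.0;  // no work at all, nothing is wasted
    }
    const long long useful = std::accumulate(iterations_per_lane.begin(),
                                             iterations_per_lane.end(), 0LL);
    return static_cast<double>(useful) /
           (static_cast<double>(kWarpSize) * static_cast<double>(longest));
}
```

Under this model, a warp whose 32 lanes each run 10 iterations scores 1.0, while a warp in which a single lane runs 320 iterations and the rest run none scores 1/32. The closer the per-lane counts are, the less time the other lanes spend waiting for the return.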
I am on an RTX 3080. So can we say that this is a thread divergence issue?
If I wanted to know the reason for warp stalls on a particular line of code, I would use Nsight Compute to answer that. Open the Source page and find the line of code. It should give a color-coded Pareto of the warp stall reasons on that line.
It is almost impossible to provide help without a minimal reproducible example or the report. Providing only the SASS instruction and a sample count is not helpful. The sample data for the instruction will list the warp stall reason.
When determining whether you have a performance issue, it is useful to compare the number of samples at the instruction with all samples, for both the Warp Stall Sampling (All Cycles / All Samples) and the Warp Stall Sampling (No Issue) columns. If No Issue is low as a percentage of all samples, it is not worth investigating: you are unlikely to gain performance, because you have enough warps at that point to cover the stalls.
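The comparison above is plain arithmetic, so a small C++ helper is enough to illustrate it. This is only a sketch: `sample_share_percent` is a hypothetical name, the counts are assumed to be copied by hand from the Source page, and no cut-off for "low" is implied because the post does not give one.

```cpp
#include <cstdint>

// Share of samples that land on one instruction, in percent.
// Both arguments are assumed to be read manually from the Nsight Compute
// Source page; this helper does not parse a report file.
//   samples_at_instruction: samples in one column for the instruction
//   samples_total:          total samples in the kernel to compare against
double sample_share_percent(std::uint64_t samples_at_instruction,
                            std::uint64_t samples_total) {
    if (samples_total == 0) {
        return 0.0;  // nothing sampled, so there is nothing to compare
    }
    return 100.0 * static_cast<double>(samples_at_instruction) /
           static_cast<double>(samples_total);
}
```

For example, with made-up numbers, 160 samples at the RET instruction out of 8000 samples in the kernel gives `sample_share_percent(160, 8000)`, which is 2 percent. Running the same calculation with the No Issue count for the instruction shows how much of the kernel's time could actually be recovered there.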