Large Warp Stall When Returning From Function

I get a large warp stall (~160) when returning from a __device__ function, according to Nsight Compute.

RET.ABS.NODEC R20 0x0

What is the reason? How can I avoid this?

Depending on the GPU, it might be that the warp needs to return “together”. In that case, try a newer GPU, or make the work done by each thread in the warp approximately the same, so that the threads all finish together.
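To make the "approximately the same work per thread" suggestion concrete, here is a minimal sketch. It is only an illustration of the idea, not a diagnosis of the original kernel: the names `accumulate`, `trip_count_kernel`, `run_with_trip_counts`, `imbalanced_counts` and `balanced_counts` are all hypothetical, and `__noinline__` is used only so that the call and return remain visible in the SASS.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical device function. __noinline__ keeps the call/return in the
// generated SASS so the return instruction can be inspected in the profiler.
__device__ __noinline__ float accumulate(float x, int iters) {
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i) {
        acc += x;  // trip count is data dependent
    }
    return acc;
}

// Each thread runs accumulate() with its own trip count.
__global__ void trip_count_kernel(float* out, const int* iters, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        out[tid] = accumulate(0.5f, iters[tid]);
    }
}

// Imbalanced assumption: lanes of one warp get trip counts 1..32,
// so lanes in the same warp finish the loop at different times.
std::vector<int> imbalanced_counts(int n) {
    std::vector<int> v(n);
    for (int i = 0; i < n; ++i) v[i] = (i % 32) + 1;
    return v;
}

// Balanced assumption: every lane gets the same trip count
// (16, close to the 16.5 average of the imbalanced case).
std::vector<int> balanced_counts(int n) {
    return std::vector<int>(n, 16);
}

// Host wrapper: copies trip counts to the device, launches the kernel and
// returns the results. Error checking is omitted to keep the sketch short.
std::vector<float> run_with_trip_counts(const std::vector<int>& iters) {
    int n = static_cast<int>(iters.size());
    std::vector<float> out(n, 0.0f);
    if (n == 0) return out;

    int* d_iters = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_iters, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_iters, iters.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    trip_count_kernel<<<blocks, threads>>>(d_out, d_iters, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_iters);
    cudaFree(d_out);
    return out;
}
```

Profiling `run_with_trip_counts(imbalanced_counts(n))` against `run_with_trip_counts(balanced_counts(n))` for some large `n` is one way to see whether uneven per-lane work changes the stall samples around the return; whether it does will depend on the GPU and on what the compiler generates.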

I am on an RTX 3080. So can we say that this is a thread divergence issue?

I guess if I wanted to know the reason for warp stalls on a particular line of code, I would use Nsight Compute to answer that. Open the Nsight Compute Source page and find the line of code. It should give a color-coded Pareto of warp stall reasons for that line.

It is almost impossible to provide help without a minimal reproducible example or the report. Providing the SASS instruction and a sample count is not helpful. The sample count will list the warp stall reason.

When determining whether you have a performance issue, it is useful to compare the number of samples at the instruction to all samples, for both the Warp Stall Sampling (All Cycles / All Samples) and the Warp Stall Sampling (No Issue) columns. If No Issue is low as a percentage of all samples, then it is not worth investigating: you are unlikely to gain performance, because you have sufficient warps at that point to cover the stalls.
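As a rough sketch of that comparison, the share can be computed as a simple percentage. The helper name `sample_share_percent` and all counts below are made up for illustration; they are not taken from the original report, and what counts as "low" is a judgment call.

```cuda
// Hypothetical helper: percentage of all samples attributed to one instruction.
// Returns 0 when there are no samples, to avoid dividing by zero.
double sample_share_percent(unsigned long long at_instruction,
                            unsigned long long total) {
    if (total == 0) return 0.0;
    return 100.0 * static_cast<double>(at_instruction)
                 / static_cast<double>(total);
}
```

For example, with a hypothetical 160 samples on the return instruction out of 40000 samples in the whole kernel, `sample_share_percent(160, 40000)` gives 0.4, which would suggest the instruction is not where the time is going. The same calculation can be repeated with the No Issue counts.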