Stall reasons

The stall reasons reported by nvprof are at the warp level. The reported values may be misleading, because while one warp is stalled there may be other warps ready to execute. So it is unclear how much it is worth investigating the stall reasons.

Maybe warp execution efficiency or eligible warps per active cycle can shed some light on that. For one kernel I see:

Eligible warps per active cycles = 0.44
Warp execution efficiency = 49%
Stall (exec dep) = 11%
Stall (data req) = 34%
Stall (immediate) = 20%
Stall (Fetch) = 32%

Other stall reasons are small.
I would like to know which stall type is actually the main bottleneck. For example, maybe the immediate stalls can be hidden by other warps, but I can't tell that from the stats above.

Any comments?
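
One rough way to read the numbers above, as a sketch rather than a definitive analysis: a cycle in which some warp is eligible contributes at least 1 to the eligible-warps average, so an average of 0.44 means a warp was eligible in at most 44% of active cycles. The stalls are therefore not being fully hidden by other warps. The variable names below are made up for illustration, and the percentages are simply copied from the list above. The ranking only orders the reported stall percentages; it does not prove which one limits performance.

```python
# Values copied from the profiler numbers quoted in the question.
eligible_warps_per_active_cycle = 0.44
stall_percent = {
    "exec dep": 11,
    "data req": 34,
    "immediate": 20,
    "fetch": 32,
}

# A cycle with an eligible warp adds at least 1 to the average, so the
# average is an upper bound on the fraction of cycles that could issue.
max_issue_fraction = min(1.0, eligible_warps_per_active_cycle)
min_idle_fraction = 1.0 - max_issue_fraction

# Order the reported stall reasons from largest to smallest share.
ranked = sorted(stall_percent.items(), key=lambda kv: kv[1], reverse=True)
listed_total = sum(stall_percent.values())

print(f"cycles with an eligible warp: at most {max_issue_fraction:.0%}")
print(f"cycles with no eligible warp: at least {min_idle_fraction:.0%}")
for reason, pct in ranked:
    print(f"stall ({reason}): {pct}%")
print(f"listed reasons cover {listed_total}% of stall samples")
```

Read this way, latency hiding is clearly incomplete for this kernel, so the larger stall categories (data request and fetch here) seem like the more promising places to look first.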

You can start by disabling swap on the machine. I found that swapping stalls any CUDA computation.

That 200 ms GPU stall on cuCtxCreate could also be related to