Description of stalls in nvprof

The list of execution stall reasons in [1] is incomplete: nvprof reports two more stall metrics on compute capability 5.x devices. They are:

Warp not selected (stall_not_selected)
Miscellaneous (stall_other)

The description in [2] is not very meaningful.
1- In what circumstances is a warp not selected? For example, if it is waiting for data from memory (load/store), stall_memory_dependency already covers that. If it is waiting for an instruction fetch, stall_inst_fetch already covers that.

2- What does “other” mean exactly? Is there an example of it?

[1] https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/issueefficiency.htm
[2] https://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference-5x

Presumably this means the warp was ready to issue but the scheduler issued a different warp in that cycle, i.e. there were more warps ready to issue than there were execution resources available. In cases of such conflicts, I would expect the scheduler to pick the “oldest ready” warp, but that is pure speculation.
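To make the "more ready warps than issue slots" idea concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the `simulate` function, the `ready_schedule` input, and the "oldest ready" tie-breaking rule are all hypothetical and do not describe NVIDIA's actual scheduler, which is not documented at this level.

```python
from collections import Counter


def simulate(ready_schedule, issue_slots=1):
    """Toy warp scheduler (hypothetical, not NVIDIA's real policy).

    ready_schedule[c] is the set of warp IDs that are ready to issue in
    cycle c. Each cycle, at most `issue_slots` ready warps are issued.
    Every other ready warp is counted as "not selected" for that cycle.
    """
    issued = Counter()
    not_selected = Counter()
    last_issue = {}  # warp ID -> cycle in which it last issued

    for cycle, ready in enumerate(ready_schedule):
        # Assumed "oldest ready" policy: prefer the warp that issued
        # least recently; break ties by the lower warp ID.
        order = sorted(ready, key=lambda w: (last_issue.get(w, -1), w))
        for w in order[:issue_slots]:
            issued[w] += 1
            last_issue[w] = cycle
        for w in order[issue_slots:]:
            not_selected[w] += 1

    return issued, not_selected


if __name__ == "__main__":
    # Warps 0, 1 and 2 compete for a single issue slot.
    schedule = [{0, 1, 2}, {1, 2}, {2}, set(), {0, 1}]
    issued, not_selected = simulate(schedule)
    print("issued:", dict(issued))
    print("not selected:", dict(not_selected))
```

In this model a warp that is waiting on memory or on an instruction fetch simply does not appear in the ready set for that cycle, so it is never counted as "not selected". Only warps that could have issued but lost the slot to another warp are counted.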

In practical terms, this doesn’t seem like something for programmers to worry about. Are you seeing a lot of these?

Miscellaneous stalls are likely any number of fairly rare events that are too difficult to explain, especially since NVIDIA doesn’t document the GPU microarchitecture in detail, thus providing little context.

“Other” reasons are outlined here:

cuda - What are "Other" Issue Stall Reasons displayed by the Nsight profiler? - Stack Overflow

@njuffa
Fortunately, the percentage is low (less than 2%).

@Robert_Crovella
Thanks for sharing that. It was useful.