Hello everybody!
I’m currently profiling an OptiX 7.0 application using Nsight Compute 2019.5. The application runs on a GeForce RTX 2080 Ti board.
I am encountering a large number of stalls. According to the “Details” page, section “Warp State Statistics”, roughly a third of the “warp cycles per issued instruction” is caused, on average, by stalls of type “misc”. Unfortunately, the description of such stalls is rather vague (“[…] warp […] being stalled on a miscellaneous hardware reason.”). Is there a more detailed listing of reasons for encountering such stalls, similar to [1]? [1] lists reasons for stalls of type “other” (which I think is equivalent to “misc”) for compute capabilities up to 6.*, but the board above implements compute capability 7.5.
Thanks for your help!
David
[1] “What are ‘Other’ Issue Stall Reasons displayed by the Nsight profiler?” - Stack Overflow
Hi David,
On Volta and Turing, the misc stall reason covers waiting on hardware resources that can only be accessed from library code, e.g. through OptiX, or when profiling debug code. When executing a kernel that mixes library code with your own device code, the kernel-level metrics on the Details page show the aggregated values for the whole kernel execution. That makes it more challenging to determine which parts are caused by the library and which parts are under your own control.
To aid the performance analysis in those scenarios, the Source View can be used to isolate your own code and get the stall reasons per user function. For that to work, make sure you compile your device code with -lineinfo, capture a report with Nsight Compute with the Source Counters section enabled, and switch to the Source page. The Sampling Data column shows the stall reasons per source line for your own code. You can quickly jump to the lines with the highest number of stalls using the navigation buttons at the top. If you have the latest version of Nsight Compute and your source code has multiple device functions, you can also use the [-] button to the left of the search field to aggregate the metrics per function. In many cases this can help you understand whether there is further potential to reduce stalls in the parts of the kernel you have direct source control over. A minimal sketch of the first two steps is below.
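For reference, a rough sketch of those two steps (the file, path, and application names are placeholders, and the exact section identifier can vary between versions; in Nsight Compute 2019.5 the command-line profiler is called nv-nsight-cu-cli):

# compile the OptiX device programs to PTX with line information
nvcc -ptx -lineinfo -I ${OPTIX_SDK}/include -o my_programs.ptx my_programs.cu

# capture a report that includes the Source Counters section, then open it in the UI
nv-nsight-cu-cli --section SourceCounters -o my_report ./my_app

Running nv-nsight-cu-cli --list-sections prints the section identifiers available in your version.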
Hi mstrengert,
I wasn’t aware of the fact that the Details page shows aggregated values for both my own device code and library code. This also explains why the figures reported on the Details page and the Source page deviate from one another. Thanks a lot for pointing this out to me, this really helped me!