Memory Workload Analysis related metrics

Greetings, I am trying to reconcile the output of some metrics given by Nsight Compute.

I was optimizing our kernels and got a decent jump in performance. I am trying to pin down whether it comes from reduced thread divergence or from reduced memory transfer overhead. For context, Warp State Statistics revealed issues with the CPI Stall 'Long Scoreboard':

On average each warp of this kernel spends ... cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. This represents about (some percent) of the total average of ... cycles between issuing two instructions.

Introducing optimizations only marginally improved the CPI Stall 'Long Scoreboard' numbers, so I attributed the performance boost to reduced thread divergence. However, opening Memory Workload Analysis showed a significant reduction in the number of instructions issued to the L1/L2 caches, e.g. Global Load Cached, Texture Load, etc.

My question is: how closely do the metrics presented in Memory Workload Analysis correlate with CPI Stall 'Long Scoreboard'? From what I gathered, the latter is about memory access overhead?

In the Warp State Statistics section, the stalls indicate how many cycles warps were stalled on average, per stall reason, over the duration of your kernel. You correctly found that Long Scoreboard stalls normally indicate that the warp was waiting on the result of a device memory load instruction (e.g. a "global load"). However, reducing stalls only translates into performance improvements if the 'No Eligible [%]' metric is high. Otherwise, the kernel already hides those latencies anyway: reducing stalls merely moves warps from one stall reason to 'Not Selected', since the scheduler is already fully busy issuing instructions.
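A toy scheduler model (my own sketch, not an Nsight Compute formula) illustrates why a long per-warp stall only costs issue slots when no other warp is eligible:

```python
# Toy model: one warp scheduler that can issue at most one instruction per
# cycle. After issuing, a warp is stalled for `stall_cycles` cycles (think of
# it as waiting on a long-scoreboard dependency) before it is eligible again.

def issue_rate(num_warps, stall_cycles, total_cycles=10_000):
    """Fraction of cycles in which the scheduler actually issued an instruction."""
    ready_at = [0] * num_warps  # cycle at which each warp becomes eligible again
    issued = 0
    for cycle in range(total_cycles):
        for w in range(num_warps):          # greedily pick any eligible warp
            if ready_at[w] <= cycle:
                ready_at[w] = cycle + 1 + stall_cycles
                issued += 1
                break
    return issued / total_cycles

# Few resident warps: long stalls leave the scheduler idle ("No Eligible" high):
print(issue_rate(num_warps=2, stall_cycles=40))   # → 0.0488
# Many resident warps: the same per-warp stall is fully hidden:
print(issue_rate(num_warps=64, stall_cycles=40))  # → 1.0
```

In the second case, reducing the stall length would not make the scheduler issue any faster; the waiting warps would simply show up under 'Not Selected' instead.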

If you reduced the number of instructions in total, it might still be the case in your optimized kernel that these instructions are stalled on average the same number of cycles due to Long Scoreboard stalls. However, since you execute fewer instructions overall, the kernel is also faster. It might be interesting to compare e.g. the data under Instruction Statistics.
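With made-up numbers (not taken from your report), the arithmetic behind this looks like:

```python
# Illustrative only: kernel duration scales roughly with the number of
# instructions issued times the average cycles between issues (the
# "warp cycles per issued instruction" idea from Warp State Statistics).

def kernel_cycles(instructions_issued, cycles_per_issue):
    return instructions_issued * cycles_per_issue

before = kernel_cycles(instructions_issued=1_000_000, cycles_per_issue=30)
after  = kernel_cycles(instructions_issued=600_000,  cycles_per_issue=30)

# Stall cycles per instruction are unchanged, yet the kernel is much faster:
print(f"speedup: {before / after:.2f}x")  # → speedup: 1.67x
```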

You can see which instructions in your code are stalled for which exact stall reason on the Source page, by selecting the "Sampling Data" metrics or one of the individual warp stall metrics. Note, though, that on this page the metrics are sampled across all warp schedulers and over the duration of your kernel, so they can be skewed for small kernels that don't produce enough samples.
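For example, the sampling data can also be collected from the command line (the application name `./my_app` and report name are placeholders):

```shell
# Collect the Source Counters section (per-instruction warp stall sampling
# data) and write a report file; afterwards open my_report.ncu-rep in the
# Nsight Compute UI and inspect the Source page with "Sampling Data" selected.
ncu --section SourceCounters -o my_report ./my_app
```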

Obviously, there can also be other explanations for the perf improvement, but this would be one I could infer from the data you have given.