Greetings, I am trying to reconcile the output of some metrics given by Nsight Compute.
I was optimizing our kernels and got a decent performance boost. I am trying to identify more precisely whether it comes from reduced thread divergence or reduced memory transfer overhead. Specifically, the Warp State Statistics section revealed issues with the CPI Stall 'Long Scoreboard' metric:

> On average, each warp of this kernel spends ... cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. This represents about (some percent) of the total average of ... cycles between issuing two instructions.
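For context on how I read that description: my understanding is that a 'Long Scoreboard' stall is charged at the first instruction that consumes the result of an outstanding L1TEX operation. A minimal, hypothetical kernel (not our actual code) where I would expect this stall to show up:

```cuda
// Hypothetical illustration, not our production kernel: a long-scoreboard
// stall occurs when the next instruction depends on an in-flight global load,
// so the warp waits on the scoreboard until L1TEX returns the data.
__global__ void dependent_load(const float* __restrict__ in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];    // global load tracked by the scoreboard
        out[i] = v * 2.0f;  // first use of v: warp stalls here until the load completes
    }
}
```

If that reading is right, the stall count is a latency symptom, while the Memory Workload Analysis counters below are throughput/instruction counts, which is exactly the relationship I am unsure about.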
Introducing optimizations only marginally improved the CPI Stall 'Long Scoreboard' figure, so I attributed the performance boost to reduced thread divergence. However, opening the Memory Workload Analysis section showed a significant reduction in the number of instructions issued to the L1/L2 caches, e.g. Global Load Cached, Texture Load, etc.
My question is: how closely do the metrics presented in Memory Workload Analysis correlate with the CPI Stall 'Long Scoreboard'? From what I gathered, the latter measures memory access latency?