I’ve run Nsight Compute on my kernel, and I can see a large warp stall from “Stall Long Scoreboard”. I know how to optimize it.
However, before optimizing, is there a metric in Nsight Compute that can tell me how much gain I would get after optimizing all the stalls?
For example, if Stall Long Scoreboard accounts for 13.7 cycles per instruction, how much gain can I expect after optimizing it? Is there a theoretical way to estimate that?
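
To make the question concrete, here is the kind of back-of-the-envelope estimate I have in mind. It is only a sketch under strong assumptions: that the per-instruction stall cycles add up linearly to the total warp cycles per issued instruction, that removing one stall reason does not expose another, and that the kernel is latency bound so the scheduler could actually use the freed cycles. The function name and the total of 20.0 cycles are hypothetical; only the 13.7 comes from my profile.

```python
def naive_speedup_bound(total_cycles_per_inst: float,
                        stall_cycles_per_inst: float) -> float:
    """Optimistic upper bound on speedup if one stall reason vanished entirely.

    Assumes stall reasons are additive and independent, which real kernels
    generally do not satisfy, so treat the result as a ceiling, not a forecast.
    """
    if not 0.0 <= stall_cycles_per_inst < total_cycles_per_inst:
        raise ValueError("stall cycles must be non-negative and below the total")
    remaining = total_cycles_per_inst - stall_cycles_per_inst
    return total_cycles_per_inst / remaining


# 13.7 is the Stall Long Scoreboard value from my profile.
# 20.0 is a made-up total (hypothetical) used only to illustrate the arithmetic.
bound = naive_speedup_bound(total_cycles_per_inst=20.0,
                            stall_cycles_per_inst=13.7)
print(f"naive upper bound: {bound:.2f}x")
```

What I am asking is whether Nsight Compute reports something better than this naive ratio, since I expect the real gain to be smaller once other stalls or throughput limits take over.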