Long scoreboard stall meanings?

Many GPU algorithms are fundamentally limited by the memory resources of the GPU - memory bandwidth and memory latency.

Therefore, one of the most important stall reasons is the Long Scoreboard stall.

I was wondering what exactly this stall means. For example, what is the timeline of an LDG instruction, and is there a way to distinguish long scoreboard stalls caused by:

  1. Memory throughput (e.g. L1/L2/gmem throughput)
  2. Latency (e.g. cache misses, LG Throttle)
  3. Instruction throughput (if all the data fits in L1, I guess the limit could be that LDG instructions have lower throughput than floating-point arithmetic instructions?).

The SMs count Long Scoreboard stalls on each cycle in which they are waiting for a data read to come back from memory. They don’t have direct visibility into what is taking so long, so as a user you need to look at other metrics to infer the cause. For example, if you see a lot of long scoreboard stalls, you can look at the memory chart and tables to see whether a specific bus is saturated or whether there are a lot of cache misses. Those are the kinds of things that could cause the stalls. But there isn’t a one-to-one mapping that lets the hardware know the cause of any specific stall.
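To make that inference process a bit more concrete, here is a rough sketch of it as a small Python triage helper. Everything in it is an assumption for illustration: the `MemorySnapshot` field names are hypothetical (they are not real profiler metric names), the numbers are meant to be copied by hand from the memory chart and tables, and the thresholds are arbitrary cut-offs rather than values from any tool. It only suggests where to look next; it cannot identify the cause of a specific stall.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemorySnapshot:
    """Readings copied by hand from a profiler report (hypothetical field names)."""
    long_scoreboard_share: float  # fraction of stall samples that are long scoreboard, 0..1
    peak_bus_utilization: float   # busiest memory path (L1, L2, or DRAM) vs. its peak, 0..1
    l1_hit_rate: float            # 0..1
    l2_hit_rate: float            # 0..1


# Illustrative cut-offs only. These are assumptions, not values from any tool.
STALL_SHARE_MIN = 0.30
BUS_SATURATED = 0.80
HIT_RATE_LOW = 0.50


def likely_causes(s: MemorySnapshot) -> List[str]:
    """Return candidate explanations to investigate, in the order they are checked."""
    if s.long_scoreboard_share < STALL_SHARE_MIN:
        return ["not dominant: look at the other stall reasons first"]

    causes = []
    # A saturated bus points toward a throughput limit.
    if s.peak_bus_utilization >= BUS_SATURATED:
        causes.append("throughput: a memory path is close to saturated")
    # Low hit rates mean requests travel further, so each load waits longer.
    if s.l1_hit_rate < HIT_RATE_LOW or s.l2_hit_rate < HIT_RATE_LOW:
        causes.append("latency: many requests miss in cache")
    if not causes:
        causes.append("inconclusive: these numbers alone do not explain the stalls")
    return causes


if __name__ == "__main__":
    snapshot = MemorySnapshot(
        long_scoreboard_share=0.6,
        peak_bus_utilization=0.3,
        l1_hit_rate=0.2,
        l2_hit_rate=0.4,
    )
    for cause in likely_causes(snapshot):
        print(cause)
```

Note that more than one candidate can come back at once, and the "inconclusive" case is common in practice, which is why this still takes experimentation.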

Alright. I guess it takes experience and intuition, and lots of experimentation. Thank you!