How to understand the short scoreboard of sass code on H100 gpu?

I’m currently profiling a CUDA kernel that implements a GEMM pipeline using cp.async PTX instructions to load matrix data on H100. To evaluate performance, I’m periodically reading the global timer (SR_GLOBALTIMERLO) and storing its value in a register, which is later written to global memory.
While analyzing overhead using Nsight Compute, I observed that the instruction located at address 0x00007f4b1b7f8fd0 — specifically @!P6 CS2R.32 R0, SR_GLOBALTIMERLO — shows a high metric for short scoreboard stalls.
According to the documentation, short scoreboard stalls are typically attributed to memory operations involving shared memory or frequent execution of special math instructions. However, this particular instruction doesn’t appear to involve either of those cases.
Could you help me understand what might be causing these short scoreboard stalls in this context, and if there are any strategies to mitigate their impact on performance?

Only NVIDIA engineers can answer this authoritatively.

My assumption is that the current explanation “Other reasons include frequent execution of special math instructions (e.g. MUFU) or dynamic branching (e.g. BRX, JMX)” should be extended to include “or access to certain special registers”.

CS2R transfers data from a special register to a general purpose register, which at least in the case of the special register being a timer is presumably equivalent to reading from an I/O port on an x86 CPU and thus handled by the MIO of the GPU.

A way to minimize these stalls would therefore be to read the timer register less frequently.

1 Like

Although somewhat dated now, Scott Gray noted this, in his work with the maxas assembler:

“S2R is a variable latency (about 20 clocks) load like instruction.”

CS2R is a fixed latency instruction. S2R as state is a variable latency instruction.

The short scoreboards are likely from the DEPBAR which is waiting on a dependency. Some instructions execute then stall the warp resulting in the stall reason on the next instruction.

3 Likes

Thanks for the response! I have one follow-up question. When I compare the code that includes the profiler (i.e., with the instruction @!P6 CS2R.32 R0, SR_GLOBALTIMERLO ) to code that doesn’t use the profiler (i.e., doesn’t have the CS2R instruction), I notice that the DEPBAR instruction is only present in the former case. Does this imply that the CS2R instruction introduces an additional dependency on prior instruction (or at least that the compiler considers it so, possibly because it requires all previous instructions to complete before the global timer register can be sampled)?

The compiler is likely treating the timestamp read as an instruction fence requiring all work submitted before the timestamp to complete. The compiler has to apply some special logic to timestamp reads or they would be optimized to completely useless locations and not measure what one would expect.

2 Likes

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.