Stall reason "Long scoreboard" on instruction that does not even involve out-of-SM memory

Dear all,

I am currently profiling an application that does a lot of memory accesses. The basic pattern is:

  1. load data from global memory
  2. use data to perform lookups into a shared memory table
  3. accumulate the results
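In (simplified) code, the pattern is roughly the following; the kernel signature, table size and mask value are placeholders rather than my actual application code:

#include <cstdint>

// Placeholder values -- in the real code these are preprocessor constants.
#define DECODING_TABLE_SIZE 4096
#define INFORMATION_MASK    0x801f801fu

__global__ void decode_kernel(const uint32_t* __restrict__ input,
                              const uint32_t* __restrict__ table_src,
                              size_t n,
                              uint32_t* __restrict__ out)
{
    __shared__ uint32_t decoding_table[DECODING_TABLE_SIZE];

    // Fill the shared-memory lookup table once per block.
    for (unsigned i = threadIdx.x; i < DECODING_TABLE_SIZE; i += blockDim.x)
        decoding_table[i] = table_src[i];
    __syncthreads();

    uint32_t accum = 0;
    for (size_t i = size_t(blockIdx.x) * blockDim.x + threadIdx.x; i < n;
         i += size_t(gridDim.x) * blockDim.x)
    {
        uint32_t data  = input[i];                                         // 1. load from global memory
        uint32_t entry = decoding_table[data & (DECODING_TABLE_SIZE - 1)]; // 2. lookup in shared memory
        accum += entry & INFORMATION_MASK;                                 // 3. accumulate
    }
    out[size_t(blockIdx.x) * blockDim.x + threadIdx.x] = accum;
}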

I have already done a lot of optimizations (e.g., using memcpy_async to reduce scoreboard dependencies). However, I am still encountering some “Long scoreboard” stalls. Most of them are reported on the second of the following two lines:

entry = decoding_table[data & (DECODING_TABLE_SIZE - 1)];
information = entry & INFORMATION_MASK;

data has been loaded from shared memory beforehand, and both masks are preprocessor constants. Even so, the profiler reports a lot of “Long scoreboard” stalls on the second line.
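For reference, the memcpy_async staging mentioned above follows roughly this pattern (a simplified sketch using the cooperative-groups API; the chunk size and names are placeholders, not my real code):

#include <cstdint>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

constexpr int CHUNK = 1024;   // illustrative tile size, not the real one

__global__ void decode_kernel_async(const uint32_t* __restrict__ input,
                                    size_t n /* assumed to be a multiple of CHUNK */)
{
    __shared__ uint32_t staged[CHUNK];
    auto block = cg::this_thread_block();

    for (size_t base = size_t(blockIdx.x) * CHUNK; base < n;
         base += size_t(gridDim.x) * CHUNK)
    {
        // Kick off an asynchronous global->shared copy of the next chunk;
        // on CC 8.x this can lower to LDGSTS and bypass the register file.
        cg::memcpy_async(block, staged, input + base, sizeof(uint32_t) * CHUNK);

        // Wait for the copy to land and synchronize the block before the
        // decoding_table lookups read their data from staged[].
        cg::wait(block);

        // ... per-thread lookups into the shared decoding_table go here ...

        block.sync();   // everyone is done with staged[] before it is overwritten
    }
}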
The SASS for these two source lines looks as follows (the last instruction is the one the stalls are attributed to):

IMAD.SHL.U32 R4, R34, 0x4, RZ               // R4 = R34 << 2 (table index scaled to a byte offset)
LOP3.LUT R4, R4, 0xfffc, RZ, 0xc0, !PT      // R4 &= 0xfffc (mask the offset into the table)
LDS R30, [R4]                               // shared-memory load of the table entry
LOP3.LUT R4, R30, 0x801f801f, RZ, 0xc0, !PT // R4 = R30 & INFORMATION_MASK  <-- stalls reported here

If I understand correctly, a long-scoreboard stall on the last line means that R30 is not yet available. But since R30 is loaded from shared memory, which I would expect to show up as a “Short scoreboard” stall, I do not understand how this can be a “Long scoreboard”.
Any ideas on this? Am I overlooking something?

Am I assuming correctly that you are principally wondering about the fact that the stalls are reported as “long scoreboard” rather than some other type, and not the fact that there are stalls at all? If the code shown (which looks reasonable at the SASS level) is pretty much the body of the innermost loop in your kernel code (modulo the summing), it stands to reason that this code could be limited by SMEM throughput, and therefore encounters stalls on SMEM access.

You might get better or more detailed responses by posting this kind of question in the profiler subforum, and by naming the specific GPU on which this was observed, as details of the memory subsystem and of the HW event counters can differ between GPU architectures.

Thank you for your answer :). Your assumption is partially correct: I would like to eliminate the stalls, and for that it would be convenient to know what is actually causing them (since “Long scoreboard” did not make any sense to me here).

The code accesses shared memory a lot (with some bank conflicts), but its utilization is still reported as reasonably low by the profiler (less than 50%). The “summation” actually involves some math, so the most common stall reason used to be the math pipeline (until I did some optimizations, which led to the long-scoreboard stalls).

I am observing this behaviour on an RTX 3060 (CC 8.6). Is there any way to move this thread over to the profiler forum, or will I have to repost?

As far as I am aware, moderators can move posts between forums. I certainly cannot.

Hi @EmilSchaetzle I can move this topic over for you.

@EmilSchaetzle are you using the Nsight Compute profiler?

Could you post a screenshot of the tool’s Source page, to clarify exactly which stalls are reported on which line? You may include the surrounding code, too. It would also help if you could enable (or bring into view) the Register Dependencies column of the Source page, to make it easier to track the read/write register dependencies across the code.
