Very strange sampling data in Nsight Compute

I am using Nsight Compute to analyze program performance bottlenecks. See the screenshot below:

To prevent issues with image opening, I am providing the contents of the screenshot in text format. First, here is the SASS code:

|168|00007f85 83b36b70|      IMAD.WIDE.U32 R20, R30, R39, RZ |
|169|00007f85 83b36b80|      MOV R29, 0x4 |
|170|00007f85 83b36b90|      IMAD.WIDE.U32 R34, R12, R39, RZ |
|171|00007f85 83b36ba0|      IMAD R13, R20, c[0x3][0x48], RZ |
|172|00007f85 83b36bb0|      IMAD.WIDE.U32 R40, R16, R29, c[0x0][0x170] |
|173|00007f85 83b36bc0|      IMAD.WIDE.U32 R50, R10, R39, RZ |
|174|00007f85 83b36bd0|      LDG.E R22, [R40.64] |
|175|00007f85 83b36be0|      IMAD.WIDE.U32 R34, P0, R13, -0x7af74000, R34 |
|176|00007f85 83b36bf0|      IMAD.HI.U32 R57, P2, R13, 0x1, R20 |
|177|00007f85 83b36c00|      IMAD.WIDE.U32 R42, R16, R29, c[0x0][0x178] |
|178|00007f85 83b36c10|      IMAD.WIDE.U32 R52, R11, R39, RZ |
|179|00007f85 83b36c20|      LDG.E R23, [R42.64] |
|180|00007f85 83b36c30|      IMAD.WIDE.U32.X R40, P1, R13, 0x170b5d44, R50, P0 |
|181|00007f85 83b36c40|      IADD3 R34, P0, R34, R57, RZ |
|182|00007f85 83b36c50|      IMAD.WIDE.U32 R36, R16, R29, c[0x0][0x168] |
|183|00007f85 83b36c60|      IMAD.WIDE.U32 R44, R16, R29, c[0x0][0x180] |

The sampling result of the instruction at line 180 is as follows:

Total Sample Count: 8482
75.65% Long Scoreboard (6417)
14.87% Math Pipe Throttle (1261)
5.35% Not Selected (454)
2.12% Wait (180)
1.26% Dispatch (107)
0.74% Selected (63)
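
As a sanity check that the breakdown is internally consistent, the percentages can be recomputed from the raw counts. This is a minimal Python sketch that uses only the numbers listed above; the variable names are my own.

```python
# Sample counts per stall reason for the instruction at line 180,
# copied from the Nsight Compute tooltip above.
counts = {
    "Long Scoreboard": 6417,
    "Math Pipe Throttle": 1261,
    "Not Selected": 454,
    "Wait": 180,
    "Dispatch": 107,
    "Selected": 63,
}

total = sum(counts.values())

# Share of each stall reason, in percent, rounded to two decimals.
shares = {reason: round(100 * n / total, 2) for reason, n in counts.items()}

print(total)
for reason, pct in shares.items():
    print(f"{pct:.2f}% {reason}")
```

The counts add up to the reported total of 8482 samples, so no stall reason is missing from the list.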

I was surprised to see so many Long Scoreboard samples on this line of SASS. The three source registers of this instruction, R13, R50, and R51, are all outputs of earlier arithmetic instructions: R13 is written by the instruction at line 171, and R50 and R51 are written by the instruction at line 173. None of them is the result of a memory load, so I would not expect any Long Scoreboard samples here.

Remote OS: Ubuntu 18.04.6 LTS
CUDA toolkit on remote OS: Cuda compilation tools, release 12.1, V12.1.105, Build cuda_12.1.r12.1/compiler.32688072_0
Local OS: macOS 13.4.1 (c) (22F770820d)
Local Nsight Compute: 2023.2.2.0 (build 33188574)

It is probably best to ask about such specific issues with Nsight Compute in the subforum dedicated to it, because that is where the Nsight experts presumably hang out.

Independent of the assignment of stalls to particular instructions, a high prevalence of Long Scoreboard stalls in this code region seems entirely plausible: such stalls indicate that hardware is waiting for memory to return data, and the code region shown is memory bound with about 20% of the instructions being loads.

Perhaps I misunderstand how the sampling data is collected. According to the description in Warp Sampling, the sampler records, at a fixed interval, the PC of a randomly selected active warp together with the state of the warp scheduler. For a warp scheduler that is responsible for multiple warps, how is that state determined? If it is the state of the warp the sampler selected, then a warp whose PC is at the SASS instruction on line 180 should not be in a Long Scoreboard stall: all three of its input registers are outputs of earlier arithmetic instructions, so it has no data from global memory to wait for.
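
To make the question concrete, here is a toy Python model of how I understand fixed-interval sampling, reduced to a single warp. It is only a sketch under my own assumptions: the timeline, the 400-cycle stall, and the 32-cycle interval are made-up numbers, and the real sampler picks among many warps. The point is only that every sample taken while the warp waits is attributed to the PC of the instruction that cannot issue yet, with that instruction's stall reason.

```python
from collections import Counter

def sample_timeline(timeline, interval):
    """Sample a single warp every `interval` cycles.

    timeline: list of (pc, state, cycles) entries, meaning the warp sits
    at `pc` in `state` for `cycles` cycles. Returns a Counter keyed by
    (pc, state).
    """
    samples = Counter()
    cycle = 0
    next_sample = 0
    for pc, state, cycles in timeline:
        end = cycle + cycles
        # Every sampling tick that falls inside this entry is charged to it.
        while next_sample < end:
            samples[(pc, state)] += 1
            next_sample += interval
        cycle = end
    return samples

# Hypothetical timeline (all durations invented for illustration).
timeline = [
    (179, "Selected", 1),           # the load issues
    (180, "Long Scoreboard", 400),  # line 180 cannot issue yet
    (180, "Selected", 1),           # line 180 finally issues
    (181, "Wait", 4),
]

result = sample_timeline(timeline, interval=32)
for (pc, state), n in sorted(result.items()):
    print(pc, state, n)
```

In this model nearly all samples land on line 180 even though line 180 itself is not a load, which matches the shape of the data I am seeing. What I do not understand is which dependency makes line 180 wait.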

Can you share which GPU you’re running on? There can be differences depending on the chip. One thing that could be happening is that the R40 read on line 174 hasn’t completed before it needs to be written on line 180.
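
The kind of dependency described above (a write-after-read hazard on a load's address register) can be looked for mechanically in the listing. The following Python sketch is only an illustration under simplifying assumptions: the helper names are hypothetical, a `.WIDE` opcode is assumed to write a register pair, a `.64` address is assumed to read a register pair, and loads are never treated as retired. It says nothing about how the hardware actually tracks the dependency.

```python
import re

# (listing line, SASS text) copied from the SASS listing in the question.
LISTING = [
    (168, "IMAD.WIDE.U32 R20, R30, R39, RZ"),
    (169, "MOV R29, 0x4"),
    (170, "IMAD.WIDE.U32 R34, R12, R39, RZ"),
    (171, "IMAD R13, R20, c[0x3][0x48], RZ"),
    (172, "IMAD.WIDE.U32 R40, R16, R29, c[0x0][0x170]"),
    (173, "IMAD.WIDE.U32 R50, R10, R39, RZ"),
    (174, "LDG.E R22, [R40.64]"),
    (175, "IMAD.WIDE.U32 R34, P0, R13, -0x7af74000, R34"),
    (176, "IMAD.HI.U32 R57, P2, R13, 0x1, R20"),
    (177, "IMAD.WIDE.U32 R42, R16, R29, c[0x0][0x178]"),
    (178, "IMAD.WIDE.U32 R52, R11, R39, RZ"),
    (179, "LDG.E R23, [R42.64]"),
    (180, "IMAD.WIDE.U32.X R40, P1, R13, 0x170b5d44, R50, P0"),
    (181, "IADD3 R34, P0, R34, R57, RZ"),
    (182, "IMAD.WIDE.U32 R36, R16, R29, c[0x0][0x168]"),
    (183, "IMAD.WIDE.U32 R44, R16, R29, c[0x0][0x180]"),
]

def dest_regs(text):
    """Registers written by an instruction (assumption: first operand is
    the destination, and .WIDE writes the pair Rn, Rn+1)."""
    opcode, operands = text.split(None, 1)
    m = re.fullmatch(r"R(\d+)", operands.split(",")[0].strip())
    if not m:
        return set()
    n = int(m.group(1))
    return {n, n + 1} if ".WIDE" in opcode else {n}

def load_address_regs(text):
    """Registers an LDG reads to form its address ([Rn.64] reads Rn, Rn+1)."""
    if not text.split(None, 1)[0].startswith("LDG"):
        return set()
    m = re.search(r"\[R(\d+)(\.64)?", text)
    if not m:
        return set()
    n = int(m.group(1))
    return {n, n + 1} if m.group(2) else {n}

def find_war_on_loads(listing):
    """Return (load line, writer line, registers) for every later
    instruction that overwrites an earlier load's address registers."""
    hazards, pending = [], []
    for line, text in listing:
        written = dest_regs(text)
        for load_line, addr in pending:
            overlap = written & addr
            if overlap:
                hazards.append((load_line, line, sorted(overlap)))
        addr = load_address_regs(text)
        if addr:
            pending.append((line, addr))
    return hazards

print(find_war_on_loads(LISTING))  # [(174, 180, [40, 41])]
```

On this listing the only pair reported is the load at line 174 and the write at line 180, both involving R40/R41. Whether that is what the Long Scoreboard samples reflect would still need to be confirmed from the report.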

Sorry, I forgot to say which GPU I’m using. It is a GeForce RTX 3080.

Any update?

Just from the screenshots and GPU, I can’t tell if that is definitely causing the issue. Would you be able to attach the Nsight Compute report here for further analysis?

I cannot upload .ncu-rep files here directly. After I click ‘upload’, the forum reports that only authorized file formats can be uploaded. I have therefore uploaded the .ncu-rep file to another website, and here is the link to the download page:
ncu-profile.ncu-rep Download page
This profile was obtained by rerunning the test on Ubuntu 20.04, and it shows the same behavior.