I am using Nsight Compute to analyze performance bottlenecks in my program. See the screenshot below:
In case the image does not open, I am also providing the contents of the screenshot as text. First, here is the SASS code:
| Line | Address | SASS |
|---|---|---|
|168|00007f85 83b36b70| IMAD.WIDE.U32 R20, R30, R39, RZ |
|169|00007f85 83b36b80| MOV R29, 0x4 |
|170|00007f85 83b36b90| IMAD.WIDE.U32 R34, R12, R39, RZ |
|171|00007f85 83b36ba0| IMAD R13, R20, c[0x3][0x48], RZ |
|172|00007f85 83b36bb0| IMAD.WIDE.U32 R40, R16, R29, c[0x0][0x170] |
|173|00007f85 83b36bc0| IMAD.WIDE.U32 R50, R10, R39, RZ |
|174|00007f85 83b36bd0| LDG.E R22, [R40.64] |
|175|00007f85 83b36be0| IMAD.WIDE.U32 R34, P0, R13, -0x7af74000, R34 |
|176|00007f85 83b36bf0| IMAD.HI.U32 R57, P2, R13, 0x1, R20 |
|177|00007f85 83b36c00| IMAD.WIDE.U32 R42, R16, R29, c[0x0][0x178] |
|178|00007f85 83b36c10| IMAD.WIDE.U32 R52, R11, R39, RZ |
|179|00007f85 83b36c20| LDG.E R23, [R42.64] |
|180|00007f85 83b36c30| IMAD.WIDE.U32.X R40, P1, R13, 0x170b5d44, R50, P0 |
|181|00007f85 83b36c40| IADD3 R34, P0, R34, R57, RZ |
|182|00007f85 83b36c50| IMAD.WIDE.U32 R36, R16, R29, c[0x0][0x168] |
|183|00007f85 83b36c60| IMAD.WIDE.U32 R44, R16, R29, c[0x0][0x180] |
The sampling result of the instruction at line 180 is as follows:
Total Sample Count: 8482
75.65% Long Scoreboard (6417)
14.87% Math Pipe Throttle (1261)
5.35% Not Selected (454)
2.12% Wait (180)
1.26% Dispatch (107)
0.74% Selected (63)
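As a quick sanity check on the breakdown above, the percentages can be recomputed from the raw sample counts. This is only a sketch: the dictionary name `samples` is my own, and the counts are copied from the list above.

```python
# Sample counts per stall reason for the instruction at line 180,
# copied from the Nsight Compute breakdown above.
samples = {
    "Long Scoreboard": 6417,
    "Math Pipe Throttle": 1261,
    "Not Selected": 454,
    "Wait": 180,
    "Dispatch": 107,
    "Selected": 63,
}

# The individual counts should add up to the reported total of 8482.
total = sum(samples.values())

# Share of each stall reason in percent, rounded to two decimals.
shares = {reason: round(100 * count / total, 2) for reason, count in samples.items()}

for reason, share in shares.items():
    print(f"{share:6.2f}% {reason} ({samples[reason]})")
```

The counts sum to the reported total, so the six stall reasons account for every sample on this instruction.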
I was surprised to see so many Long Scoreboard samples on this SASS instruction. Its register inputs, R13 and the pair R50/R51, are all outputs of earlier arithmetic instructions: R13 is written by the instruction at line 171, and R50 and R51 are written by the instruction at line 173. The carry-in predicate P0 is likewise written by the IMAD at line 175. None of these values comes from a memory load, so I would not expect any Long Scoreboard samples here.
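A minimal Python sketch for double-checking this reading of the listing: it scans backwards from line 180 and reports the most recent instruction that writes each input. The parsing is deliberately simplified and rests on my own assumptions: the first operand is the destination, a `.WIDE` opcode writes a register pair (Rn and Rn+1), and a predicate directly after the destination is a carry-out. The helper names `written` and `producer` are hypothetical and not part of any NVIDIA tool. P0 is included because line 180 also reads it as the carry-in.

```python
import re

# SASS listing from the question, lines 168-180: (line number, instruction).
SASS = [
    (168, "IMAD.WIDE.U32 R20, R30, R39, RZ"),
    (169, "MOV R29, 0x4"),
    (170, "IMAD.WIDE.U32 R34, R12, R39, RZ"),
    (171, "IMAD R13, R20, c[0x3][0x48], RZ"),
    (172, "IMAD.WIDE.U32 R40, R16, R29, c[0x0][0x170]"),
    (173, "IMAD.WIDE.U32 R50, R10, R39, RZ"),
    (174, "LDG.E R22, [R40.64]"),
    (175, "IMAD.WIDE.U32 R34, P0, R13, -0x7af74000, R34"),
    (176, "IMAD.HI.U32 R57, P2, R13, 0x1, R20"),
    (177, "IMAD.WIDE.U32 R42, R16, R29, c[0x0][0x178]"),
    (178, "IMAD.WIDE.U32 R52, R11, R39, RZ"),
    (179, "LDG.E R23, [R42.64]"),
    (180, "IMAD.WIDE.U32.X R40, P1, R13, 0x170b5d44, R50, P0"),
]


def written(instr):
    """Registers and predicates written by one instruction (simplified rules)."""
    opcode, operands = instr.split(None, 1)
    ops = [op.strip() for op in operands.split(",")]
    out = set()
    match = re.fullmatch(r"R(\d+)", ops[0])
    if match:
        n = int(match.group(1))
        out.add(f"R{n}")
        if ".WIDE" in opcode:
            # Assumption: a 64-bit result occupies Rn and Rn+1.
            out.add(f"R{n + 1}")
    if len(ops) > 1 and re.fullmatch(r"P\d", ops[1]):
        # Assumption: a predicate right after the destination is a carry-out.
        out.add(ops[1])
    return out


def producer(name, before_line):
    """Most recent (line, opcode) before `before_line` that writes `name`."""
    for line, instr in reversed(SASS):
        if line < before_line and name in written(instr):
            return line, instr.split()[0]
    return None


# Inputs of the instruction at line 180: R13, the pair R50/R51, and carry-in P0.
producers = {name: producer(name, 180) for name in ("R13", "R50", "R51", "P0")}
for name, source in producers.items():
    print(name, "<-", source)
```

Under these assumptions, every input of line 180 traces back to an IMAD at line 171, 173, or 175, and none traces back to an LDG.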
Remote OS: Ubuntu 18.04.6 LTS
CUDA Toolkit on remote OS: Cuda compilation tools, release 12.1, V12.1.105, Build cuda_12.1.r12.1/compiler.32688072_0
Local OS: macOS 13.4.1 (c) (22F770820d)
Local Nsight Compute: 2023.2.2.0 (build 33188574)