Reported IPC is too low

Hi,
For a program, I run the SOL (Speed of Light) section and see

Elapsed Cycles    cycle          89,399,980
Duration          msecond        62.09

Then I run it with the IPC metric and see

smsp__inst_executed.avg.per_cycle_active      inst/cycle    0.02
smsp__thread_inst_executed.sum                inst          10,838,017,568
gpu__time_active.avg                          msecond       62.08

So, the value of IPC is really low (0.02). However,

  1. If we divide the number of instructions by cycles (10,838,017,568/89,399,980) we get 121.23.

  2. If we divide thread instructions by 32 and then divide it by the cycles, we get 3.78.

  3. If we consider that the IPC metric is per SMSP, we can do 10,838,017,568/68/4 to get 39,845,652 thread instructions per SMSP, where 68 is the number of SMs in the 3080 and 4 is the number of partitions per SM. Then 39,845,652/89,399,980 yields 0.44 instructions per cycle for each SMSP.

  4. If we divide the per-SMSP thread instructions by 32, we get 39,845,652/32 = 1,245,176, and dividing that by 89,399,980 cycles gives 0.013 warp instructions per cycle per SMSP.

None of them is 0.02 as reported by smsp__inst_executed.avg.per_cycle_active.
So, I would like to know how the IPC metric is calculated.
My Nsight Compute version is 2020.3.0.0 (build 29307467).
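
To write out the arithmetic of the four attempts above as a small Python sketch (the 68 SMs and 4 partitions per SM are the RTX 3080 values used in (3); variable names are just for illustration):

    thread_inst = 10_838_017_568      # smsp__thread_inst_executed.sum
    cycles      = 89_399_980          # Elapsed Cycles
    num_sm      = 68                  # SMs on an RTX 3080
    smsp_per_sm = 4                   # partitions (SMSPs) per SM
    warp_size   = 32

    print(thread_inst / cycles)                                      # (1) ~121.23
    print(thread_inst / warp_size / cycles)                          # (2) ~3.79
    print(thread_inst / num_sm / smsp_per_sm / cycles)               # (3) ~0.446
    print(thread_inst / num_sm / smsp_per_sm / warp_size / cycles)   # (4) ~0.0139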

smsp__inst_executed.avg.per_cycle_active == smsp__inst_executed.avg / smsp__cycles_active.avg

This is a very low IPC value. For Volta+ this is 2% of the SMSP sustained rate.

Please note that smsp__cycles_active.avg <= sm__cycles_active.avg <= sm__cycles_elapsed.avg

  • An SMSP is active if at least one warp is active on the SMSP.
  • An SM is active if at least one warp is active on the SM. The other 3 SMSPs could be inactive during this period.
  • Elapsed cycles is the time from the start of performance counter collection to the end. If there is not enough work to saturate all SMs, then sm__cycles_active.avg may be much smaller than sm__cycles_elapsed.avg.

Your calculation (4) is close, but you used Elapsed Cycles instead of smsp__cycles_active.avg. It is very reasonable that smsp__cycles_active.avg / Elapsed Cycles could be ~40% different, increasing the value from 0.013 to 0.02.

I would recommend collecting

smsp__inst_executed.avg
smsp__cycles_active.avg

so you can manually do the calculation.


Thanks for the detailed response. You are right. When I check
smsp__inst_executed.avg
smsp__cycles_active.avg
I get the same value for IPC. However, according to sm_efficiency, the SM utilization and achieved occupancy are not very low.

    Metric Name                                           Metric Unit Minimum         Maximum         Average
    ----------------------------------------------------- ----------- --------------- --------------- ---------------
    sm__cycles_active.avg                                 cycle       78852663.250000 78852663.250000 78852663.250000
    sm__cycles_elapsed.avg                                cycle       89399770.323529 89399770.323529 89399770.323529
    sm__warps_active.avg.pct_of_peak_sustained_active     %           66.159006       66.159006       66.159006
    smsp__cycles_active.avg                               cycle       78852638.301471 78852638.301471 78852638.301471
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed %           88.202283       88.202283       88.202283
    smsp__inst_executed.avg                               inst        1591320.264706  1591320.264706  1591320.264706
    smsp__inst_executed.avg.per_cycle_active              inst/cycle  0.020181        0.020181        0.020181
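
For example, dividing the two values from the table above reproduces the reported IPC (a minimal sketch; the variable names are just mine):

    inst_per_smsp = 1_591_320.264706     # smsp__inst_executed.avg
    active_cycles = 78_852_638.301471    # smsp__cycles_active.avg
    print(inst_per_smsp / active_cycles)  # ~0.0202, matches smsp__inst_executed.avg.per_cycle_active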


I also ran with a larger input to increase the grid size; however, I get the same low IPC value while occupancy and SM efficiency have increased.

  _Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_, Block Size 128, Grid Size 125000, Device 0, 1 invocations
    Section: Command line profiler metrics
    Metric Name                                           Metric Unit Minimum            Maximum            Average
    ----------------------------------------------------- ----------- ------------------ ------------------ ------------------
    sm__cycles_active.avg                                 cycle       11643316070.750000 11643316070.750000 11643316070.750000
    sm__cycles_elapsed.avg                                cycle       11645691365.970589 11645691365.970589 11645691365.970589
    sm__warps_active.avg.pct_of_peak_sustained_active     %           83.239864          83.239864          83.239864
    smsp__cycles_active.avg                               cycle       11643316033.621323 11643316033.621323 11643316033.621323
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed %           99.979603          99.979603          99.979603
    smsp__inst_executed.avg                               inst        234979347.606618   234979347.606618   234979347.606618
    smsp__inst_executed.avg.per_cycle_active              inst/cycle  0.020181           0.020181           0.020181

A full report or a minimal reproducible example would be needed to provide more feedback. The high smsp__cycles_active% and sm__warps_active% indicate sufficient activity and sufficient warps to hide latency. The low smsp__inst_executed.avg.per_cycle_active indicates long dependencies. I would assume the kernel is stalled either on very long latency memory operations (see smsp__average_warps_issue_stalled_long_scoreboard_per_issue_active.ratio) or on barriers (smsp__average_warps_issue_stalled_barrier_per_issue_active.ratio), or both.


I have uploaded the binary file here. Following your suggestion on measuring stall reasons, I get this output

$ nv-nsight-cu-cli -k kernel --metrics smsp__inst_executed.avg,smsp__cycles_active.avg,smsp__inst_executed.avg.per_cycle_active,sm__cycles_active.avg,sm__cycles_elapsed.avg,smsp__cycles_active.avg.pct_of_peak_sustained_elapsed,sm__warps_active.avg.pct_of_peak_sustained_active,smsp__average_warps_issue_stalled_long_scoreboard_per_issue_active.ratio,smsp__average_warps_issue_stalled_barrier_per_issue_active.ratio,smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct,smsp__warp_issue_stalled_barrier_per_warp_active.pct,smsp__warp_issue_stalled_membar_per_warp_active.pct,smsp__warp_issue_stalled_short_scoreboard_per_warp_active.pct,smsp__warp_issue_stalled_wait_per_warp_active.pct,smsp__warp_issue_stalled_imc_miss_per_warp_active.pct,smsp__warp_issue_stalled_no_instruction_per_warp_active.pct,smsp__warp_issue_stalled_drain_per_warp_active.pct,smsp__warp_issue_stalled_lg_throttle_per_warp_active.pct,smsp__warp_issue_stalled_not_selected_per_warp_active.pct,smsp__warp_issue_stalled_dispatch_stall_per_warp_active.pct,smsp__warp_issue_stalled_misc_per_warp_active.pct,smsp__warp_issue_stalled_math_pipe_throttle_per_warp_active.pct,smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct,smsp__warp_issue_stalled_sleeping_per_warp_active.pct,smsp__warp_issue_stalled_tex_throttle_per_warp_active.pct,smsp__warp_issue_stalled_membar_per_warp_active.pct -f -o test ./lavaMD -boxes1d 10
thread block size of kernel = 128
Configuration used: boxes1d = 10
==PROF== Connected to process 21582 (/home/mahmood/suites/rodinia_3.1/cuda/lavaMD/lavaMD)
==PROF== Profiling "_Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_": 0%....50%....100% - 7 passes
Time spent in different stages of GPU_CUDA KERNEL:
 0.307810008526 s, 24.632268905640 % : GPU: SET DEVICE / DRIVER INIT
 0.000508000026 s,  0.040652327240 % : GPU MEM: ALO
 0.000915999990 s,  0.073302224278 % : GPU MEM: COPY IN
 0.939588010311 s, 75.189834594727 % : GPU: KERNEL
 0.000422000012 s,  0.033770240843 % : GPU MEM: COPY OUT
 0.000376999989 s,  0.030169147998 % : GPU MEM: FRE
Total time:
1.249621033669 s
==PROF== Disconnected from process 21582
$ nv-nsight-cu-cli --print-summary per-kernel -i test.ncu-rep
[21582] lavaMD@127.0.0.1
  _Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_, Block Size 128, Grid Size 1000, Device 0, 1 invocations
    Section: Command line profiler metrics
    Metric Name                                                              Metric Unit Minimum         Maximum         Average
    ------------------------------------------------------------------------ ----------- --------------- --------------- ---------------
    sm__cycles_active.avg                                                    cycle       78853092.779412 78853092.779412 78853092.779412
    sm__cycles_elapsed.avg                                                   cycle       89399795.911765 89399795.911765 89399795.911765
    sm__warps_active.avg.pct_of_peak_sustained_active                        %           66.147505       66.147505       66.147505
    smsp__average_warps_issue_stalled_barrier_per_issue_active.ratio         inst        0.692899        0.692899        0.692899
    smsp__average_warps_issue_stalled_long_scoreboard_per_issue_active.ratio inst        0.296693        0.296693        0.296693
    smsp__cycles_active.avg                                                  cycle       78852730.900735 78852730.900735 78852730.900735
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                    %           88.202361       88.202361       88.202361
    smsp__inst_executed.avg                                                  inst        1591320.264706  1591320.264706  1591320.264706
    smsp__inst_executed.avg.per_cycle_active                                 inst/cycle  0.020181        0.020181        0.020181
    smsp__warp_issue_stalled_barrier_per_warp_active.pct                     %           0.176174        0.176174        0.176174
    smsp__warp_issue_stalled_dispatch_stall_per_warp_active.pct              %           0.001157        0.001157        0.001157
    smsp__warp_issue_stalled_drain_per_warp_active.pct                       %           0.000005        0.000005        0.000005
    smsp__warp_issue_stalled_imc_miss_per_warp_active.pct                    %           0.000756        0.000756        0.000756
    smsp__warp_issue_stalled_lg_throttle_per_warp_active.pct                 %           0.000162        0.000162        0.000162
    smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct             %           0.075436        0.075436        0.075436
    smsp__warp_issue_stalled_math_pipe_throttle_per_warp_active.pct          %           0.000088        0.000088        0.000088
    smsp__warp_issue_stalled_membar_per_warp_active.pct                      %           0.000000        0.000000        0.000000
    smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct                %           0.000243        0.000243        0.000243
    smsp__warp_issue_stalled_misc_per_warp_active.pct                        %           0.000184        0.000184        0.000184
    smsp__warp_issue_stalled_no_instruction_per_warp_active.pct              %           0.016091        0.016091        0.016091
    smsp__warp_issue_stalled_not_selected_per_warp_active.pct                %           0.764977        0.764977        0.764977
    smsp__warp_issue_stalled_short_scoreboard_per_warp_active.pct            %           37.278718       37.278718       37.278718
    smsp__warp_issue_stalled_sleeping_per_warp_active.pct                    %           0.000000        0.000000        0.000000
    smsp__warp_issue_stalled_tex_throttle_per_warp_active.pct                %           60.509659       60.509659       60.509659
    smsp__warp_issue_stalled_wait_per_warp_active.pct                        %           0.809218        0.809218        0.809218

The memory and sync stall reasons are quite low. The execution dependency is in the mid range. The texture cache (L1?) stall reason is about 60%. I am not sure if the 60% and 37% are the only factors that cause such a low IPC. The IPC is really low. With respect to these numbers, do you think the IPC is reasonable?

One more question. On the Nsight page, I see these stall metrics:
stall_memory_dependency
smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct

stall_sync
smsp__warp_issue_stalled_barrier_per_warp_active.pct

But the names you mentioned are somewhat different. Is there any significant difference between them?

Assuming an RTX 3080:

SM Activity

  • smsp__cycles_active.avg.pct_of_peak_sustained_elapsed = 88%
  • For 12% of the elapsed cycles the average SMSP was not busy. This would indicate a tail effect of some sort.

SM Occupancy

  • Launch <<<1000, 128>>>
  • launch__waves_per_multiprocessor = 1.47
  • theoretical occupancy appears to be 10 blocks/SM = 40/48 warps = 83.3%
    • this is smsp__maximum_warps_avg_per_active_cycle = 10 (matches above)
  • achieved occupancy is sm__warps_active.avg.pct_of_peak_sustained_active = 66% ==> 32/48 warps
  • This achieved occupancy is expected given the 1.47 waves per multiprocessor (see the arithmetic below)
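
A rough check of the occupancy arithmetic above (a Python sketch; the 68 SMs and the 48-warp/SM hardware limit are assumptions consistent with the Launch Statistics and Occupancy sections later in the thread):

    blocks        = 1000        # Grid Size
    block_size    = 128         # Block Size
    sms           = 68          # RTX 3080
    blocks_per_sm = 10          # Block Limit Registers (the tightest limit)
    max_warps_sm  = 48          # hardware warp limit per SM on this GPU

    waves_per_sm      = blocks / (sms * blocks_per_sm)          # ~1.47 waves
    theoretical_warps = blocks_per_sm * block_size // 32        # 40 warps/SM
    theoretical_occ   = 100 * theoretical_warps / max_warps_sm  # ~83.3 %
    achieved_warps    = 0.66 * max_warps_sm                     # ~31.7 warps/SM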

Warp Stalls

  • 60% tex_throttle
  • 37% short_scoreboard

tex_throttle will be reported when the warp scheduler cannot issue an instruction to either the texture cache (a tex or surface operation) or to the FP64 pipe (on GeForce cards) due to backpressure from that pipe. You have not captured a full report, and the linked report does not have the instruction or pipe breakdown. For this I would look at the GPU Speed of Light section to determine whether the FP64 pipe or TEX is busy. It is possible you are latency bound on TEX, which is harder to detect.

short_scoreboard will be reported when a warp is stalled waiting on a return from one of the following pipes: LSU (for shared memory, special register, shuffle), CBU (branch resolution), XU (transcendentals), or FP64 (on your GPU).

LIMITED GUESS

Given the tex_throttle, short_scoreboard, and low IPC, I suspect you are doing a lot of FP64, which on the RTX 30 series has a throughput of 2 threads per cycle per SM. This would drop the SMSP IPC to 1 FP64 instruction per 64 cycles = 0.016. The FP64 unit is shared between the 4 SMSPs, and each warp instruction takes 16 cycles. Given round-robin issue between the SMSPs, each SMSP would be able to complete an FP64 instruction every 64 cycles.
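
Written out as arithmetic (a small Python sketch using only the rates stated above; the variable names are mine):

    fp64_threads_per_cycle_per_sm = 2        # RTX 30 series GeForce FP64 rate per SM
    warp_size    = 32
    smsps_per_sm = 4                         # the FP64 unit is shared by the 4 SMSPs
    cycles_per_fp64_warp_per_sm   = warp_size / fp64_threads_per_cycle_per_sm    # 16 cycles
    cycles_per_fp64_warp_per_smsp = cycles_per_fp64_warp_per_sm * smsps_per_sm   # 64 cycles
    print(1 / cycles_per_fp64_warp_per_smsp)  # ~0.016 warp instructions/cycle per SMSP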

If you capture a full report, which I highly recommend, I think you will find the FP64 unit at near 100%.


Thanks for that. However, the FP64 utilization seems to be normal. Please see the following results.
I wonder if there is a problem/bug with the way that IPC is reported.

$ nv-nsight-cu-cli -k kernel --metrics smsp__inst_executed.avg.per_cycle_active,smsp__inst_executed_pipe_fp64.avg.pct_of_peak_sustained_active,smsp__sass_thread_inst_executed_ops_dadd_dmul_dfma_pred_on.avg.pct_of_peak_sustained_elapsed ./lavaMD -boxes1d 10
thread block size of kernel = 128
Configuration used: boxes1d = 10
==PROF== Connected to process 30357 (/home/mnaderan/suites/rodinia_3.1/cuda/lavaMD/lavaMD)
==PROF== Profiling "_Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_": 0%....50%....100% - 3 passes
Time spent in different stages of GPU_CUDA KERNEL:
 0.295962005854 s, 42.142059326172 % : GPU: SET DEVICE / DRIVER INIT
 0.000716999988 s,  0.102093696594 % : GPU MEM: ALO
 0.000874000019 s,  0.124448947608 % : GPU MEM: COPY IN
 0.403941005468 s, 57.517200469971 % : GPU: KERNEL
 0.000426000013 s,  0.060658186674 % : GPU MEM: COPY OUT
 0.000376000011 s,  0.053538680077 % : GPU MEM: FRE
Total time:
0.702296018600 s
==PROF== Disconnected from process 30357
[30357] lavaMD@127.0.0.1
  _Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_, 2021-Mar-04 18:46:51, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    smsp__inst_executed.avg.per_cycle_active                                    inst/cycle                           0,02
    smsp__inst_executed_pipe_fp64.avg.pct_of_peak_sustained_active                       %                          21,62
    smsp__sass_thread_inst_executed_ops_dadd_dmul_dfma_pred_on.avg.pct_of_               %                          15,14
    peak_sustained_elapsed
    ---------------------------------------------------------------------- --------------- ------------------------------
$ nv-nsight-cu-cli -k kernel ./lavaMD -boxes1d 10
thread block size of kernel = 128
Configuration used: boxes1d = 10
==PROF== Connected to process 30379 (/home/mnaderan/suites/rodinia_3.1/cuda/lavaMD/lavaMD)
==PROF== Profiling "_Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_": 0%....50%....100% - 8 passes
Time spent in different stages of GPU_CUDA KERNEL:
 0.298682987690 s, 20.712144851685 % : GPU: SET DEVICE / DRIVER INIT
 0.000508000026 s,  0.035227213055 % : GPU MEM: ALO
 0.000913000025 s,  0.063311897218 % : GPU MEM: COPY IN
 1.141149044037 s, 79.132865905762 % : GPU: KERNEL
 0.000433999987 s,  0.030095688999 % : GPU MEM: COPY OUT
 0.000380000012 s,  0.026351064444 % : GPU MEM: FRE
Total time:
1.442067027092 s
==PROF== Disconnected from process 30379
[30379] lavaMD@127.0.0.1
  _Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_, 2021-Mar-04 18:47:45, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           9,24
    SM Frequency                                                             cycle/nsecond                           1,44
    Elapsed Cycles                                                                   cycle                     89.401.499
    Memory [%]                                                                           %                           0,93
    SOL DRAM                                                                             %                           0,03
    Duration                                                                       msecond                          62,08
    SOL L1/TEX Cache                                                                     %                           1,05
    SOL L2 Cache                                                                         %                           0,20
    SM Active Cycles                                                                 cycle                  78.852.295,93
    SM [%]                                                                               %                          76,27
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis report section to see
          what the compute pipelines are spending their time doing. Also, consider whether any computation is
          redundant and could be reduced or moved to look-up tables.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        128
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                       1.000
    Registers Per Thread                                                   register/thread                             46
    Shared Memory Configuration Size                                                 Kbyte                         102,40
    Driver Shared Memory Per Block                                             Kbyte/block                           1,02
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                             Kbyte/block                           7,20
    Threads                                                                         thread                        128.000
    Waves Per SM                                                                                                     1,47
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the
          target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical
          occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 319 thread blocks.
          Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for
          up to 50.0% of the total kernel runtime with a lower occupancy of 20.6%. Try launching a grid with no
          partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for
          a grid.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             16
    Block Limit Registers                                                            block                             10
    Block Limit Shared Mem                                                           block                             12
    Block Limit Warps                                                                block                             12
    Theoretical Active Warps per SM                                                   warp                             40
    Theoretical Occupancy                                                                %                          83,33
    Achieved Occupancy                                                                   %                          66,18
    Achieved Active Warps Per SM                                                      warp                          31,77
    ---------------------------------------------------------------------- --------------- ------------------------------
$ nv-nsight-cu-cli -k kernel --metrics smsp__sass_thread_inst_executed_op_conversion_pred_on.sum,smsp__sass_thread_inst_executed_op_memory_pred_on.sum,smsp__sass_thread_inst_executed_op_control_pred_on.sum,smsp__sass_thread_inst_executed_op_fp32_pred_on.sum,smsp__sass_thread_inst_executed_op_fp64_pred_on.sum,smsp__sass_thread_inst_executed_op_integer_pred_on.sum,smsp__sass_thread_inst_executed_op_inter_thread_communication_pred_on.sum,smsp__sass_thread_inst_executed_op_misc_pred_on.sum,smsp__issue_active.avg.pct_of_peak_sustained_active,smsp__inst_issued.avg.per_cycle_active,smsp__thread_inst_executed_per_inst_executed.ratio ./lavaMD -boxes1d 10
thread block size of kernel = 128
Configuration used: boxes1d = 10
==PROF== Connected to process 30407 (/home/mnaderan/suites/rodinia_3.1/cuda/lavaMD/lavaMD)
==PROF== Profiling "_Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_": 0%....50%....100% - 3 passes
Time spent in different stages of GPU_CUDA KERNEL:
 0.291828989983 s, 41.035057067871 % : GPU: SET DEVICE / DRIVER INIT
 0.000501999981 s,  0.070587903261 % : GPU MEM: ALO
 0.000921999977 s,  0.129645511508 % : GPU MEM: COPY IN
 0.417104989290 s, 58.650535583496 % : GPU: KERNEL
 0.000429000007 s,  0.060323130339 % : GPU MEM: COPY OUT
 0.000383000006 s,  0.053854916245 % : GPU MEM: FRE
Total time:
0.711170017719 s
==PROF== Disconnected from process 30407
[30407] lavaMD@127.0.0.1
  _Z15kernel_gpu_cuda7par_str7dim_strP7box_strP11FOUR_VECTORPdS4_, 2021-Mar-04 18:53:43, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    smsp__inst_issued.avg.per_cycle_active                                      inst/cycle                           0,02
    smsp__issue_active.avg.pct_of_peak_sustained_active                                  %                           2,02
    smsp__sass_thread_inst_executed_op_control_pred_on.sum                            inst                    790.896.192
    smsp__sass_thread_inst_executed_op_conversion_pred_on.sum                         inst                              0
    smsp__sass_thread_inst_executed_op_fp32_pred_on.sum                               inst                    219.520.000
    smsp__sass_thread_inst_executed_op_fp64_pred_on.sum                               inst                  7.244.416.000
    smsp__sass_thread_inst_executed_op_integer_pred_on.sum                            inst                  1.842.891.872
    smsp__sass_thread_inst_executed_op_inter_thread_communication_pred_on.            inst                              0
    sum
    smsp__sass_thread_inst_executed_op_memory_pred_on.sum                             inst                    706.616.512
    smsp__sass_thread_inst_executed_op_misc_pred_on.sum                               inst                      6.131.712
    smsp__thread_inst_executed_per_inst_executed.ratio                                                              25,04
    ---------------------------------------------------------------------- --------------- ------------------------------

The number of FP64 instructions is considerable. However, the utilization is not 100%.
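
Summing the op breakdown above shows how heavily FP64 dominates the thread-instruction mix even though the pipe is far from its peak (a quick sanity check in Python; names are mine):

    fp64  = 7_244_416_000
    total = (790_896_192 + 219_520_000 + 7_244_416_000
             + 1_842_891_872 + 706_616_512 + 6_131_712)   # op_*_pred_on.sum values above
    print(fp64 / total)   # ~0.67, roughly two thirds of all thread instructions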

The issue is likely the way the FP64 instructions are interleaved. I'm a little surprised that it is this low. If you run a full capture and look at the Source View, you should be able to identify why your IPC is low. I would expect all of the tex_throttle stalls before an FP64 instruction and the short_scoreboard stalls on the dependent instruction. In order to provide more help, you would need to post a minimal reproducible example.


I will check that. Thanks for the hints. I uploaded the binary and the command line in previous posts. Please find it here.