Performance metrics on 3080

mahmood.nt · February 10, 2021, 10:27pm

For a program running on an RTX 3080, some of the performance metrics look odd:

sm__warps_active.avg.pct_of_peak_sustained_active
achieved_occupancy
2.083333

smsp__average_inst_executed_per_warp.ratio
inst_per_warp
1598154.8

smsp__cycles_active.avg.pct_of_peak_sustained_elapsed
sm_efficiency
4.779219

smsp__inst_executed.avg.per_cycle_active
ipc
0.038303

As you can see, the IPC is very low, yet the number of executed instructions per warp is about 1.6M. The occupancy and multiprocessor efficiency are also low.

I cannot tell whether the SMs are busy executing instructions or not. The instructions-per-warp count is high, but the "per cycle" value is very low.

Any thoughts on that?

Greg · March 1, 2021, 5:15pm

From smsp__cycles_active.avg.pct_of_peak_sustained_elapsed you can determine that the SMSPs are active only 4.7% of the time. On a 3080 this could mean 68 SMs x 4 SMSP/SM x 4.7% = 12.7 SMSPs were busy for the full elapsed cycles. sm__warps_active.avg.pct_of_peak_sustained_active is 2.08%. The RTX 3080 supports a maximum of 48 warps/SM, and 1/48 = 2.08%. This helps us conclude that 1 warp per SM (== 1 SMSP) is active. So 12.7 SMs have 1 warp for the duration of the kernel.
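The arithmetic in the paragraph above can be reproduced with a short Python sketch. The hardware constants are the ones quoted in the reply and the percentages are copied from the original post; the variable names are only illustrative. Using the unrounded 4.779219% instead of 4.7% gives roughly 13 SMSP-equivalents rather than 12.7, which does not change the conclusion.

```python
# Worked arithmetic for the estimate above (values copied from the thread).
NUM_SMS = 68            # RTX 3080 SM count, as stated in the reply
SMSP_PER_SM = 4         # sub-partitions per SM, as stated in the reply
MAX_WARPS_PER_SM = 48   # maximum resident warps per SM, as stated in the reply

smsp_active_pct = 4.779219   # smsp__cycles_active.avg.pct_of_peak_sustained_elapsed
warps_active_pct = 2.083333  # sm__warps_active.avg.pct_of_peak_sustained_active

# Number of SMSPs that would have to be busy for the whole kernel
# to produce the reported average activity.
busy_smsp_equiv = NUM_SMS * SMSP_PER_SM * smsp_active_pct / 100.0

# Average number of resident warps on an SM while that SM is active.
warps_per_active_sm = MAX_WARPS_PER_SM * warps_active_pct / 100.0

print(f"SMSP-equivalents busy for the whole kernel: {busy_smsp_equiv:.1f}")
print(f"Average resident warps per active SM: {warps_per_active_sm:.2f}")
```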

When the SMSPs are active, an instruction is issued approximately every 26 cycles (1/smsp__inst_executed.avg.per_cycle_active).

You have not included enough information to draw any conclusions. Attaching a full report would be helpful. If you cannot attach a report, then I would look at
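As a rough check of the 26-cycle figure, the sketch below inverts the reported IPC. The second estimate, the number of active cycles one warp would need, is not from the thread: it assumes exactly one resident warp per active SMSP, so treat it as an order-of-magnitude guess only.

```python
# Values copied from the original post.
ipc_per_active_cycle = 0.038303  # smsp__inst_executed.avg.per_cycle_active
inst_per_warp = 1598154.8        # smsp__average_inst_executed_per_warp.ratio

# Average number of active cycles between two executed instructions.
cycles_per_instruction = 1.0 / ipc_per_active_cycle

# ASSUMPTION (not stated in the thread): a single warp is resident on each
# active SMSP, so that warp's lifetime is its instruction count times the
# average issue interval.
est_active_cycles_per_warp = inst_per_warp * cycles_per_instruction

print(f"Cycles between instructions: {cycles_per_instruction:.1f}")
print(f"Estimated active cycles per warp: {est_active_cycles_per_warp:.3g}")
```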

Kernel Dimensions
- launch__grid_size
- launch__block_size

There are two typical cases:

1. Launching very few threads per block and very few blocks. In this case the SM has no way to hide latency and 3/4 of the SM (3 SMSPs out of 4) is idle.
2. Launching sufficient work to saturate the GPU, but most of the warps EXIT immediately and ~12.7 warps continue for a very long time.

Case 1 is more likely given the smsp__average_inst_executed_per_warp.ratio (inst_per_warp) value.
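Once launch__grid_size and launch__block_size are known, a quick way to tell the two cases apart is to count the warps the launch creates and compare that with the number of SMSPs. The helper names and the threshold below are hypothetical and only a heuristic sketch, not something Nsight Compute reports; the example launch dimensions are made up, since the thread does not give the real ones.

```python
import math

WARP_SIZE = 32     # threads per warp
NUM_SMS = 68       # RTX 3080 SM count, as stated in the reply
SMSP_PER_SM = 4    # sub-partitions per SM, as stated in the reply


def launched_warps(grid_size: int, block_size: int) -> int:
    """Total warps in the launch.

    grid_size is taken to be the number of blocks and block_size the number
    of threads per block (assumed meaning of launch__grid_size and
    launch__block_size).
    """
    return grid_size * math.ceil(block_size / WARP_SIZE)


def looks_like_case_1(grid_size: int, block_size: int) -> bool:
    """Heuristic: fewer warps than SMSPs means the GPU cannot be filled."""
    return launched_warps(grid_size, block_size) < NUM_SMS * SMSP_PER_SM


# Hypothetical launch configurations, for illustration only.
print(looks_like_case_1(grid_size=13, block_size=32))     # tiny launch
print(looks_like_case_1(grid_size=4096, block_size=256))  # large launch
```

If the launch is large and the metrics still look like the ones in this thread, case 2 (most warps exiting early) becomes the more plausible explanation.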
