For a program running on an RTX 3080, some of the performance metrics look weird.
As you can see, the IPC is very low, yet the number of executed instructions per warp is 1.5M. The occupancy and multiprocessor efficiency are also low.
I can't tell whether the SMs are busy executing instructions or not. While the instructions per warp are high, the “per cycle” value is really low.
Any thoughts on that?
From smsp__cycles_active.avg.pct_of_peak_sustained_elapsed you can determine that the SMSPs are active only 4.7% of the time. On a 3080 this could mean 68 SMs x 4 SMSP/SM x 4.7% = 12.7 SMSPs were busy for the full elapsed cycles. sm__warps_active.avg.pct_of_peak_sustained_active is 2.08%. The RTX 3080 supports a maximum of 48 warps/SM, and 1/48 = 2.08%, which helps us conclude that 1 warp per SM (== 1 SMSP) is active. So 12.7 SMs have 1 warp for the duration of the kernel.
When the SMSPs are active, an instruction is issued approximately every 26 cycles (1/smsp__inst_executed.avg.per_cycle_active).
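To make the arithmetic above easy to re-run with your own numbers, here is a small sketch in Python. The variable names are mine, and the inputs are only the values quoted in this thread (the 26 cycles figure is the approximate one given above, not a value read from a report). The busy-SMSP product comes out at about 12.8, which is quoted as 12.7 above.

```python
# Back-of-the-envelope check of the figures quoted in this thread.
SM_COUNT = 68            # SMs on an RTX 3080
SMSP_PER_SM = 4          # sub-partitions (warp schedulers) per SM
MAX_WARPS_PER_SM = 48    # maximum resident warps per SM on this GPU

smsp_active_pct = 4.7    # smsp__cycles_active.avg.pct_of_peak_sustained_elapsed
warps_active_pct = 2.08  # sm__warps_active.avg.pct_of_peak_sustained_active
cycles_per_inst = 26     # approx. 1 / smsp__inst_executed.avg.per_cycle_active

total_smsps = SM_COUNT * SMSP_PER_SM
# Equivalent number of SMSPs that were busy for the whole kernel duration.
busy_smsps = total_smsps * smsp_active_pct / 100
# Average number of resident warps on an SM while that SM is active.
warps_per_active_sm = MAX_WARPS_PER_SM * warps_active_pct / 100
# Instructions issued per cycle by an SMSP while it is active.
ipc_when_active = 1 / cycles_per_inst

print(f"total SMSPs:          {total_smsps}")
print(f"busy SMSPs:           {busy_smsps:.1f}")
print(f"warps per active SM:  {warps_per_active_sm:.2f}")
print(f"IPC per active SMSP:  {ipc_when_active:.3f}")
```

Swap in the values from your own report to see whether the same picture (about one warp on a handful of SMs) holds.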
You have not included enough information to make any conclusions. Attaching a full report would be helpful. If you cannot attach a report then I would look at
- Kernel Dimensions
There are two typical cases:
- Launching very few threads per block and very few blocks. In this case the SM has no way to hide latency and 3/4 of the SM (3 SMSPs out of 4) is idle.
- Launching sufficient work to saturate the GPU, but most of the warps EXIT immediately and ~12.7 warps continue for a very long time.
Case 1 is more likely given the smsp__average_inst_executed_per_warp.ratio_inst_per_warp value.
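As a rough illustration of case 1, the sketch below estimates how much of the GPU a small launch can touch. It is a simplified model, not what the hardware scheduler actually does: it assumes at most one block lands on each SM and that a block's warps are spread evenly over the 4 SMSPs. The function name and the 13 blocks x 32 threads launch are hypothetical, chosen only because they resemble the profile discussed above.

```python
import math

WARP_SIZE = 32
SM_COUNT = 68      # RTX 3080
SMSP_PER_SM = 4


def estimate_small_launch(num_blocks, threads_per_block):
    """Estimate GPU coverage for a launch with no more blocks than SMs.

    Assumptions (simplified model): one block per SM, and the warps of
    a block are spread evenly across that SM's sub-partitions.
    """
    if num_blocks > SM_COUNT:
        raise ValueError("model only covers launches with <= 1 block per SM")
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    sms_with_work = num_blocks
    busy_smsps_per_sm = min(warps_per_block, SMSP_PER_SM)
    return {
        "warps_per_block": warps_per_block,
        "sms_with_work": sms_with_work,
        "idle_sm_fraction": 1 - sms_with_work / SM_COUNT,
        "idle_smsp_fraction": 1 - busy_smsps_per_sm / SMSP_PER_SM,
    }


# Hypothetical launch: 13 blocks of 32 threads (one warp per block).
result = estimate_small_launch(13, 32)
for key, value in result.items():
    print(f"{key}: {value}")
```

With one warp per block, each SM that has work keeps a single SMSP busy and has nothing else to switch to while that warp waits, which is consistent with the low IPC and low occupancy reported here. Comparing this against your actual grid and block dimensions should tell you quickly whether case 1 applies.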