As you can see, the IPC is very low. On the other hand, the number of executed instructions per warp is 1.5M. The occupancy and multiprocessor efficiency are also low.
I cannot tell whether the SMs are busy executing instructions or not. While instructions per warp are high, the “per cycle” value is really low.
From smsp__cycles_active.avg.pct_of_peak_sustained_elapsed you can determine that the SMSPs are active only 4.7% of the time. On an RTX 3080 this could mean 68 SMs x 4 SMSP/SM x 4.7% = 12.7 SMSPs were busy for the full elapsed cycles. sm__warps_active.avg.pct_of_peak_sustained_active is 2.08%. The RTX 3080 supports a maximum of 48 warps/SM, and 1/48 is about 2.08%, so this lets us conclude that each active SM has 1 warp (== 1 active SMSP). So roughly 12.7 SMs have 1 warp each for the duration of the kernel.
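The arithmetic above can be reproduced with a small script. This is only a sketch: the function names are hypothetical, and the hardware figures (68 SMs, 4 SMSPs per SM, 48 warps per SM) are the RTX 3080 values quoted in this post, so adjust them for a different GPU.

```python
# RTX 3080 figures as quoted in the post (assumptions for any other GPU).
SM_COUNT = 68
SMSP_PER_SM = 4
MAX_WARPS_PER_SM = 48


def busy_smsp_equivalents(cycles_active_pct, sm_count=SM_COUNT, smsp_per_sm=SMSP_PER_SM):
    """Number of SMSPs that would be busy for the whole kernel, given
    smsp__cycles_active.avg.pct_of_peak_sustained_elapsed in percent."""
    return sm_count * smsp_per_sm * cycles_active_pct / 100.0


def active_warps_per_sm(warps_active_pct, max_warps_per_sm=MAX_WARPS_PER_SM):
    """Average resident warps per active SM, given
    sm__warps_active.avg.pct_of_peak_sustained_active in percent."""
    return max_warps_per_sm * warps_active_pct / 100.0


busy = busy_smsp_equivalents(4.7)   # 4.7% taken from the report
warps = active_warps_per_sm(2.08)   # 2.08% taken from the report

print(f"SMSPs busy for the full kernel: {busy:.1f} of {SM_COUNT * SMSP_PER_SM}")
print(f"Warps per active SM: {warps:.2f}")
```

With the rounded percentages shown in the report the first figure lands between 12 and 13 SMSPs and the second is within rounding of 1 warp per SM.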
When the SMSPs are active, an instruction is issued approximately every 26 cycles (1 / smsp__inst_executed.avg.per_cycle_active).
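As a sketch of that conversion: the helper name is hypothetical, and the example input is back-derived from the "about 26 cycles" figure rather than read from the report, which is not attached.

```python
def cycles_between_issues(inst_per_cycle_active):
    """Average number of active cycles between issued instructions on an SMSP,
    i.e. the reciprocal of smsp__inst_executed.avg.per_cycle_active."""
    if inst_per_cycle_active <= 0:
        raise ValueError("inst_per_cycle_active must be positive")
    return 1.0 / inst_per_cycle_active


# Hypothetical input: an IPC of 1/26 corresponds to one issue every 26 cycles.
example_ipc = 1.0 / 26.0
interval = cycles_between_issues(example_ipc)
print(f"One instruction every {interval:.0f} active cycles")
```

An SMSP can issue at most one instruction per cycle, so an interval of this size means the single resident warp spends most of its time stalled.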
You have not included enough information to draw any firm conclusions. Attaching a full report would be helpful. If you cannot attach a report, then I would look at:
Kernel Dimensions
launch__grid_size
launch__block_size
There are two typical cases:
1. Launching very few threads per block and very few blocks. In this case the SM has no way to hide latency and 3/4 of the SM (3 SMSPs out of 4) is idle.
2. Launching sufficient work to saturate the GPU, but most of the warps EXIT immediately and ~12.7 warps continue for a very long time.
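Case 1 is more likely given the smsp__average_inst_executed_per_warp.ratio_inst_per_warp value: if most warps exited immediately, the average number of instructions per warp would be far lower than the 1.5M reported.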
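Once launch__grid_size and launch__block_size are known, a rough check for case 1 could look like the sketch below. The function name and the example launch (13 blocks of 32 threads) are hypothetical and are not taken from the report. The SM and SMSP counts are the RTX 3080 values used above. The result is only an upper bound on how many SMSPs can have a warp, because it says nothing about how the hardware distributes blocks across SMs.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def launch_summary(grid_size, block_size, sm_count=68, smsp_per_sm=4):
    """Summarize how much of the GPU a launch can occupy at best.

    grid_size  : total number of thread blocks (launch__grid_size)
    block_size : threads per block (launch__block_size)
    """
    warps_per_block = math.ceil(block_size / WARP_SIZE)
    total_warps = grid_size * warps_per_block
    total_smsp = sm_count * smsp_per_sm
    return {
        "warps_per_block": warps_per_block,
        "total_warps": total_warps,
        # Each warp lives on exactly one SMSP, so this is an upper bound.
        "max_smsp_with_work": min(total_warps, total_smsp),
        "fewer_warps_than_smsp": total_warps < total_smsp,
    }


# Hypothetical launch: 13 blocks of 32 threads each.
summary = launch_summary(grid_size=13, block_size=32)
for key, value in summary.items():
    print(f"{key}: {value}")
```

If fewer_warps_than_smsp is true, the launch cannot fill the GPU regardless of how the kernel itself behaves, which points to case 1 rather than case 2.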