While profiling a kernel, I came across these two different metrics and I was wondering whether they are the same.

Because I am getting different results for them on the same kernel.

flop_sp_efficiency = flop_count / (elapsed_cycles_sm * SmPeakFlopPerCycle * NumSM)

SmPeakFlopPerCycle = FMA (2) * SubpartitionsPerSm (4) = 8

If a kernel was a large sequence of FADD then maximum flop_sp_efficiency is 50%.

If a kernel was a large sequence of FFMA the maximum flop_sp_efficiency is < 3% (1/32).

single_precision_fu_utilization is a percentage of cycles the FP32 pipes were active and the SM was active.

Unfortunately, the metric descriptions do not state that one is based on elapsed cycles and the other on active cycles.

flop_sp_efficiency is useful to compare against maximum throughput.

single_precision_fu_utilization is useful to determine if the SM is limited by FP32 throughput.

flop_sp_efficiency is <= single_precision_fu_utilization by definition.

If efficiency is low and utilization is high then

- not all SMs are active, or
- warps have inactive threads or predicated off threads, or
- kernel is executing lower op count FP32 instructions (FADD, FMUL),
- or some combination of the above.

Thanks Greg for your reply.

for this statement:

Does that mean it measures utilization for one SM only?

Because I have an experiment, which I asked a question about here, that might be explained by that. I would appreciate it if you could take a look at it.

https://devtalk.nvidia.com/default/topic/1041815/cuda-programming-and-performance/roofline-model-for-nvidia-gtx1080-/

No. Counters are summed across all SMs. If you launch 1 thread block on a 4 SM system the counter values could be

ffma_executed 1000, 0, 0, 0

active_cycles 1000, 0, 0, 0

elapsed_cycles_sm 1000, 1000, 1000, 1000

utilization = (1000 + 0 + 0 + 0) / (1000 + 0 + 0 + 0) * 100 = 100%

efficiency = ((1000 + 0 + 0 + 0) * 32 * 2) / ((1000 + 1000 + 1000 + 1000) * 32 * 2) * 100 = 25%
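The arithmetic above can be reproduced with a short Python sketch. This is only an illustration of the two formulas using the made-up counter values from this example. The helper names are hypothetical, and it assumes, as the example does, that each executed FFMA warp instruction accounts for one FP32-pipe-active cycle and that an SM's peak is one full-warp FFMA (32 * 2 flops) per cycle.

```python
# Per-SM counter values from the hypothetical 4-SM example above.
ffma_executed = [1000, 0, 0, 0]            # warp-level FFMA instructions
active_cycles = [1000, 0, 0, 0]            # cycles each SM had work
elapsed_cycles_sm = [1000, 1000, 1000, 1000]

THREADS_PER_WARP = 32
FLOPS_PER_FFMA = 2  # one multiply + one add per thread


def utilization_pct(pipe_active, active):
    """Pipe-active cycles over SM-active cycles, summed across SMs.

    Assumption: one FFMA warp instruction == one pipe-active cycle.
    """
    return 100.0 * sum(pipe_active) / sum(active)


def efficiency_pct(ffma, elapsed):
    """Achieved flops over peak flops, summed across SMs.

    Assumption: peak is one full-warp FFMA per SM per elapsed cycle.
    """
    flop_count = sum(ffma) * THREADS_PER_WARP * FLOPS_PER_FFMA
    peak_flops = sum(elapsed) * THREADS_PER_WARP * FLOPS_PER_FFMA
    return 100.0 * flop_count / peak_flops


utilization = utilization_pct(ffma_executed, active_cycles)
efficiency = efficiency_pct(ffma_executed, elapsed_cycles_sm)
print(utilization, efficiency)  # 100.0 25.0
```

The idle SMs contribute nothing to the utilization denominator (active cycles) but still contribute to the efficiency denominator (elapsed cycles), which is why the two numbers diverge.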

*If a kernel was a large sequence of FADD then maximum flop_sp_efficiency is 50%.*

*If a kernel was a large sequence of FFMA the maximum flop_sp_efficiency is < 3% (1/32).*

This doesn't seem to make sense. Based on the formula given earlier, a lengthy sequence of sufficiently independent `FFMA` instructions should lead to an efficiency near 100%. `FFMA` executes at the same rate as `FADD`, and comprises twice the number of floating-point operations.

In the formula

`flop_sp_efficiency = flop_count / (elapsed_cycles_sm * SmPeakFlopPerCycle * NumSM)`

`flop_count` has two variables per instruction:

- predicated true threads per instruction (0-32 per warp instruction)
- flops per thread per instruction (FADD = FMUL = 1, FFMA = 2)

`SmPeakFlopPerCycle` requires all FP32 instructions issued to be FFMA (2 flops per thread per instruction).

*If a kernel was a large sequence of FADD then maximum flop_sp_efficiency is 50%.*

If all 32 threads are active and predicated true then flop_count variable 1 is maximized.

If the instructions are all FADDs then the flop_count variable 2 is 1 which is 50% of the maximum.

The result is a maximum of 50% efficiency. Changing all FADD to FMA would result in 100%.

*If a kernel was a large sequence of FFMA the maximum flop_sp_efficiency is < 3% (1/32).*

If only 1 thread is predicated true (all 32 may be active) then flop_count variable 1 is 1/32 of its maximum.

If the instructions are all FMAs then flop_count variable 2 is maximized.

The result is a maximum efficiency of 1/32.
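To make the two cases concrete, here is a minimal Python sketch of the bound implied by the two flop_count variables. The function and constant names are hypothetical, and it assumes the best case otherwise: a single instruction type issuing at full rate on every SM for every elapsed cycle, so that only the predicated-true thread count and the flops per thread limit the result.

```python
THREADS_PER_WARP = 32
PEAK_FLOPS_PER_THREAD = 2  # peak assumes every instruction is an FFMA

# flop_count variable 2: flops per thread per instruction
FLOPS_PER_THREAD = {"FADD": 1, "FMUL": 1, "FFMA": 2}


def max_efficiency(opcode, predicated_true_threads):
    """Upper bound on flop_sp_efficiency (as a fraction) for a kernel made
    of one instruction type, assuming full-rate issue on all SMs."""
    if not 0 <= predicated_true_threads <= THREADS_PER_WARP:
        raise ValueError("predicated_true_threads must be in 0..32")
    # flop_count variable 1 * variable 2, per warp instruction
    flops = predicated_true_threads * FLOPS_PER_THREAD[opcode]
    peak = THREADS_PER_WARP * PEAK_FLOPS_PER_THREAD
    return flops / peak


fadd_full_warp = max_efficiency("FADD", 32)  # 0.5     -> 50%
ffma_full_warp = max_efficiency("FFMA", 32)  # 1.0     -> 100%
ffma_one_thread = max_efficiency("FFMA", 1)  # 0.03125 -> 1/32
print(fadd_full_warp, ffma_full_warp, ffma_one_thread)
```

The FADD case loses a factor of 2 on variable 2, while the single-thread FFMA case loses a factor of 32 on variable 1.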

Thanks for the clarification. Previously it wasn't clear to a casual reader like me that "(1/32)" refers to the case where only one out of 32 threads in a warp is predicated true.