Is the nvprof metric 'flop_dp_efficiency' reported per SMX or for the entire GPU?


I am comparing GPU and CPU results using the profiler, but the floating-point throughput reported by nvprof seems to be much lower than the value I estimate by comparing elapsed times on the GPU and CPU.

The metrics I use are flop_dp_efficiency and flop_sp_efficiency.

The NVIDIA Profiler User’s Guide (v6.5) describes flop_dp_efficiency as ‘Ratio of achieved to peak double-precision floating-point operations’. My question is whether the reported value is per SMX or for the entire GPU.

I am running on a K20c, which has 13 SMXs. Should the values be multiplied by 13?

I would really appreciate any comments.

Given that this metric is a ratio, why would you expect it to be different depending on whether it is computed on a per-SMX vs a per-GPU basis? Modulo a certain amount of load imbalance between the SMXs, I would expect either metric to track the other quite closely.
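To make the point about ratios concrete, here is a small arithmetic sketch. The per-SMX throughput numbers are hypothetical placeholders chosen for illustration, not measurements, and the sketch assumes the load is spread evenly across all 13 SMXs.

```python
# Sketch: an achieved/peak ratio is the same per SMX and per GPU
# when the work is evenly balanced. All throughput values below are
# hypothetical placeholders, not measured K20c data.

NUM_SMX = 13                # K20c, as stated in the question

peak_per_smx = 90.0         # hypothetical peak, GFLOP/s per SMX
achieved_per_smx = 18.0     # hypothetical achieved, GFLOP/s per SMX

# Ratio computed for a single SMX.
per_smx_ratio = achieved_per_smx / peak_per_smx

# Ratio computed for the whole GPU: both numerator and denominator
# scale by the same factor, so the factor cancels.
per_gpu_ratio = (NUM_SMX * achieved_per_smx) / (NUM_SMX * peak_per_smx)

print(f"per-SMX ratio: {per_smx_ratio:.3f}")
print(f"per-GPU ratio: {per_gpu_ratio:.3f}")
```

Multiplying the reported efficiency by 13 would therefore only make sense if the numerator were a whole-GPU quantity and the denominator a single-SMX quantity, which is not what a ratio of achieved to peak suggests.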

Execution time on either CPU or GPU may be limited by factors other than floating-point throughput, e.g. memory throughput, so the performance ratio of the two platforms may not be reflective of their respective floating-point throughput.
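As a rough illustration of how memory throughput can cap floating-point efficiency, the following sketch applies a simple roofline-style bound. The peak and bandwidth figures are assumed nominal K20c values (about 1170 GFLOP/s double precision and 208 GB/s), and the DAXPY-like kernel is a hypothetical example, not the code from the question.

```python
# Sketch of a roofline-style bound: attainable throughput is limited by
# either the floating-point peak or by memory bandwidth times arithmetic
# intensity. The hardware numbers are assumed nominal values.

PEAK_DP_GFLOPS = 1170.0     # assumed nominal DP peak, GFLOP/s
MEM_BW_GBS = 208.0          # assumed nominal memory bandwidth, GB/s

def attainable_gflops(flops_per_byte):
    """Upper bound on throughput for a given arithmetic intensity."""
    return min(PEAK_DP_GFLOPS, flops_per_byte * MEM_BW_GBS)

# Hypothetical DAXPY-like kernel, y[i] = a * x[i] + y[i]:
# 2 flops per element, 24 bytes moved (read x, read y, write y, 8 bytes each).
intensity = 2.0 / 24.0

bound = attainable_gflops(intensity)
efficiency_pct = 100.0 * bound / PEAK_DP_GFLOPS

print(f"attainable: {bound:.1f} GFLOP/s ({efficiency_pct:.1f}% of peak)")
```

Under these assumptions a purely streaming kernel could not exceed a few percent of peak floating-point throughput, regardless of how well it is written.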

It is not quite clear to me how you get from an efficiency ratio on the GPU to a meaningful comparison of floating-point throughput between CPU and GPU. Could you clarify your computations in this regard? Are you taking into account that many floating-point instructions on the GPU will be FMAs, i.e. one instruction that performs two math operations?
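One possible way to carry out such a computation is sketched below. It assumes the efficiency is reported as a percentage of whole-GPU peak, and it assumes nominal K20c figures (64 double-precision units per SMX, 706 MHz core clock) that should be checked against the actual device. The helper names and the instruction counts are hypothetical.

```python
# Sketch: convert an efficiency percentage into GFLOP/s, and count
# floating-point operations with each FMA contributing two operations.
# Hardware figures are assumed nominal values for a K20c.

NUM_SMX = 13                # from the question
DP_UNITS_PER_SMX = 64       # assumption: DP units per SMX
CLOCK_GHZ = 0.706           # assumption: nominal core clock

def peak_dp_gflops():
    # Each DP unit can retire one FMA per cycle, i.e. 2 flops per cycle.
    return NUM_SMX * DP_UNITS_PER_SMX * 2 * CLOCK_GHZ

def achieved_gflops(efficiency_pct):
    # Assumes the metric is a percentage of whole-GPU peak.
    return efficiency_pct / 100.0 * peak_dp_gflops()

def count_flops(n_add, n_mul, n_fma):
    # An FMA is one instruction but two math operations.
    return n_add + n_mul + 2 * n_fma

print(f"peak: {peak_dp_gflops():.1f} GFLOP/s")
print(f"10% efficiency: {achieved_gflops(10.0):.1f} GFLOP/s")
# Hypothetical instruction counts, for illustration only.
print(f"flops: {count_flops(n_add=1000, n_mul=1000, n_fma=4000)}")
```

If the CPU-side estimate counts an FMA as one operation while the GPU-side figure counts it as two (or the reverse), the two throughput numbers can differ by up to a factor of two for that reason alone.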