I compared GPU and CPU runs using the profiler, but the FLOPS reported by nvprof seem considerably lower than the value I estimate from the GPU and CPU elapsed times.
The metrics I use are flop_dp_efficiency and flop_sp_efficiency.
The NVIDIA Profiler User's Guide (v6.5) describes flop_dp_efficiency as ‘Ratio of achieved to peak double-precision floating-point operations’. My question is: is the reported value per SMX or for the entire GPU?
I am running on a K20c, which has 13 SMXs. Should the reported values be multiplied by 13?
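For reference, here is a minimal sketch of how I estimate the efficiency from elapsed time. The hardware figures (64 double-precision units and 192 single-precision cores per SMX, 706 MHz core clock, 2 flops per FMA) are assumptions taken from the published K20c specifications, not from nvprof. The flop count and elapsed time are hypothetical placeholders, not my measured values. It computes the ratio against both the whole-GPU peak and a single-SMX peak, which differ by the factor of 13 I am asking about.

```python
# Assumed K20c figures from the published specs (not from nvprof output).
NUM_SMX = 13
DP_UNITS_PER_SMX = 64      # double-precision units per SMX (assumed)
SP_CORES_PER_SMX = 192     # single-precision cores per SMX (assumed)
CORE_CLOCK_HZ = 706e6      # core clock in Hz (assumed)
FLOPS_PER_FMA = 2          # one fused multiply-add counts as 2 flops


def peak_flops(units_per_smx, num_smx=NUM_SMX, clock_hz=CORE_CLOCK_HZ):
    """Theoretical peak in flop/s for the given number of SMXs."""
    return units_per_smx * num_smx * clock_hz * FLOPS_PER_FMA


def efficiency_percent(flop_count, elapsed_s, peak):
    """Achieved flop/s as a percentage of the given peak."""
    return 100.0 * (flop_count / elapsed_s) / peak


peak_dp_gpu = peak_flops(DP_UNITS_PER_SMX)             # whole GPU
peak_dp_smx = peak_flops(DP_UNITS_PER_SMX, num_smx=1)  # one SMX
peak_sp_gpu = peak_flops(SP_CORES_PER_SMX)

# Hypothetical placeholders: replace with the kernel's real flop count
# and its measured elapsed time.
FLOP_COUNT = 2.0e11
ELAPSED_S = 1.0

eff_vs_gpu = efficiency_percent(FLOP_COUNT, ELAPSED_S, peak_dp_gpu)
eff_vs_smx = efficiency_percent(FLOP_COUNT, ELAPSED_S, peak_dp_smx)

print(f"peak DP (GPU): {peak_dp_gpu / 1e9:.1f} GFLOPS")
print(f"peak SP (GPU): {peak_sp_gpu / 1e9:.1f} GFLOPS")
print(f"DP efficiency vs whole-GPU peak:  {eff_vs_gpu:.2f}%")
print(f"DP efficiency vs single-SMX peak: {eff_vs_smx:.2f}%")
```

The value I expect from elapsed time corresponds to the "whole-GPU peak" line, so knowing which peak nvprof uses would tell me whether the two numbers are comparable.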
I would really appreciate any comments.