[Solved] Relation of elapsed_cycles_sm and kernel execution time in CUDA

1. Timing results of my CUDA kernel:

$nvprof ./1
    Time(%) Time     Calls  Avg       Min       Max       Name
    43.08%  2.1803ms  1     2.1803ms  2.1803ms  2.1803ms  mm_kernel(float*, int, int)
    ... ...

2. Timing results of the nvprof hardware counter:

$nvprof  --events elapsed_cycles_sm ./1
Kernel: mm_kernel(float*, int, int)
Invocations     Event Name         Min         Max         Avg
1               elapsed_cycles_sm  15530199    15530199    15530199

I am using a Tesla K80, and the GPU max clock rate is 0.82 GHz. Hence the real time should be

15530199/(0.82 * 10^9) * 10^3 = 18.9 ms

3. Timing results of metric computation:

$nvprof --metrics l2_read_transactions,l2_read_throughput
l2_read_transactions    3923834 (one increment per 32 B)
l2_read_throughput      56.983216GB/s

Hence the elapsed time should be

3923834 * 32 B / 56.983216 GB/s = 2.20 ms

According to the results above, the real execution time of this kernel should be around 2.20 ms, and the execution time calculated from the elapsed cycles is incorrect.
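
For reference, the two estimates above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the variable names are mine, the numbers are copied from the nvprof output in this post, and 1 GB is assumed to mean 10^9 bytes.

```python
# Values copied from the nvprof output in this post.
elapsed_cycles_sm = 15_530_199      # event value reported by nvprof
max_clock_hz = 0.82e9               # K80 max clock rate quoted above
l2_read_transactions = 3_923_834    # each transaction is 32 bytes
l2_read_throughput = 56.983216e9    # bytes/s, assuming 1 GB = 10^9 bytes

# Estimate 1: treat the cycle count as if it were cycles of a single clock.
time_from_cycles_ms = elapsed_cycles_sm / max_clock_hz * 1e3

# Estimate 2: bytes read from L2 divided by the reported read throughput.
time_from_l2_ms = l2_read_transactions * 32 / l2_read_throughput * 1e3

print(f"from elapsed_cycles_sm: {time_from_cycles_ms:.1f} ms")   # 18.9 ms
print(f"from L2 metrics:        {time_from_l2_ms:.2f} ms")       # 2.20 ms
```

Only the second estimate agrees with the 2.1803 ms that nvprof reports for the kernel.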

So my question is: what is the relation between elapsed_cycles_sm and kernel execution time?

/Solved/

elapsed_cycles_sm is an aggregate value over all active SMs (elapsed_cycles_sm_1 + elapsed_cycles_sm_2 + … + elapsed_cycles_sm_N).
If the number of thread blocks is greater than the number of SMs in the hardware, then #active_sm should equal the number of SMs (13 per GPU on the K80).
And if the work assigned to the SMs is balanced, the elapsed cycles of each SM should be the same.
Hence we have
kernel execution time = elapsed_cycles_sm / #active_sm / GPU_Clock_rate.

If the assigned work is unbalanced,
kernel execution time = max(elapsed_cycles_sm_1, elapsed_cycles_sm_2, …, elapsed_cycles_sm_N) / GPU_Clock_rate.
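
The two formulas above can be written as a small Python sketch. The function names are made up for illustration, and the per-SM cycle counts in the unbalanced example are hypothetical values, not measurements from this kernel.

```python
def balanced_time_ms(aggregated_cycles, active_sms, clock_hz):
    """Balanced case: every SM ran for aggregated_cycles / active_sms cycles."""
    return aggregated_cycles / active_sms / clock_hz * 1e3


def unbalanced_time_ms(per_sm_cycles, clock_hz):
    """Unbalanced case: the kernel finishes when the busiest SM finishes."""
    return max(per_sm_cycles) / clock_hz * 1e3


# Numbers from the question: aggregated event value, 13 SMs, 0.82 GHz.
balanced_ms = balanced_time_ms(15_530_199, 13, 0.82e9)
print(f"balanced:   {balanced_ms:.2f} ms")   # 1.46 ms

# Hypothetical per-SM counts: 12 SMs at 1.00M cycles, one SM at 1.64M cycles.
per_sm = [1_000_000] * 12 + [1_640_000]
unbalanced_ms = unbalanced_time_ms(per_sm, 0.82e9)
print(f"unbalanced: {unbalanced_ms:.2f} ms")  # 2.00 ms
```

With the numbers from the question, the balanced formula gives about 1.46 ms, which is in the same range as the 2.18 ms reported by nvprof rather than 18.9 ms. The remaining gap may be because the GPU was not running at its 0.82 GHz maximum clock during the run, but that is an assumption and was not measured here.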

We can turn on aggregate mode in nvprof v7.5 with the option "--aggregate-mode on".

Hi iamkaka,

Did you find the reason for this discrepancy? I have the same issue, and I am using a Tesla K80 too. I created another thread with the same question, and no one has responded so far. I just stumbled onto your post. Please let me know if you found the reason for the discrepancy between elapsed_cycles_sm and the time in seconds from nvprof summary mode. Thanks!