[Solved] Relation of elapsed_cycles_sm and kernel execution time in CUDA

iamkaka · July 8, 2016, 8:41am

Below are the timing results of my CUDA kernel.

1. Timing results from nvprof summary mode:

$nvprof ./1
    Time(%) Time      Calls Avg       Min       Max       Name
    43.08%  2.1803ms  1     2.1803ms  2.1803ms  2.1803ms  mm_kernel(float*, int, int)
    ... ...

2. Timing results from the nvprof hardware counter:

$nvprof  --events elapsed_cycles_sm ./1
Kernel: mm_kernel(float*, int, int)
Invocations     Event Name         Min         Max         Avg
1               elapsed_cycles_sm  15530199    15530199    15530199

I am using a Tesla K80, and the GPU max clock rate is 0.82 GHz. Hence the real time should be

15530199/(0.82 * 10^9) * 10^3 = 18.9 ms
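The same arithmetic as a minimal Python sketch (the function name is illustrative, not from any library). It assumes the counter can be read as the cycle count of a single clock running at the maximum rate:

```python
def cycles_to_ms(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count to milliseconds at a fixed clock rate."""
    return cycles / clock_hz * 1e3

elapsed_cycles_sm = 15_530_199   # from the nvprof event output above
max_clock_hz = 0.82e9            # K80 max clock rate quoted above

print(f"{cycles_to_ms(elapsed_cycles_sm, max_clock_hz):.1f} ms")  # 18.9 ms
```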

3. Timing results from metric computation:

$nvprof --metrics l2_read_transactions,l2_read_throughput
l2_read_transactions    3923834 (increment per 32B)
l2_read_throughput      56.983216GB/s

Hence the elapsed time should be

3923834 * 32 B / (56.983216 GB/s) = 2.20 ms
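A small sketch of this estimate in Python. The function name is illustrative, and it assumes that 1 GB/s here means 10^9 bytes per second, which is the interpretation that reproduces the 2.20 ms figure:

```python
def transfer_time_ms(transactions: int, bytes_per_transaction: int,
                     throughput_gb_s: float) -> float:
    """Time implied by a transaction count and an average throughput."""
    total_bytes = transactions * bytes_per_transaction
    # Assumption: 1 GB/s = 1e9 bytes per second.
    return total_bytes / (throughput_gb_s * 1e9) * 1e3

l2_read_transactions = 3_923_834    # from nvprof, 32 bytes each
l2_read_throughput = 56.983216      # GB/s, from nvprof

print(f"{transfer_time_ms(l2_read_transactions, 32, l2_read_throughput):.2f} ms")  # 2.20 ms
```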

According to the results above, the real execution time of this kernel should be around 2.20 ms, and the execution time calculated from the elapsed cycles is incorrect.

So my question is: what is the relation between elapsed_cycles_sm and kernel execution time?

/Solved/

elapsed_cycles_sm is an aggregate value over all active SMs (elapsed_cycles_sm_1 + elapsed_cycles_sm_2 + … + elapsed_cycles_sm_N).
If the number of thread blocks is greater than the number of SMs in hardware, then #active_sm should equal the number of SMs (13 for K80).
And if the work assigned to the SMs is balanced, the elapsed cycles of each SM should be the same.
Hence we have
kernel execution time = elapsed_cycles_sm / #active_sm / GPU_Clock_rate.
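
As a rough check of the balanced-case formula, here is a Python sketch using the numbers from the question (function names are illustrative). It assumes all 13 SMs were active and equally loaded:

```python
def kernel_time_ms(elapsed_cycles_sm: int, active_sms: int, clock_hz: float) -> float:
    """Balanced case: aggregate cycles divided evenly over the active SMs."""
    return elapsed_cycles_sm / active_sms / clock_hz * 1e3

def implied_clock_mhz(elapsed_cycles_sm: int, active_sms: int, measured_ms: float) -> float:
    """Clock rate that would make the formula match a measured kernel time."""
    return elapsed_cycles_sm / active_sms / (measured_ms * 1e-3) / 1e6

cycles = 15_530_199      # aggregate elapsed_cycles_sm from nvprof
sms = 13                 # assumption: all 13 SMs active and balanced

print(f"{kernel_time_ms(cycles, sms, 0.82e9):.2f} ms")       # at the 0.82 GHz max clock
print(f"{implied_clock_mhz(cycles, sms, 2.1803):.0f} MHz")   # clock implied by 2.1803 ms
```

At the 0.82 GHz max clock this gives about 1.46 ms, somewhat below the 2.18 ms that nvprof reports. The measured time would correspond to an effective clock of roughly 548 MHz, so the clock actually in effect during the run may have been below the maximum. That is an inference from the arithmetic only, not something confirmed in this thread.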

If the assigned work is unbalanced,
kernel execution time = max(elapsed_cycles_sm_1, elapsed_cycles_sm_2, …, elapsed_cycles_sm_N) / GPU_Clock_rate.
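
A minimal sketch of the unbalanced case in Python. The per-SM cycle counts below are hypothetical values chosen for illustration, not measurements from this kernel:

```python
def kernel_time_ms_unbalanced(per_sm_cycles: list[int], clock_hz: float) -> float:
    """Unbalanced case: the busiest SM determines the kernel time."""
    return max(per_sm_cycles) / clock_hz * 1e3

# Hypothetical per-SM elapsed cycles (one entry per active SM).
per_sm_cycles = [1_100_000, 1_250_000, 900_000]
clock_hz = 0.82e9   # max clock rate quoted in the question

print(f"{kernel_time_ms_unbalanced(per_sm_cycles, clock_hz):.3f} ms")
```

The sum of the list is what the aggregate counter would report; only the maximum matters for the kernel time.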

We can turn on the aggregate mode in nvprof v7.5 with the option "--aggregate-mode on".

NC1 · September 8, 2016, 1:14am

Hi iamkaka,

Did you find the reason for this discrepancy? I have the same issue, and I am using a Tesla K80 too. I created another thread with the same question and no one has responded so far. I just stumbled onto your post. Please let me know if you found the reason for this discrepancy between elapsed_cycles_sm and the time in seconds from nvprof summary mode. Thanks!
