Below are the timing results of my CUDA kernel.

1. Timing results from nvprof:

```
$nvprof ./1
Time(%) Time Calls Avg Min Max Name
43.08% 2.1803ms 1 2.1803ms 2.1803ms 2.1803ms mm_kernel(float*, int, int)
... ...
```

2. Timing results from the nvprof hardware counter:

```
$nvprof --events elapsed_cycles_sm ./1
Kernel: mm_kernel(float*, int, int)
Invocations Event Name Min Max Avg
1 elapsed_cycles_sm 15530199 15530199 15530199
```

I am using a Tesla K80, whose GPU max clock rate is 0.82 GHz. Hence the real time should be:

```
15530199/(0.82 * 10^9) * 10^3 = 18.9 ms
```

3. Timing results from metric computation:

```
$nvprof --metrics l2_read_transactions,l2_read_throughput
l2_read_transactions 3923834 (increment per 32B)
l2_read_throughput 56.983216GB/s
```

Hence the elapsed time should be:

```
3923834 * 32 B / (56.983216 GB/s) = 2.20 ms
```
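The two estimates above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the variable names are mine, and it assumes that the GB/s reported by nvprof means 10^9 bytes per second.

```python
# Numbers copied from the nvprof output above.
elapsed_cycles_sm = 15_530_199      # event value reported by nvprof
max_clock_hz = 0.82e9               # Tesla K80 max clock rate (0.82 GHz)

l2_read_transactions = 3_923_834    # each transaction moves 32 bytes
l2_read_throughput = 56.983216e9    # bytes/s, assuming 1 GB = 10^9 bytes

# Estimate 1: treat the event value as cycles of a single clock.
time_from_cycles_ms = elapsed_cycles_sm / max_clock_hz * 1e3

# Estimate 2: bytes read through L2 divided by the read throughput.
time_from_l2_ms = l2_read_transactions * 32 / l2_read_throughput * 1e3

print(f"from cycles:        {time_from_cycles_ms:.1f} ms")
print(f"from L2 throughput: {time_from_l2_ms:.2f} ms")
```

The first estimate is roughly an order of magnitude larger than the second, which is the discrepancy in question.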

According to the results above, the real execution time of this kernel should be around 2.20 ms, which is close to the 2.1803 ms reported by nvprof. So the execution time calculated from the elapsed cycles is incorrect.

So my question is: what is the relation between **elapsed_cycles_sm** and **kernel execution time**?

/*Solved*/

elapsed_cycles_sm is an aggregate value over all active SMs (elapsed_cycles_sm_1 + elapsed_cycles_sm_2 + … + elapsed_cycles_sm_N).

If the number of thread blocks is greater than the number of SMs in the hardware, then #active_sm should equal the number of SMs (13 for K80).

And if the work assigned to the SMs is balanced, the elapsed cycles of each SM should be the same.

Hence we have

kernel execution time = elapsed_cycles_sm / #active_sm / GPU_Clock_rate.
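As a rough check, the balanced-case formula can be applied to the numbers from the question. This is a sketch with a helper name of my own choosing (`kernel_time_ms`). It assumes all 13 SMs were active and that the GPU ran at the quoted 0.82 GHz max clock, and the clock actually in effect during the run may have been different.

```python
def kernel_time_ms(elapsed_cycles_sm, active_sms, clock_hz):
    """Balanced case: every active SM runs for the same number of cycles."""
    cycles_per_sm = elapsed_cycles_sm / active_sms
    return cycles_per_sm / clock_hz * 1e3


# 13 active SMs and 0.82 GHz are the figures quoted in this post;
# using the max clock rate here is an assumption, not a measurement.
t = kernel_time_ms(15_530_199, 13, 0.82e9)
print(f"{t:.2f} ms")
```

The result lands in the low-millisecond range, the same order of magnitude as the time nvprof reports and no longer the 18.9 ms from the naive calculation. How close it gets depends on the clock rate plugged in.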

If the work assigned to the SMs is unbalanced,

kernel execution time = max(elapsed_cycles_sm_1, elapsed_cycles_sm_2, …, elapsed_cycles_sm_N) / GPU_Clock_rate.
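The unbalanced case can be sketched the same way. The per-SM cycle counts below are hypothetical values made up for illustration, not measurements, and the helper name is mine. The point is only that the busiest SM determines the kernel time, so averaging would underestimate it.

```python
def kernel_time_unbalanced_ms(per_sm_cycles, clock_hz):
    """Unbalanced case: the kernel ends when the busiest SM finishes."""
    return max(per_sm_cycles) / clock_hz * 1e3


# Hypothetical per-SM cycle counts (not measured), for illustration only.
per_sm_cycles = [1_000_000, 1_200_000, 1_640_000, 900_000]
clock_hz = 0.82e9  # max clock rate quoted in this post

t_max = kernel_time_unbalanced_ms(per_sm_cycles, clock_hz)
t_avg = sum(per_sm_cycles) / len(per_sm_cycles) / clock_hz * 1e3
print(f"max-based: {t_max:.2f} ms, average-based: {t_avg:.2f} ms")
```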

We can control aggregate mode in nvprof v7.5 with the option `--aggregate-mode on|off`; with it off, event values are reported per SM instead of as a sum.