I want to measure the execution time of my kernel. For that I use CUDA events (cudaEvent_t), which give me a value on the order of milliseconds.
When I check the same kernel with nvprof I get a much smaller value: for small kernels the difference is on the order of 10 ms, but for big kernels it is about 10 µs.
This has left me confused about the right way to measure kernel execution time.
Is the overhead I see with CUDA events perhaps linked to calling cudaEventRecord()? And do the kernel execution times reported by nvprof exclude these overheads? Which of the two values should I rely on?
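For reference, here is a minimal sketch of the kind of event-based timing I mean. The kernel `myKernel`, the helper `timeKernelMs`, the `CUDA_CHECK` macro and the problem size are placeholders I made up for illustration, not my real code. The untimed warm-up launch is there in case one-time setup on the first launch is part of what inflates the event-based number; I am not sure that is the whole explanation.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (illustrative helper).
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Placeholder kernel (hypothetical): one multiply-add per element.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Times one launch of myKernel with CUDA events and returns milliseconds.
float timeKernelMs(int n) {
    float *d_data = nullptr;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // Untimed warm-up launch, so first-launch setup is not measured.
    myKernel<<<blocks, threads>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Timed launch: both events are recorded in the default stream,
    // immediately before and after the kernel.
    CUDA_CHECK(cudaEventRecord(start));
    myKernel<<<blocks, threads>>>(d_data, n);
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaGetLastError());

    // Block the host until the stop event has actually completed.
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(d_data));
    return ms;
}

int main() {
    const int n = 1 << 20;  // arbitrary size chosen for the sketch
    printf("myKernel: %.3f ms (CUDA events)\n", timeKernelMs(n));
    return 0;
}
```

Built with something like `nvcc timing.cu -o timing` and then run under `nvprof ./timing`, this should let the event-based figure be compared directly with the per-kernel time nvprof reports for the same launch.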
Any help clearing this up would be appreciated.