Should we rely on event recording or nvprof values for kernel execution time?

Hello everyone,

I want to measure the execution time of my kernel. For that I use cudaEvents, which give me a value on the order of milliseconds.
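For reference, here is a minimal sketch of the cudaEvent-based timing described above; the kernel, its launch configuration, and the data sizes are placeholders, not from the thread:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, only here so the timing code has something to measure.
__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events on the same (default) stream as the kernel,
    // bracketing only the launch we want to time.
    cudaEventRecord(start);
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);

    // The elapsed time is only valid once the stop event has completed.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```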

When I check that with nvprof I get a much smaller value: for small kernels the difference is on the order of 10 ms, but for big kernels it's about 10 us.

This has created some confusion in my approach to measuring kernel execution time.

The overhead we see with cudaEvents is probably linked to calling cudaEventRecord(), right? And the kernel execution times recorded by nvprof do not include this overhead? I need to know which values I should rely on.

Your help enlightening me would be appreciated.
Thank you

It’s generally recommended to use the profiler-reported times. cudaEvent can give unexpected results in multi-streamed settings, on Windows WDDM, or on Linux if the GPU is driving a display.
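As a point of comparison, the profiler-reported kernel times can be obtained like this (`./my_app` is a placeholder application name):

```shell
# nvprof prints a per-kernel summary of GPU execution times
# measured by the profiler itself.
nvprof ./my_app

# On recent CUDA toolkits nvprof has been superseded by Nsight Systems,
# which produces a similar kernel summary with --stats=true.
nsys profile --stats=true ./my_app
```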

Thank you for this clarification.

Regarding “multi-streamed settings”: each event is tied to a particular stream. Are you saying there are bugs in the cudaEvent timings?

I’ve seen a lot of discrepancies between cudaEvent timings and nvprof, but generally the average kernel times over 100s/1000s of runs are quite close.

Having “in-application” performance measurement possibilities is a huge benefit for developers, so it would be great if NVIDIA could address the discrepancy between the event timers and nvprof. Explaining how and why would go a long way toward assessing the trustworthiness/usefulness of the event timers.


No, I’m not saying there are bugs.

I’m referring to the same hazard that ArchaeaSoftware indicates in the last paragraph of the answer here:

The hazard is similar in nature to what can happen on WDDM or on a display GPU: you can end up timing things you don’t expect.

The suggested way to ask for a change in CUDA behavior is to file a bug.