Most accurate method of timing CUDA kernels and related memory operations

I ran into a situation where an application's timer uses gettimeofday() for timing both CUDA kernels and the time needed to copy memory back and forth between CPU and GPU.

This is the function used:

Looking at the results and comments of other users on this site, I suspect that this timing method is not accurate.

In Windows I generally use the OS timing function in addition to the timing results from nvprof. It seems to me that the results from nvprof are the most accurate, but this is just speculation and I would like some feedback on a method which will really stand up to any critique.

This is a situation where milliseconds matter, and the burden of proof is on me to show the site operators that gettimeofday() may be missing some time related to memory operations, and that nvprof is a more accurate method which is less likely to be manipulated.

In my experience the Windows timer matches the times in nvprof, but for my results I always use the nvprof times because I assume that NVIDIA got it right.


Here is the timing code I have used for about a dozen years, with CUDA and otherwise. As you can see, for Linux it is based on gettimeofday(). This is accurate down to 1 microsecond (mostly on account of the limited precision of the 'double' type; the OS timers often have resolution better than that). If you see big fluctuations in timing, it could be due to a lack of warm-up causing various cache and TLB stalls, inefficiencies in the I/O calls, etc.

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif


Would nvprof (via --print-gpu-trace) be any more/less accurate than this method? Is there any profiling overhead included in the nvprof output times?

I am starting to suspect that the site (and the testing application) are not including a good chunk of the device-host memory copy times, while I have been including these times in my total. If true then this will shave about 15% off my full application results at a minimum.

In my (limited) experience, nvprof times are accurate and usually match well with results from timing functions like the ones I posted.

Overhead may always be an issue when timing extremely short events. When doing one's own timing, one can use calibration by measuring the time elapsed between two back-to-back timer calls, e.g. overhead = second() - second();. In addition, I usually make sure the timer code is present in the cache by using the sequence: start = second(); start = second(); [code to be timed] stop = second(); elapsed = stop - start - overhead;. What kind of calibration process nvprof uses, I do not know.

If you suspect that resolution rather than overhead is the underlying issue, you could use "statistical" timing. Assume you measure the time taken by a particular piece of code 100 times, and observe a counter increment of 1 tick fifty-five times and of 0 ticks forty-five times; then the elapsed time would be 0.55 ticks. In the past, I used this technique successfully to get cycle-accurate timings utilizing just a millisecond timer. The other alternative is obviously to replicate the code under test N times, but that can be much more intrusive (different cache hit rates, etc.).

Note that in a comment in the Stack Overflow thread you pointed at, someone opines that one should never use gettimeofday() for timing, but rather clock_gettime(). That's news to me, and I am not sure I buy their arguments, but it may be worth exploring.


One more thought: When timing memory operations, there is often considerable variability due to the complex nature of modern memory hierarchies, and the many possible interactions between the various layers. Standard benchmarks such as STREAM therefore employ a best-of-ten-runs strategy, e.g. they measure ten runs and report performance based on the fastest one. This makes for very stable timings and therefore easy comparisons, however it leaves open the question how much variability will occur in real-life scenarios.

I regularly use gettimeofday as a host-based timing function on Linux. I have never witnessed anything that looked odd about the results it produces. I'd be very much surprised if it had more than a microsecond or two of error (whatever that means, since we haven't defined it). I've never done careful analysis or benchmarking, but the results are generally consistent with what I see from nvprof.

I’m not sure what kind of accuracy or scale of timing differences you are considering.

Here is the snippet I generally drop in my test stuff:

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
I think usage is pretty straightforward. Seems to be pretty similar to what njuffa posted. I don’t do any float arithmetic until I am actually printing times out. I carry time around inside my code as unsigned 64-bit integer quantities.

unsigned long long my_time = dtime_usec(0);
cudaDeviceSynchronize(); //if I am timing device code
my_time = dtime_usec(my_time);

printf("elapsed time = %fs\n", my_time/(float)USECPSEC);

It turns out the reference timing application does not include the memory copies back from the device to the host in the total time, while my test version was including this time.

In the end this is a positive development, since that time is over 400 ms even with pinned memory, so I can subtract it from my algorithm running times, which will move my implementation up a place in the most competitive category.

So that method of application timing, assuming correct placement of cudaDeviceSynchronize(), is correct for such purposes.
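For reference, a self-contained sketch of that placement, reusing the dtime_usec() helper posted above; the scale kernel, buffer size, and launch configuration here are made-up placeholders, not from the thread:

```cuda
#include <cstdio>
#include <cstdlib>
#include <sys/time.h>

#define USECPSEC 1000000ULL

// Same helper as posted above.
unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

// Placeholder kernel, just something to time.
__global__ void scale(float *data, int n){
  int i = blockIdx.x*blockDim.x+threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main(){
  const int n = 1 << 20;
  const size_t bytes = n * sizeof(float);
  float *h = (float *)malloc(bytes);
  for (int i = 0; i < n; i++) h[i] = 1.0f;
  float *d;
  cudaMalloc(&d, bytes);

  unsigned long long total = dtime_usec(0);

  unsigned long long t = dtime_usec(0);
  cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
  unsigned long long t_h2d = dtime_usec(t);

  t = dtime_usec(0);
  scale<<<(n+255)/256, 256>>>(d, n);
  cudaDeviceSynchronize();  // kernel launches are asynchronous; sync before stopping the timer
  unsigned long long t_kernel = dtime_usec(t);

  t = dtime_usec(0);
  cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // blocking with respect to the host
  unsigned long long t_d2h = dtime_usec(t);

  total = dtime_usec(total);
  printf("h2d %fs kernel %fs d2h %fs total %fs\n",
         t_h2d/(float)USECPSEC, t_kernel/(float)USECPSEC,
         t_d2h/(float)USECPSEC, total/(float)USECPSEC);
  cudaFree(d); free(h);
  return 0;
}
```

The key point is that the kernel launch returns immediately, so cudaDeviceSynchronize() must come before the stop reading; cudaMemcpy in this form already blocks the host, so no extra synchronization is needed around the copies.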