Do the following (Python syntax/Linux, since this is how I do it):

from time import time

t0 = time()
# GPU work (kernels, I/O) goes here
t1 = time()
print('elapsed time', t1 - t0)
You can also use cudaEvents to record the GPU time; this usually gives more or less the same results. What is essential is cudaThreadSynchronize, since kernel launches are asynchronous. If you write data to the GPU prior to the kernel launch, or read results back after it, that will do an implicit synchronization.
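Here is a minimal sketch of the event approach, assuming you are driving the GPU from PyCUDA (pycuda.driver); in PyCUDA, drv.Context.synchronize() plays the role of cudaThreadSynchronize:

import pycuda.autoinit
import pycuda.driver as drv

start = drv.Event()
end = drv.Event()

start.record()
# kernel launch(es) go here
end.record()
end.synchronize()  # block the host until the GPU has passed the 'end' event
print('GPU time: %.3f ms' % start.time_till(end))

Note that time_till reports milliseconds of GPU time between the two recorded events.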
The time function in the time module in Python under Linux has a resolution of about one microsecond (essentially the resolution of the underlying Linux timers). If you do this in C, you might not want to use clock, because that measures CPU time, not elapsed time; use gettimeofday instead. That gets you elapsed wall time, which is what matters for all practical estimates.
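To see the difference being warned about here, a small illustration in Python (time.process_time is the standard-library CPU-time counterpart of time.time):

import time

t_wall0 = time.time()           # wall clock (gettimeofday under the hood on Linux)
t_cpu0 = time.process_time()    # CPU time consumed by this process
time.sleep(1.0)                 # process idles: wall time advances, CPU time barely moves
t_wall1 = time.time()
t_cpu1 = time.process_time()
print('wall: %.3f s  cpu: %.3f s' % (t_wall1 - t_wall0, t_cpu1 - t_cpu0))

Anything that makes the host idle (sleeping, blocking on I/O) is invisible to a CPU-clock timer, which is why wall time is the right quantity here.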
Also, starting up the GPU takes some time, especially so with code using the CUDA runtime, CUBLAS etc., so running a kernel once prior to timing - called a "warmup" in some SDK examples - is recommended. The actual timing code will then be unaffected by this. On my cards startup is usually about 0.3 to 0.6 sec.
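Putting the pieces together, a self-contained sketch of the warmup pattern, again assuming PyCUDA (the trivial scale kernel is just a placeholder):

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from time import time

mod = SourceModule("""
__global__ void scale(float *x) { x[threadIdx.x] *= 2.0f; }
""")
scale = mod.get_function("scale")

x_gpu = drv.mem_alloc(256 * 4)   # room for 256 floats on the device
drv.memcpy_htod(x_gpu, np.ones(256, dtype=np.float32))

scale(x_gpu, block=(256, 1, 1))  # warmup launch, absorbs the startup cost
drv.Context.synchronize()

t0 = time()
scale(x_gpu, block=(256, 1, 1))  # the launch we actually time
drv.Context.synchronize()        # launches are async: wait before reading the clock
t1 = time()
print('elapsed time', t1 - t0)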
I'm not exactly sure that it's 'fair game' to warm up the GPU before starting the execution timing. If you are doing this timing to report speedups of the GPU over the CPU, then you should "record" everything, at least from the point of divergence, including the cudaMemcpy calls and the "warmup" kernel runs.
Now, if you just want to know the time, and aren't comparing it to CPU code, then do whatever you'd like. I just thought I'd throw my two cents in there for fairness.