How to correctly measure kernel exec time?

I executed the kernel 1000 times in a for loop, with one clock() call before the loop and one after it.
So 1000 * kernel exec time = 2nd clock() - 1st clock().

Nothing random is happening, so each run should be exactly the same and produce exactly the same result.

Questions are:

  1. I run the above program repeatedly, and the elapsed time varies from run to run. Strangely, although it varies, certain values seem more likely to occur. For example, in my case I observed 0.328 sec 20 times and 0.312 sec 23 times, and a few other times 0.343 sec, 0.359 sec…
    Why is that?

  2. Some previous posts said the kernel should be warmed up before the ‘real execution’. How do I do a ‘warm up’? I guess not just any arbitrary kernel function can do this warm-up… right? So how do I correctly measure kernel exec time?

  3. The Visual Profiler reports GPU time and CPU time numbers. How far should I trust those values? For some kernels, the numbers from the Visual Profiler match the time measured by clock() well, and for others they are off by around 30%.


Helooo Yaa There… Uhh…ahh…Hmm…,

If you are on “Windows”, use the “QueryPerformanceCounter” function to do it.

I have attached a HEADER file that implements a class called “HPTimer”.

All you need to do is:

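HPTimer profiler;

cudaKernel<<<grid, block>>>();
cudaDeviceSynchronize();  // kernel launches are asynchronous, so wait for the kernel before reading the timer

printf("Time taken = %f seconds\n", profiler.TimeInSeconds());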

The code above makes use of the “rdtsc” Intel instruction OR an appropriate instruction on your platform to find the time. It's quite accurate. But INTERRUPTS and context switches can cause problems… Still, it is fairly good for GPU time measurement because interrupts and context switches do NOT occur inside the GPU.

And you may need to average it over 3 to 4 runs…
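The warm-up and averaging idea could be sketched roughly like this. It is only a host-side illustration: `average_seconds` is a made-up helper (it is not part of the attached header), and it uses std::chrono::steady_clock as a stand-in for the HPTimer class. It assumes the `work` callable blocks until the job is really finished.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not from PerformanceCounter.h): calls `work` a few
// times untimed as warm-up, then returns the mean wall-clock time in seconds
// of the timed calls. `work` must not return before the job is complete.
template <typename F>
double average_seconds(F&& work, int warmup_runs, int timed_runs) {
    assert(warmup_runs >= 0);
    assert(timed_runs > 0);

    // Warm-up: these calls absorb one-time start-up costs and are not timed.
    for (int i = 0; i < warmup_runs; ++i) work();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timed_runs; ++i) work();
    auto stop = std::chrono::steady_clock::now();

    // Total elapsed seconds divided by the number of timed calls.
    std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / timed_runs;
}
```

For a CUDA kernel, `work` would be a lambda that launches the kernel and then calls cudaDeviceSynchronize(), for example average_seconds(work, 1, 4) for one warm-up launch and four timed ones. Without the synchronize call you would mostly be timing the launch, not the kernel.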


I suggest you manually use the “rdtsc” instruction to measure time ticks, then find a way to get the frequency of the TSC and calculate the time in microseconds!! The Linux kernel does NOT give a proper way to do high-resolution time measurement. It's a bane… I am assuming you are using an “Intel” machine. If you are using PPC, the equivalent is the “mftb” instruction…
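The tick-to-time arithmetic is just a division by the counter frequency. A minimal sketch, assuming you already have two tick readings (on x86 compilers these could come from the `__rdtsc()` intrinsic) and that you have found the TSC frequency yourself; the function name `ticks_to_microseconds` is made up for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts the difference between two tick readings
// into microseconds. `ticks_per_second` is the counter frequency, which the
// caller has to determine separately (it is an input here, not measured).
double ticks_to_microseconds(std::uint64_t start_ticks,
                             std::uint64_t end_ticks,
                             double ticks_per_second) {
    assert(ticks_per_second > 0.0);
    assert(end_ticks >= start_ticks);

    std::uint64_t delta = end_ticks - start_ticks;
    // seconds = delta / frequency, microseconds = seconds * 1e6
    return static_cast<double>(delta) * 1e6 / ticks_per_second;
}
```

For example, 3,000,000 ticks on a counter running at 3 GHz comes out as 1000 microseconds. Note that this only holds if the counter frequency stays constant during the measurement.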


Let me know how that works Hello…Ya…


Bye bye

Best Regards,


PS: My header file needs the Platform SDK. Make sure it is installed on your computer OR you will get compile errors.
PerformanceCounter.h (587 Bytes)

Heisenberg’s uncertainty principle, maybe… :-)
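A more mundane possibility for the clustered values in question 1 is timer granularity. This is a guess, not something confirmed in this thread: if the timer behind clock() only advances in steps of 1/64 s (15.625 ms, a common default tick on Windows) and the reading is truncated to whole milliseconds, then the only readings you can get are multiples of that step. The sketch below (with a made-up function name, `reading_in_ms`) lists what such a clock would report:

```cpp
#include <cassert>

// Assumption (not confirmed by the thread): the timer advances in steps of
// 1/64 s = 15.625 ms and the reading is truncated to whole milliseconds.
// Returns the reading in milliseconds after `tick_count` timer steps.
long reading_in_ms(int tick_count) {
    assert(tick_count >= 0);
    // 15.625 ms = 125/8 ms; integer division truncates toward zero.
    return static_cast<long>(tick_count) * 125 / 8;
}
```

Under that assumption, 20, 21, 22 and 23 ticks give 312, 328, 343 and 359 ms, which are the four values reported above. So the 1000-launch loop would be taking somewhere between 20 and 23 timer ticks, and the clock simply cannot show anything in between.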