1% variation in kernel timings cuda-5.0/samples/bin/linux/release/clock GeForce 295 GTX

I was hoping to use the kernel clock() function to get accurate (reproducible) timings
of my kernel. So far I have been running the clock example from the CUDA 5.0 samples.
Averaged over 30 runs on a pretty much idle Linux PC, the example kernel
takes about 13700 ticks, but the sample standard deviation is 130 (i.e. about 1%).
Is this expected?
Why do the on-board GPU times vary so much?
Is there anything I can do (to my kernels) to reduce the variation?
Thank you
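
For anyone wanting to reproduce the figure quoted above, here is a minimal sketch of how the mean, sample standard deviation and relative spread can be computed from a list of tick counts. The function name `relative_spread` and the tick values are hypothetical and are not measurements from this thread.

```python
import statistics


def relative_spread(ticks):
    """Return (mean, sample standard deviation, stdev / mean) for tick counts."""
    mean = statistics.mean(ticks)
    stdev = statistics.stdev(ticks)  # sample standard deviation (n - 1 divisor)
    return mean, stdev, stdev / mean


# Hypothetical tick counts, NOT measurements from the thread.
example_ticks = [13700, 13570, 13830, 13700, 13640, 13760]
mean, stdev, ratio = relative_spread(example_ticks)
print(f"mean={mean:.0f} ticks, stdev={stdev:.0f} ticks, relative={ratio:.2%}")
```

With the numbers in the post, 130 / 13700 is roughly 0.95%, which is where "about 1%" comes from.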

I consider 1% pretty accurate. I guess to achieve that you have already made sure the GPU does not run the GUI?

The GPU itself has several different clock sources and is not deterministic, because a GHz clock signal is not easy to distribute over a multi-billion transistor chip.

I do not know anything about the properties of your application, but I think in general you will find that execution times for consecutive runs of an application cluster tightly if an application is compute intensive, and are more spread out when it is memory intensive. In my observation this applies to CPUs and GPUs alike.

Given the large amount of overall system state involved, it is pretty much impossible to achieve exactly the same starting state for each run. Small differences in initial state then lead to small differences in performance. They may occasionally be amplified somewhat by the “butterfly effect”.

In practical terms (e.g. automated performance tracking systems) I usually advocate a “measurement noise” limit of 2% to avoid false positives. I would consider a noise level around 1% excellent. In my thinking, that is sufficient in practical terms. When the impact of code changes for performance tuning purposes gets down to the range of a 1% noise level it probably is time to stop tuning.
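
As a rough illustration of the "measurement noise" limit described above, the sketch below treats two timings as different only when their relative difference exceeds a threshold. The function name `is_significant` and the default of 2% are assumptions made for this example, not part of any tool.

```python
def is_significant(baseline_ticks, candidate_ticks, noise_limit=0.02):
    """Return True if the relative timing change exceeds the noise limit.

    noise_limit is a fraction of the baseline (0.02 means 2%); this default
    is an assumption taken from the 2% figure suggested in the post.
    """
    if baseline_ticks <= 0:
        raise ValueError("baseline_ticks must be positive")
    relative_change = abs(candidate_ticks - baseline_ticks) / baseline_ticks
    return relative_change > noise_limit


# Hypothetical timings: a change of about 0.9% is treated as noise,
# a change of about 5% is treated as a real difference.
print(is_significant(13700, 13830))
print(is_significant(13700, 13000))
```

An automated tracker would then flag a regression only when `is_significant` returns True, which avoids false positives from run-to-run jitter.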

Dear tera and njuffa,
Thank you for your helpful replies.
I shall try getting my genetic algorithm to ignore differences
in timings of 2% or less.

PS: I am running a “headless” GeForce 295 GTX (i.e. no display connected).
Unfortunately /usr/bin/nvidia-smi v4.304.54 tells me “Persistence Mode” is “Disabled”.
(I think this means the driver is reloaded on each use.) The trick of using
/usr/bin/nvidia-smi -a -l 10 > /dev/null & to leave smi running in the background
does reduce the whole application runtime (I think by avoiding
Linux continually reloading the driver), but it has no effect on the mean runtime of
the samples clock and no obvious effect on its variance either.
Thanks again

1% variation is really good. For kernels that are memory bound I expect to see variations as high as 10%. For kernels that have a low number of waves of thread blocks I would expect to see variations as high as (1/waves).
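
To make the (1/waves) estimate concrete, here is a small sketch. It assumes that a "wave" means one set of thread blocks that can be resident on the GPU at the same time, so the number of waves is the grid size divided by the number of concurrently resident blocks, rounded up. That reading, the function names and the example numbers are assumptions, not something stated in the post.

```python
import math


def waves(total_blocks, num_sms, blocks_per_sm):
    """Number of waves needed to run total_blocks.

    Assumes a wave is num_sms * blocks_per_sm blocks resident at once.
    """
    if total_blocks <= 0 or num_sms <= 0 or blocks_per_sm <= 0:
        raise ValueError("all arguments must be positive")
    concurrent_blocks = num_sms * blocks_per_sm
    return math.ceil(total_blocks / concurrent_blocks)


def variation_estimate(total_blocks, num_sms, blocks_per_sm):
    """Rough upper estimate of timing variation, taken as 1 / waves."""
    return 1.0 / waves(total_blocks, num_sms, blocks_per_sm)


# Hypothetical launch: 240 blocks on 30 SMs with 2 resident blocks per SM.
print(waves(240, 30, 2), variation_estimate(240, 30, 2))
```

Under this reading, a grid that fits in a single wave could vary by up to 100%, while a grid spanning many waves averages out the difference between one wave and the next.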

I did not follow “waves of thread blocks” – I will try and look this up.

On the first 295 GTX GPU, on a compute bound kernel, the distribution of kernel clock() times,
following the example in 0_Simple/clock/clock.cu, has a standard deviation of about 5000 ticks.
The distribution itself looks vaguely Gaussian. (However I suspect there are more large deviations
than a Gaussian would predict.)

The example code in 0_Simple/clock/clock.cu assumes that the SMs’ clocks
(as recorded by clock()) are synchronised. tera says this is not true.
When the number of blocks is equal to the number of SMs,
would it be better to calculate the apparent elapsed time for each block (which uses its
own clock?) and take the worst (i.e. max) of these? (0_Simple/clock takes the max difference
between the start times and end times across all blocks.)

Also when are the clocks reset?
The 295 GTX start times are consistently close to zero; suggesting the clocks are reset
when a new kernel is launched.

Thanks again for your help

You definitely need to time each block individually if you use clock() or clock64(). That is also stated in the PTX manual.

If you are using the timing results for genetic evolution I would recommend using the sum of all blocks’ times, not the maximum. Apart from the obvious statistical smoothing effect, that would also help with the nondeterministic nature of memory accesses: the total bandwidth should stay about the same even if the order of accesses changes.
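
The difference between the two ways of reducing the per-block clock() readings can be sketched on the host side. The example below is a simulation with made-up numbers, not output from a GPU: every block is assumed to run for exactly 1000 ticks, but each SM's counter has a different (hypothetical) offset. The function names are also invented for this example.

```python
def global_span(starts, ends):
    """Latest end minus earliest start; only meaningful if all clocks agree."""
    return max(ends) - min(starts)


def per_block_elapsed(starts, ends):
    """Elapsed ticks per block, each measured against that block's own clock."""
    return [end - start for start, end in zip(starts, ends)]


# Hypothetical data: each block truly runs 1000 ticks, but the SM counters
# are offset from one another by the amounts below.
true_duration = 1000
clock_offsets = [400, 650, 0, 490]
starts = list(clock_offsets)
ends = [offset + true_duration for offset in clock_offsets]

elapsed = per_block_elapsed(starts, ends)
print("global span:", global_span(starts, ends))    # polluted by the offsets
print("max per-block:", max(elapsed))               # independent of the offsets
print("sum per-block:", sum(elapsed))               # independent of the offsets
```

The per-block differences cancel each counter's offset, so both the maximum and the sum are unaffected by how far the counters have drifted apart, whereas the global span absorbs the full drift.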

Dear tera,
I’m afraid I did not quite follow your advice at the time. Instead I have been basing
my code on stuff lifted from /usr/local/cuda-5.0/samples/0_Simple/clock/clock.cu
In the CUDA 5.0 samples (clock.cu) they go to great trouble to define the kernel elapsed
time as starting at the start of the first block and finishing when the last block terminates.
NB this does the opposite of the advice above and assumes the clocks in different SMs are
synchronised. This appears to be (approximately) true on the first kernel launch,
but as more kernels are launched (at least on my 295 GTX, particularly if the kernels do not
load the SMs evenly) the clock() values for each block drift apart. In my current example, after 635
launches they are effectively unrelated. (I.e. signal / noise > 1.0)

I have just posted an update to
but thought the above might be better here.

Thanks again