kernel execution time not constant

steffenS · March 1, 2017, 3:01pm

I have observed some strange behavior in kernel execution time. I repeatedly run a kernel and measure the kernel time using CUDA events. The kernel parameters (number of blocks, threads, shared memory size) do not change between calls.

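while (workToDo)  // pseudocode: loop while work remains
{
    // Record the start event on the default stream (stream 0)
    ceResult = cudaEventRecord(kernelStartEvent, 0);

    StartCudaKernel();

    // Record the stop event and wait until the kernel has finished
    ceResult = cudaEventRecord(kernelStopEvent, 0);
    ceResult = cudaEventSynchronize(kernelStopEvent);
    // Elapsed time between the two events, in milliseconds
    ceResult = cudaEventElapsedTime(&fCudaKernelTime, kernelStartEvent, kernelStopEvent);
}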

Running the kernel on a GeForce GTX 1050 Ti, I get almost constant kernel times.
When I run the same code on a GeForce GTX 750 Ti, the kernel times vary from 776 ms up to 1449 ms. Running it on a notebook with a Quadro M2000M gives the same unsteady results.

I use the kernel time measurement to estimate how much calculation work can be done within 2 seconds, so that the kernel does not run into the TDR timeout on Windows. With these varying execution times it is very hard to find an optimal workload. The problem was observed with both CUDA 5.5 and CUDA 8.0.
Until now, the kernel execution time scaled linearly with the workload: doubling the calculation work doubled the execution time. Have there been any changes to the driver or the architecture that cause this unsteady behavior?
Is there a workaround for achieving constant kernel times?
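
For illustration, here is a minimal host-side sketch of the workload estimate described above. It assumes that execution time scales linearly with the work size, and it sizes the next launch against the slowest run seen so far rather than the most recent one. The function name `safeWorkSize`, the `safetyFactor` parameter, and the notion of an abstract "work size" are hypothetical and not taken from the actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: scale the work size so that even the slowest
// observed kernel run would fit inside the time budget.
// Assumption: execution time grows linearly with the amount of work.
std::size_t safeWorkSize(std::size_t measuredWork,
                         const std::vector<float>& kernelTimesMs,
                         float budgetMs,
                         float safetyFactor)
{
    assert(safetyFactor > 0.0f && safetyFactor <= 1.0f);
    if (kernelTimesMs.empty() || measuredWork == 0) {
        return measuredWork;  // nothing measured yet, keep the current size
    }

    // Use the worst case, not the average, because a single slow run
    // is enough to trigger the watchdog.
    const float worstMs =
        *std::max_element(kernelTimesMs.begin(), kernelTimesMs.end());
    if (worstMs <= 0.0f) {
        return measuredWork;  // unusable measurement
    }

    const double scale =
        static_cast<double>(budgetMs) * safetyFactor / worstMs;
    const std::size_t scaled =
        static_cast<std::size_t>(static_cast<double>(measuredWork) * scale);
    return std::max<std::size_t>(1, scaled);
}
```

With the two extremes quoted above (776 ms and 1449 ms) measured for a work size of 1000, a 2000 ms budget, and a safety factor of 0.5, this yields a work size of 690. The safety factor trades throughput for headroom, so its value would have to be tuned per device.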

tera · March 1, 2017, 3:59pm

Usually varying kernel runtime is caused by other GPU activity (e.g. from the GUI). Windows kernel batching can also play a role.

A variation by more than half a second seems too much to be explained by that though.

njuffa · March 1, 2017, 8:41pm

Another factor could be automatic clock boosting. Some GPUs have a rather large difference between their default core clock and the highest possible boost clock. Certainly not a factor of two, though.

Without any knowledge of the code and without a way to reproduce the problem, we are limited to (wild) speculation here. A contributing issue may be measurement methodology. A best practice would be to measure performance at steady state, to minimize the potential impact of startup overheads, cache and TLB warmup, etc.
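
As a rough sketch of the steady-state idea, the host code could collect the per-launch times reported by `cudaEventElapsedTime`, discard the first few as warmup, and summarize the rest. The names `TimingSummary` and `summarizeSteadyState` and the choice of how many warmup runs to drop are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary of steady-state kernel timings, in milliseconds.
struct TimingSummary {
    float minMs;
    float medianMs;
    float maxMs;
};

// Hypothetical helper: drop the first `warmupRuns` samples (startup
// overhead, cache and TLB warmup) and summarize the remaining ones.
TimingSummary summarizeSteadyState(std::vector<float> timesMs,
                                   std::size_t warmupRuns)
{
    assert(timesMs.size() > warmupRuns);

    // Remove the warmup samples, which are in launch order.
    timesMs.erase(timesMs.begin(),
                  timesMs.begin() + static_cast<std::ptrdiff_t>(warmupRuns));

    // Sort the remaining samples to read off min, median, and max.
    std::sort(timesMs.begin(), timesMs.end());
    const std::size_t n = timesMs.size();
    const float median = (n % 2 == 1)
        ? timesMs[n / 2]
        : 0.5f * (timesMs[n / 2 - 1] + timesMs[n / 2]);

    return TimingSummary{timesMs.front(), median, timesMs.back()};
}
```

If the spread between `minMs` and `maxMs` stays large after the warmup runs are removed, startup effects alone probably do not explain the variation.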