Hi folks,
We also see unexpectedly large kernel startup times in the range of
5 ms. The hardware is a GTX260 with driver version 270.29, running under Ubuntu 10.04.2 64-bit,
and we looked at the oclTridiagonal example in the SDK.
for (int iCycles = 0; iCycles < BENCH_ITERATIONS; iCycles++) {
    shrDeltaT(0);  /* reset the SDK wall-clock timer */
    for (int i = 0; i < devCount; i++) {
        /* enqueue one kernel per device, collecting a completion event */
        errcode = clEnqueueNDRangeKernel(cqCommandQue[i], pcrKernel[i], 1, NULL,
                                         &szGlobalWorkSize[i], &szLocalWorkSize[i],
                                         0, NULL, &GPUExecution[i]);
        oclCheckError(errcode, CL_SUCCESS);
        clFlush(cqCommandQue[i]);  /* push the launch towards the device */
    }
    clWaitForEvents(devCount, GPUExecution);  /* block until all kernels finish */
    sum_time += shrDeltaT(0);  /* accumulate elapsed wall-clock time */
}
The NVIDIA computeprof reports something like 8 ms between the calls, while the kernel
itself executes for ~3.5 ms. From what I find in other postings I would expect tens
of microseconds to start a kernel, but these data suggest more like 5 ms, a
factor of a hundred worse. Is there anything wrong with this reasoning? As we
have kernels that don't run for very long, this is a real performance killer. Any
idea how to reduce this large gap would be highly appreciated.
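For cross-checking without computeprof: since the loop above already collects an event per kernel, one could query the OpenCL event profiling timestamps directly. A minimal sketch, assuming the command queues were created with CL_QUEUE_PROFILING_ENABLE (the helper name is ours):

#include <stdio.h>
#include <CL/cl.h>

/* Sketch: report how long a finished kernel sat in the queue before it
 * actually started on the GPU. Timestamps are device-side nanoseconds. */
static void printLaunchGap(cl_event ev)
{
    cl_ulong tQueued, tStart, tEnd;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &tQueued, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,  sizeof(cl_ulong), &tStart,  NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,    sizeof(cl_ulong), &tEnd,    NULL);
    printf("queued->start: %.1f us, start->end: %.1f us\n",
           (tStart - tQueued) / 1000.0, (tEnd - tStart) / 1000.0);
}

Calling printLaunchGap(GPUExecution[i]) after the clWaitForEvents would show the queued-to-start gap on GPU timestamps, independent of the external profiler.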
Addendum:
It is a matter of luck whether one gets API trace information from
computeprof. Usually one ends up with 'temp_compute_profiler_0_0.csv for
application run … not found'. We succeeded for the CUDA version of the
BlackScholes SDK example, and this provides evidence for the 5 ms kernel
launch time theory.
Most of the time between successive kernel calls is spent in cuLaunchGridAsync
(screenshot attached). Any information on whether this time depends on anything
we could influence would be very welcome.
Screenshot explanation: the x axis is time in µs. The four blue areas represent
kernel executions; the top-right yellow panel describes the green area between
the 3rd and 4th kernel calls. Note the stated duration of 5062.05 µs.
Of course, profiling has a heavy influence on timing. Switching profiling off and
simply measuring the time between enqueueing and finishing the computation reveals
more realistic times in the range of 1 ms.
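For completeness, this is roughly what that measurement looks like without the SDK timer helpers (a sketch; queue, kernel, global and local are placeholders for our actual handles):

#include <sys/time.h>
#include <CL/cl.h>

/* Sketch: wall-clock milliseconds from enqueue to completion of one kernel,
 * with the profiler switched off. */
static double timeOneLaunch(cl_command_queue queue, cl_kernel kernel,
                            size_t global, size_t local)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    clFinish(queue);  /* block until the kernel has completed */
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_usec - t0.tv_usec) / 1e3;
}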
Addendum 2:
Some more investigation done: the kernel launch latencies of ~30 µs reported
elsewhere only tell how long it takes until the CPU regains control after
enqueueing the GPU kernel. They do not tell how long it takes until the GPU
actually starts computing. The former takes only tens of microseconds; the
latter is a factor of a hundred larger. The fast return of control can only be
used sensibly if there is work to be done on the CPU that does not depend on
the data on the GPU.
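To illustrate (a sketch; all names, including cpuWork, are placeholders), the only way to profit from the early return is something like:

#include <CL/cl.h>

/* Sketch: overlap independent CPU work with the pending GPU kernel. */
static void launchAndOverlap(cl_command_queue queue, cl_kernel kernel,
                             size_t global, size_t local, void (*cpuWork)(void))
{
    cl_event done;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &done);
    clFlush(queue);            /* push the launch towards the device */
    cpuWork();                 /* CPU work that does not need GPU results */
    clWaitForEvents(1, &done); /* only now pay the remaining launch latency */
    clReleaseEvent(done);
}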
Conclusion: it always takes ~1 ms before a submitted kernel starts working; with
profiling switched on, one finds values in the range of ~4 ms. It does not make
sense to invest in kernel optimisation if the kernel runs for only a fraction of
a millisecond.
Martin