kernel launch time way too long

I’m trying to convert some code to OpenCL on Windows 7.

Timing the following three lines of code on a Tesla C1060 takes about 1.2 ms:

size_t globalSz[2] = {320, 240};
clEnqueueNDRangeKernel(queue, Test, 2, NULL, globalSz, NULL, 0, NULL, NULL);
clFinish(queue);

I believe that kernel launch overhead should be more in the area of 30 us (at least those are the times I see with CUDA), so I’m guessing that I’m doing something wrong. The kernel itself is empty, so I should only be seeing the kernel launch overhead. The original C code runs in about 1 ms, so 1.2 ms of kernel launch overhead is unacceptable. Any ideas what I may be doing wrong?
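For reference, the empty kernel is nothing more than a stub along these lines (assuming the kernel source name matches the Test handle in the launch call):

__kernel void Test() { }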

Thanks

I hope you are doing a clFinish(queue) before this block, so that earlier enqueued work does not get counted in the measurement:

size_t globalSz[2] = {320, 240};
clEnqueueNDRangeKernel(queue, Test, 2, NULL, globalSz, NULL, 0, NULL, NULL);
clFinish(queue);

How are you timing this? You can associate an event with the kernel launch and time it using the profiling features provided by OpenCL.
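A minimal sketch of that approach, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and reusing the names from the post above:

cl_event evt;
size_t globalSz[2] = {320, 240};
clEnqueueNDRangeKernel(queue, Test, 2, NULL, globalSz, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong start, end;  /* device timestamps, in nanoseconds */
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
printf("kernel execution on the device: %f us\n", (end - start) / 1000.0);
clReleaseEvent(evt);

Since those timestamps are taken on the device, they exclude any host-side overhead; comparing them with your CPU-side measurement should show where the 1.2 ms actually goes.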

I’m timing this with a profiler that measures CPU elapsed time between programmable events (which is what I’m interested in). I’m averaging over multiple runs and ignoring the first run because of its extra overhead. I also made sure to put a clFinish before timing.
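For comparison, a host-side measurement along those lines might look like the sketch below. QueryPerformanceCounter is used because this is Windows 7; the run count of 100 is arbitrary, and queue and Test are the handles from my snippet above.

#include <windows.h>
#include <CL/cl.h>

/* Average time in microseconds for one enqueue + clFinish of the empty kernel. */
static double time_empty_kernel(cl_command_queue queue, cl_kernel Test)
{
    const int runs = 100;
    size_t globalSz[2] = {320, 240};
    LARGE_INTEGER freq, t0, t1;

    /* warm-up launch, excluded from the average */
    clEnqueueNDRangeKernel(queue, Test, 2, NULL, globalSz, NULL, 0, NULL, NULL);
    clFinish(queue);

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    for (int i = 0; i < runs; i++) {
        clEnqueueNDRangeKernel(queue, Test, 2, NULL, globalSz, NULL, 0, NULL, NULL);
        clFinish(queue);
    }
    QueryPerformanceCounter(&t1);

    return (t1.QuadPart - t0.QuadPart) * 1e6 / (double)freq.QuadPart / runs;
}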

Have you tried comparing your results and/or timestamps with the NVIDIA compute profiler? That might help explain the behavior.

Additionally, it might be helpful to know whether your context spans multiple devices, and with which flags you created your buffers.

Regards,
Markus

Hi folks,

We also see unexpectedly large kernel startup times, in the range of 5 ms. Hardware is a GTX 260 with driver version 270.29, running under Ubuntu 10.04.2 64-bit, and we looked at the oclTridiagonal example in the SDK.

for (int iCycles = 0; iCycles < BENCH_ITERATIONS; iCycles++) {
    shrDeltaT(0);  // reset the SDK stopwatch
    for (int i = 0; i < devCount; i++) {
        errcode = clEnqueueNDRangeKernel(cqCommandQue[i], pcrKernel[i], 1, NULL, &szGlobalWorkSize[i], &szLocalWorkSize[i], 0, NULL, &GPUExecution[i]);
        clFlush(cqCommandQue[i]);
        oclCheckError(errcode, CL_SUCCESS);
    }
    clWaitForEvents(devCount, GPUExecution);  // block until all devices have finished
    sum_time += shrDeltaT(0);                 // accumulate elapsed wall-clock time
}

The NVIDIA computeprof reports something like 8 ms between the calls, while the kernel itself executes in ~3.5 ms. From what I find in other postings I would expect tens of microseconds to start up a kernel, but these data suggest rather 5 ms, a factor of a hundred worse. Is there anything wrong with this reasoning? As we have kernels that don’t run for very long, this is a real performance killer. Any idea how to reduce the large gap would be highly appreciated.

Addendum:

It is a matter of luck whether one gets API trace information from the computeprof; usually one ends up with 'temp_compute_profiler_0_0.csv for application run … not found'. We succeeded with the CUDA version of the BlackScholes SDK example, and this provides evidence for the 5 ms kernel launch time theory.

Most of the time between successive kernel calls is spent in cuLaunchGridAsync (screen shot attached). Any information on whether this time depends on anything we could influence would be very welcome.

Screen shot explanation: the x axis is time in us. The four blue areas represent kernel execution; the top right yellow panel describes the green area between the 3rd and 4th kernel calls. Observe the given duration of 5062.05 us.

Of course, profiling has a heavy influence on timing. Switching profiling off and simply measuring the time between enqueuing and finishing the computation reveals more realistic times, in the range of 1 ms.

Addendum 2:

Some more investigation done: the kernel launch latencies of 30 us reported elsewhere only tell how long it takes until the CPU regains control after enqueueing the GPU kernel. They do not tell how long it takes until the GPU actually starts computing. The former takes only tens of microseconds; the latter is a factor of a hundred larger. The fast return of control can only be used sensibly if there is work to be done on the CPU that doesn’t depend on the data on the GPU.
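To separate those two numbers explicitly, one can compare the host time at which the enqueue call returns with the event's queued-to-start interval. A sketch under the assumption that the queue was created with CL_QUEUE_PROFILING_ENABLE; now_us() is a hypothetical helper wrapping clock_gettime, and the variable names follow the oclTridiagonal loop above.

double t0 = now_us();
errcode = clEnqueueNDRangeKernel(cqCommandQue[0], pcrKernel[0], 1, NULL, &szGlobalWorkSize[0], &szLocalWorkSize[0], 0, NULL, &GPUExecution[0]);
clFlush(cqCommandQue[0]);
double t1 = now_us();  /* the CPU has regained control here */

clWaitForEvents(1, &GPUExecution[0]);

cl_ulong queued, started;  /* device timestamps, in nanoseconds */
clGetEventProfilingInfo(GPUExecution[0], CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &queued, NULL);
clGetEventProfilingInfo(GPUExecution[0], CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &started, NULL);

printf("enqueue call returned after:  %.1f us\n", t1 - t0);
printf("GPU started the kernel after: %.1f us\n", (started - queued) / 1000.0);

The first number corresponds to the ~30 us launch latency quoted elsewhere; the second is the delay we are actually concerned about.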

Conclusion: it always takes ~1 ms before a submitted kernel starts working; with profiling switched on, one finds values in the range of ~4 ms. It does not make sense to invest in kernel optimisation if the kernel only runs for a fraction of a millisecond.

Martin