kernel launch time way too long

I’m trying to convert some code to OpenCL on Windows 7.

Timing the following three lines of code on a Tesla C1060 takes about 1.2 ms:

size_t globalSz[2] = {320, 240};
clEnqueueNDRangeKernel(queue, Test, 2, NULL, globalSz, NULL, 0, NULL, NULL);
clFinish(queue);

I believe that kernel launch overhead should be more in the area of 30 us (at least those are the times I see with CUDA), so I’m guessing that I’m doing something wrong. The kernel itself is empty, so I should only be seeing the kernel launch overhead. The original C code runs in about 1 ms, so 1.2 ms of kernel launch overhead is unacceptable. Any ideas what I may be doing wrong?
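For reference, the empty kernel is nothing more than a stub along these lines (assuming the kernel source name matches the Test handle in the launch call):

__kernel void Test() { }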

Thanks

I hope you are doing a clFinish(queue) before this block, so that earlier enqueued work does not get counted in the measurement:

size_t globalSz[2] = {320, 240};
clEnqueueNDRangeKernel(queue, Test, 2, NULL, globalSz, NULL, 0, NULL, NULL);
clFinish(queue);

How are you timing this? You can associate an event with the kernel launch and time it using the profiling features provided by OpenCL.
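A minimal sketch of that approach, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and reusing the names from the post above:

cl_event evt;
size_t globalSz[2] = {320, 240};
clEnqueueNDRangeKernel(queue, Test, 2, NULL, globalSz, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong start, end;  /* device timestamps, in nanoseconds */
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
printf("kernel execution on the device: %f us\n", (end - start) / 1000.0);
clReleaseEvent(evt);

Since those timestamps are taken on the device, they exclude any host-side overhead; comparing them with your CPU-side measurement should show where the 1.2 ms actually goes.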

I’m timing this with a profiler that measures CPU elapsed time between programmable events (which is what I’m interested in). I’m averaging over multiple runs and ignoring the first run because of its extra overhead. I also made sure to put a clFinish before timing.
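For comparison, a host-side measurement along those lines might look like the sketch below. QueryPerformanceCounter is used because this is Windows 7; the run count of 100 is arbitrary, and queue and Test are the handles from my snippet above.

#include <windows.h>
#include <CL/cl.h>

/* Average time in microseconds for one enqueue + clFinish of the empty kernel. */
static double time_empty_kernel(cl_command_queue queue, cl_kernel Test)
{
    const int runs = 100;
    size_t globalSz[2] = {320, 240};
    LARGE_INTEGER freq, t0, t1;

    /* warm-up launch, excluded from the average */
    clEnqueueNDRangeKernel(queue, Test, 2, NULL, globalSz, NULL, 0, NULL, NULL);
    clFinish(queue);

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    for (int i = 0; i < runs; i++) {
        clEnqueueNDRangeKernel(queue, Test, 2, NULL, globalSz, NULL, 0, NULL, NULL);
        clFinish(queue);
    }
    QueryPerformanceCounter(&t1);

    return (t1.QuadPart - t0.QuadPart) * 1e6 / (double)freq.QuadPart / runs;
}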

Have you tried comparing your results and/or timestamps with the NVIDIA compute profiler? That might help explain the behavior.

Additionally, it might be helpful to know whether your context spans multiple devices, and with which flags you created your buffers.

Regards,
Markus

Hi folks,

We also see unexpectedly large kernel startup times, in the range of 5 ms. Hardware is a GTX 260 with driver version 270.29, running under Ubuntu 10.04.2 64-bit, and we looked at the oclTridiagonal example in the SDK.

for (int iCycles = 0; iCycles < BENCH_ITERATIONS; iCycles++) {
    shrDeltaT(0);  // reset the SDK stopwatch
    for (int i = 0; i < devCount; i++) {
        errcode = clEnqueueNDRangeKernel(cqCommandQue[i], pcrKernel[i], 1, NULL, &szGlobalWorkSize[i], &szLocalWorkSize[i], 0, NULL, &GPUExecution[i]);
        clFlush(cqCommandQue[i]);
        oclCheckError(errcode, CL_SUCCESS);
    }
    clWaitForEvents(devCount, GPUExecution);  // block until all devices have finished
    sum_time += shrDeltaT(0);                 // accumulate elapsed wall-clock time
}

The NVIDIA computeprof reports something like 8 ms between the calls, while the kernel itself executes in ~3.5 ms. From what I find in other postings I would expect tens of microseconds to start up a kernel, but these data suggest rather 5 ms, a factor of a hundred worse. Is there anything wrong with this reasoning? As we have kernels that don’t run for very long, this is a real performance killer. Any idea how to reduce the large gap would be highly appreciated.

Addendum:

It is a matter of luck whether one gets API trace information from the computeprof; usually one ends up with 'temp_compute_profiler_0_0.csv for application run … not found'. We succeeded with the CUDA version of the BlackScholes SDK example, and this provides evidence for the 5 ms kernel launch time theory.

Most of the time between successive kernel calls is spent in cuLaunchGridAsync (screen shot attached). Any information on whether this time depends on anything we could influence would be very welcome.

Screen shot explanation: the x axis is time in us. The four blue areas represent kernel execution; the top right yellow panel describes the green area between the 3rd and 4th kernel calls. Observe the given duration of 5062.05 us.

Of course, profiling has a heavy influence on timing. Switching profiling off and simply measuring the time between enqueuing and finishing the computation reveals more realistic times, in the range of 1 ms.

Addendum 2:

Some more investigation done: the kernel launch latencies of 30 us reported elsewhere only tell how long it takes until the CPU regains control after enqueueing the GPU kernel. They do not tell how long it takes until the GPU actually starts computing. The former takes only tens of microseconds; the latter is a factor of a hundred larger. The fast return of control can only be used sensibly if there is work to be done on the CPU that doesn’t depend on the data on the GPU.
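To separate those two numbers explicitly, one can compare the host time at which the enqueue call returns with the event's queued-to-start interval. A sketch under the assumption that the queue was created with CL_QUEUE_PROFILING_ENABLE; now_us() is a hypothetical helper wrapping clock_gettime, and the variable names follow the oclTridiagonal loop above.

double t0 = now_us();
errcode = clEnqueueNDRangeKernel(cqCommandQue[0], pcrKernel[0], 1, NULL, &szGlobalWorkSize[0], &szLocalWorkSize[0], 0, NULL, &GPUExecution[0]);
clFlush(cqCommandQue[0]);
double t1 = now_us();  /* the CPU has regained control here */

clWaitForEvents(1, &GPUExecution[0]);

cl_ulong queued, started;  /* device timestamps, in nanoseconds */
clGetEventProfilingInfo(GPUExecution[0], CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &queued, NULL);
clGetEventProfilingInfo(GPUExecution[0], CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &started, NULL);

printf("enqueue call returned after:  %.1f us\n", t1 - t0);
printf("GPU started the kernel after: %.1f us\n", (started - queued) / 1000.0);

The first number corresponds to the ~30 us launch latency quoted elsewhere; the second is the delay we are actually concerned about.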

Conclusion: it always takes ~1 ms before a submitted kernel starts working; with profiling switched on, one finds values in the range of ~4 ms. It does not make sense to invest in kernel optimisation if the kernel only runs for a fraction of a millisecond.

Martin