overhead between two successive kernel calls


why two successive kernel calls in my code have an overhead of 0.6 ms (kernel2_start - kernel2_finish = 0.6ms) on a K20 card?

code snippet:

kBinAtoms2D<<<2, 1024>>>(st, grd);
sortCoordArray<<<dimGrid2, 1024, sMemSize2>>>(st.cx, grd);

Anybody has a clue on that…?

What’s your OS platform?

It is Linux : Ubuntu 12.04

this is the visual profiler output. Note that the overhead is the same regardless the size of data passed in the kernels to process…!!

This is not my area of expertise (and I never use the visual profiler), but use of the profiler may well add latency to kernel calls due to instrumentation overhead. I am not clear as to what it is you are measuring. Kernel launches are asynchronous and host side measurements will only give the time to launch the kernel (insert it into the work queue), which is independent of the amount of work the kernel itself performs. Using a high-precision timer gettimeofday() at steady state, I measure 4 microseconds kernel launch time on 64-bit RHEL with a K20c.

To measure actual kernel execution time from the host you can add a call to cudaThreadSynchronize() after the kernel launch, this makes the host wait for the kernel to signal completion. The round-trip overhead on my system (as determined by launching empty kernels) is around 14 microseconds, so one would have to subtract that out to get accurate measurements for short running kernels.


The only information that you have provided is that the you are executing a series of kernels on a k20c on linux. The average CPU launch overhead is 4-8 µs. A value of 600 µs is significant and out of range for a normal launch.

If this delay occurs for every launch of sortCoordArr* then determine what explicit launch state is changed each launch and determine if reducing the state changes reduces the overhead. State changes include stack size, print fifo size, texture bindings, etc.

If this delay occurs for only the first launch of sortCoorArr* and the delay can be repeated on every run of the process then investigate what lazy initialization or updates the driver may be doing by looking at what features or additional resources are used by this kernel. Features that have lazy initialization include heap allocation (device malloc/free/new/delete), print fifo allocation (printf), and CUDA dynamic parallelism.

If this occurs sporadically then the OS thread scheduler may have simply context switched out the thread. It’s possible to use xperf to investigate this issue.

The CUDA profilers introduce additional overhead on to each API call and kernel launch. The overhead is .5 - 1 µs for most API calls, 5-10 µs for concurrent kernel launches, and varies on resource initialization. CUDA 5.5 reduces the overhead. I do not think the overhead is due to the profiler. If the profiler is going to introduce high overhead it will provide additional information in the Profiling Overhead row in the timeline.

Greg, nJuffa

you are right…when i actually profiled the code in simulation conditions (ie. algorithm running in a infinite loop), the overhead disappeared completely. So in the above instance, the driver was (most probably) very lazily allocating memory. In the simulation the overhead is present on the first occurrence of the two successive kernels. Then it disappears…