overhead between two successive kernel calls

thanasio · July 5, 2013, 8:18pm

Hi,

why two successive kernel calls in my code have an overhead of 0.6 ms (kernel2_start - kernel2_finish = 0.6ms) on a K20 card?

code snippet:

kBinAtoms2D<<<2, 1024>>>(st, grd);
sortCoordArray<<<dimGrid2, 1024, sMemSize2>>>(st.cx, grd);

Anybody has a clue on that…?

njuffa · July 5, 2013, 9:38pm

What’s your OS platform?

thanasio · July 6, 2013, 7:30am

It is Linux : Ubuntu 12.04

thanasio · July 6, 2013, 7:31am

this is the visual profiler output. Note that the overhead is the same regardless the size of data passed in the kernels to process…!!

njuffa · July 6, 2013, 5:26pm

This is not my area of expertise (and I never use the visual profiler), but use of the profiler may well add latency to kernel calls due to instrumentation overhead. I am not clear as to what it is you are measuring. Kernel launches are asynchronous and host side measurements will only give the time to launch the kernel (insert it into the work queue), which is independent of the amount of work the kernel itself performs. Using a high-precision timer gettimeofday() at steady state, I measure 4 microseconds kernel launch time on 64-bit RHEL with a K20c.

To measure actual kernel execution time from the host you can add a call to cudaThreadSynchronize() after the kernel launch, this makes the host wait for the kernel to signal completion. The round-trip overhead on my system (as determined by launching empty kernels) is around 14 microseconds, so one would have to subtract that out to get accurate measurements for short running kernels.

Greg · July 6, 2013, 11:20pm

@thanasio,

The only information that you have provided is that the you are executing a series of kernels on a k20c on linux. The average CPU launch overhead is 4-8 µs. A value of 600 µs is significant and out of range for a normal launch.

If this delay occurs for every launch of sortCoordArr* then determine what explicit launch state is changed each launch and determine if reducing the state changes reduces the overhead. State changes include stack size, print fifo size, texture bindings, etc.

If this delay occurs for only the first launch of sortCoorArr* and the delay can be repeated on every run of the process then investigate what lazy initialization or updates the driver may be doing by looking at what features or additional resources are used by this kernel. Features that have lazy initialization include heap allocation (device malloc/free/new/delete), print fifo allocation (printf), and CUDA dynamic parallelism.

If this occurs sporadically then the OS thread scheduler may have simply context switched out the thread. It’s possible to use xperf to investigate this issue.

The CUDA profilers introduce additional overhead on to each API call and kernel launch. The overhead is .5 - 1 µs for most API calls, 5-10 µs for concurrent kernel launches, and varies on resource initialization. CUDA 5.5 reduces the overhead. I do not think the overhead is due to the profiler. If the profiler is going to introduce high overhead it will provide additional information in the Profiling Overhead row in the timeline.

thanasio · July 7, 2013, 8:44am

Greg, nJuffa

you are right…when i actually profiled the code in simulation conditions (ie. algorithm running in a infinite loop), the overhead disappeared completely. So in the above instance, the driver was (most probably) very lazily allocating memory. In the simulation the overhead is present on the first occurrence of the two successive kernels. Then it disappears…

cheers…!!

Topic		Replies	Views
kernel call overhead: timing results overhead is large for small # of calls CUDA Programming and Performance	16	8031	March 8, 2013
Kernel execution overhead CUDA Programming and Performance	2	1205	July 6, 2009
Kernel enqueue overhead Bringing kernel overhead down? CUDA Programming and Performance	9	13892	March 12, 2010
Kernel Overhead/Profiler Accuracy CUDA Programming and Performance	4	6481	May 25, 2008
Why is there 10uS between kernel launches? CUDA Programming and Performance	2	3878	August 6, 2010
kernel launch overhead Legacy PGI Compilers	8	12707	July 24, 2014
reduce overhead of launching a new thread block CUDA Programming and Performance	15	4909	February 15, 2018
fundamental cuda kernel launch questions CUDA Programming and Performance	2	16559	July 31, 2008
Kernel Launch Time (CPU Time) Reported in Visual Profiler how to optimize kernel launch CUDA Programming and Performance	1	729	July 7, 2011
kernel launch overhead for GTX 280 CUDA Programming and Performance	17	3912	November 5, 2009

overhead between two successive kernel calls

Related topics