Kernel enqueue overhead: Bringing kernel overhead down?

madsen · March 11, 2010, 9:11am

Hi!

I measure an overhead of about 40 µs on my system when starting a kernel. Is this normal, and can I do anything to bring it down?
My kernel takes about 60 µs to execute, so the overhead is quite large!

I do not do any transfers between kernels, but I enqueue millions of kernels as fast as I can from the CPU. I have tried running multiple steps of the algorithm in a single kernel on the GPU, which obviously helps a lot, but it fails for problems needing more than one work group, since I need “global” synchronization before continuing to the next step.

(The same code takes about 100 µs to execute on the CPU, so the GPU is 2x faster in the actual computation, but because of the kernel launch overhead the two come out tied in total.)

Best Regards,
Madsen

kreon · March 11, 2010, 9:34am

IMHO it’s the time taken to build the program from source. Check whether you can build it from a binary instead.

madsen · March 11, 2010, 9:48am

But my build command is executed outside the timing loop?

kreon · March 11, 2010, 10:06am

How did you measure the kernel launch overhead and the kernel execution time itself, then?

madsen · March 11, 2010, 10:27am

I measured the kernel execution time using the profiling option on the command queue. This gave:

Execution time (only kernel launch loop): 13.44s (c++ timer)

Kernel Execution Time: 7.30474s total, 7.30467e-05s per kernel launch

Then I turn off the profiling and measure the execution time:

Execution time (only kernel launch loop): 11.43s (c++ timer)

Since 100001 kernels were executed, the overhead of launching the kernels must be

(11.43 - 7.30) s / 100001 = 41 µs per kernel launch

So either the kernel launch takes a significant amount of time, the profiling gives wrong results, or the CPU can’t feed the queue fast enough?

Launching 100001 “empty” kernels gives (with profiling):

Kernel Execution Time: 0.45063s total, 4.50626e-06s per kernel launch

And without profiling:

Execution time (only kernel launch loop): 4.56s (c++ timer)
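
The same subtraction can be applied to both runs to see whether the overhead depends on the work the kernel does. The following is only a sketch of that arithmetic using the figures quoted in this post; the function name `launch_overhead_us` is made up for illustration, and, as in the calculation above, it assumes the wall-clock time without profiling can be compared directly with the kernel time reported by the profiler.

```python
def launch_overhead_us(wall_s, kernel_s, launches):
    """Per-launch overhead in microseconds.

    Overhead is taken to be the wall-clock time of the launch loop
    that is not accounted for by profiled kernel execution time.
    """
    return (wall_s - kernel_s) / launches * 1e6


LAUNCHES = 100001

# Figures from the post: wall time measured without profiling,
# kernel time taken from the profiler.
real_kernel = launch_overhead_us(11.43, 7.30474, LAUNCHES)
empty_kernel = launch_overhead_us(4.56, 0.45063, LAUNCHES)

print(f"real kernel:  {real_kernel:.1f} us overhead per launch")
print(f"empty kernel: {empty_kernel:.1f} us overhead per launch")
```

Both cases come out at roughly 41 µs per launch, which suggests (under the assumption above) that the overhead is a fixed cost per enqueue rather than something that scales with the kernel body.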

kreon · March 11, 2010, 1:07pm

Interesting research. It looks like we can’t profit from simple kernels, because their launch overhead will outweigh their SIMD advantage. You are not using a beta driver, are you?

madsen · March 11, 2010, 1:25pm

I use this:

(II) NVIDIA GLX Module 195.17 Mon Oct 26 08:26:05 PST 2009

I think it is a beta. Would the latest non-beta “190.53 Certified” be better? There seems to be a newer beta called 195.30, which I could try.

I just found out that if I remove the 7 image2d_t arguments I have for the kernel, the latency is cut in half!

kreon · March 11, 2010, 3:39pm

The latest driver, 195.36.08, which will be available soon, doubles the performance for me.

jcpalmer · March 11, 2010, 7:22pm

I wonder if your numbers will change if you warm the kernel up first, meaning do a couple of runs off the clock before timing. The first run of a kernel seems to run horribly for me.
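
A warm-up of this kind could be structured roughly as follows. This is a generic timing sketch, not the poster's code: `time_per_call` and `fake_launch` are hypothetical names, and `fake_launch` merely stands in for "enqueue the kernel and wait for it to finish".

```python
import time


def time_per_call(launch, iterations, warmup=3):
    """Average wall-clock seconds per call, excluding `warmup` untimed calls."""
    for _ in range(warmup):
        launch()  # untimed: absorbs one-time costs such as lazy initialization
    start = time.perf_counter()
    for _ in range(iterations):
        launch()
    return (time.perf_counter() - start) / iterations


# Hypothetical stand-in for a real kernel launch; it only records the call.
calls = []


def fake_launch():
    calls.append(1)


avg_seconds = time_per_call(fake_launch, iterations=1000)
print(f"{avg_seconds * 1e6:.3f} us per call, {len(calls)} calls in total")
```

With a large iteration count, a slow first run changes the average only slightly, so the warm-up matters most when few launches are timed.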

madsen · March 12, 2010, 7:24am

The numbers I posted were the average kernel execution time over 100001 executions.
