Slow loading kernel to GPU

Hi guys,

I’m just getting my feet wet with this, and I’m kind of surprised how long it takes to launch a kernel on the GPU. I have a function that takes 40 microseconds to execute on a Core 2 Quad at 3 GHz, threaded with OpenMP and vectorized using the Intel compiler. I did a very naive threading of it in CUDA with 512 threads and was surprised to see it come in at 60 microseconds. So I created a blank kernel function

    // setup execution parameters
    dim3 grid(1, 1, 1);
    dim3 threads(num_threads, 1, 1);

    for (int i = 0; i < 100000; i++)
    {
        blank<<<grid, threads>>>();
    }

and

    __global__ void blank()
    {
    }

and this alone takes (when averaged) 30 microseconds per call. So the actual computation in the naive implementation was faster, but the overall function call was slower because of that overhead. Now, this function gets run hundreds of millions of times, so there is room for savings. But some logic from a very complicated class has to be applied after each iteration, so I can’t group the iterations inside the kernel function. So I have a couple of questions.

  1. Is this launch time normal?

  2. Is there any way to get around it?

I am using the beta SDK on Windows Vista with a GeForce 8600 GT and the beta driver from the CUDA 2.0 SDK download. If you can help, I would greatly appreciate it.

~ Steve
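
A self-contained sketch of this kind of launch-overhead measurement, assuming CUDA event timing (the event-based timing code below is an illustration, not code from the post above):

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void blank() { }

    int main()
    {
        const int num_launches = 100000;
        dim3 grid(1, 1, 1);
        dim3 threads(512, 1, 1);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        for (int i = 0; i < num_launches; i++)
            blank<<<grid, threads>>>();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);        // wait until every launch has completed

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("average per-launch overhead: %f us\n", ms * 1000.0f / num_launches);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }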

The overhead you are seeing is fairly typical. You need to batch more work into each kernel call to get any benefit from CUDA. If you have to, you could launch no more blocks than there are multiprocessors and use atomic operations in global memory to synchronize across thread blocks.
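
As a minimal sketch of the batching idea (an illustration only, restricted to a single block so that __syncthreads() is enough to order the iterations; the cross-block atomic scheme mentioned above is more involved and not shown):

    // One launch covers many iterations instead of one launch per iteration.
    __global__ void iterate(float *data, int num_iterations)
    {
        for (int iter = 0; iter < num_iterations; iter++)
        {
            // the per-iteration work for this thread would go here
            data[threadIdx.x] += 1.0f;

            __syncthreads();   // all threads finish this iteration before the next starts
        }
    }

    // launched once instead of num_iterations times:
    //     iterate<<<1, 512>>>(d_data, 100000);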

Thanks, I guess I had thought the overhead would be similar to OpenMP, which really isn’t so bad.

Yeah, it’s the difference between context switching and synchronization among threads on the CPU, and making calls out to a card sitting on the much slower, higher-latency PCI Express bus.

I think I was asking essentially the same question over there.

So, … is there a difference between using the runtime API:

    for (int i = 0; i < N; ++i) {
        blank<<<grid, threads>>>();
    }

and direct use of the driver API:

    cuModuleLoad(&mod, …);
    cuModuleGetFunction(&func, mod, …);

    for (int i = 0; i < N; ++i) {
        cuLaunchGrid(func, …);
    }

Naturally, that cannot help with the PCI-e latency, but hopefully the runtime is already smart about this. (Is it?)
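
For comparison, a fuller sketch of what the driver-API loop might look like (the module file name "blank.cubin" and kernel name "blank" are placeholders, and error checking is omitted):

    #include <cuda.h>

    int num_threads = 512;   // block size from the earlier example
    int N = 100000;          // number of launches to time

    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction func;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoad(&mod, "blank.cubin");            // hypothetical module name
    cuModuleGetFunction(&func, mod, "blank");     // hypothetical kernel name

    cuFuncSetBlockShape(func, num_threads, 1, 1); // threads per block

    for (int i = 0; i < N; ++i) {
        cuLaunchGrid(func, 1, 1);                 // 1x1 grid of blocks
    }
    cuCtxSynchronize();                           // wait for the launches to drain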

Machine: Sun Ultra 40 M2 / CentOS 5.1 / CUDA 1.1
I get 12.6 us per empty kernel call. That’s a little faster than your 30, but nothing to be excited about. I’m going to try CUDA 2.0 a little later and see if the situation is any different.

The unfortunate truth of GPU computing is that GPUs are ill-suited to “small” problems because of this launch overhead. Run a kernel that takes several milliseconds and the overhead is minuscule in comparison.

You mention that a complicated class needs to perform operations in between each iteration. Does that mean you also need to copy some data to/from the device each time? That will hurt even more than the kernel call overhead.

I’m on 2.0, so maybe it was better under 1.1.

The portion I need to move back is only 16 bytes. 12.6 us would even give me hope that I could get a 25% speedup if I really optimized, and 25% off of 5 hours would be great. I hope they eventually move this into a high-level API similar to OpenMP that can be used from C++.
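
Just to show the scale of that transfer, a 16-byte result coming back after each kernel call could be a single cudaMemcpy (the float4 result type and the d_result pointer are assumptions for illustration):

    float4 result;                                // 16 bytes on the host
    cudaMemcpy(&result, d_result, sizeof(float4),
               cudaMemcpyDeviceToHost);           // blocks until the preceding kernel is done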

On the same machine, CUDA 2.0 improves the time to ~11.9 us (yes, this was tested over multiple runs; fluctuations are +/- 0.2 us).

Thanks for the measurement; maybe it’s a BIOS issue.

It is Vista vs. Linux.

Thanks for answering. Is there any chance that it will improve to a similar level, or is it an inherent Vista limitation?

Vista support is still in beta, so there is room for improvement. Some limitations come from the OS, but we are working to get the best possible performance.