Slow loading kernel to GPU

Hi guys,

I’m just getting my feet wet with this, and I’m kind of surprised how long it takes to launch a kernel on the GPU. I have a function that takes 40 microseconds to execute on a Core 2 Quad at 3 GHz, threaded with OpenMP and vectorized using the Intel compiler. I did a very naive port to CUDA with 512 threads and was surprised to see it come in at 60 microseconds. So, I created a blank kernel function:

    // setup execution parameters
    dim3 grid(1, 1, 1);
    dim3 threads(num_threads, 1, 1);

    for (int i = 0; i < 100000; i++)
        blank<<<grid, threads>>>();

where the kernel does nothing:

    __global__ void blank() {}
and this alone takes (when averaged) 30 microseconds per call. So the naive implementation did the actual work faster, but the launch overhead made the overall call slower. Now, this function gets run hundreds of millions of times, so there is room for savings. But some logic from a very complicated class has to be applied after each iteration, so I can’t group the iterations into one kernel call. So, I have a couple of questions.

  1. Is this load time normal?

  2. Is there any way to get around this?

I am using the beta SDK on Windows Vista with a GeForce 8600 GT and the beta driver from the 2.0 SDK download. If you can help, I would greatly appreciate it.

~ Steve

The overhead that you are seeing is fairly typical. You need to batch more work in a kernel call to get any benefit from CUDA. If you have to, you could launch no more blocks than multiprocessors and use atomic operations in global memory to synchronize across thread blocks.
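For example, instead of launching once per iteration, the loop can often be folded into the kernel. This is a sketch only: `step_batched` and the per-iteration work are placeholders, and it assumes each iteration needs at most block-level synchronization (cross-block synchronization would need the atomic-on-global-memory approach mentioned above).

```cuda
// Instead of:  for (int i = 0; i < N; ++i) step<<<grid, threads>>>(i);
// batch the iterations inside one kernel launch:
__global__ void step_batched(int n_iter)
{
    for (int i = 0; i < n_iter; ++i) {
        // ... per-iteration work for this thread (placeholder) ...
        __syncthreads();   // note: synchronizes within one block only
    }
}

// One launch amortizes the ~tens-of-microseconds overhead over n_iter iterations:
// step_batched<<<grid, threads>>>(N);
```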

Thanks. I guess I had thought the overhead would be similar to OpenMP, which really isn’t so bad.

Yeah, it’s the difference between context switching and synchronization among threads on the CPU, and making calls out to a card sitting on the much slower and higher latency PCI Express bus.

I think I was asking essentially the same over there.

So, … is there a difference (at runtime) between the runtime API:

    for (int i = 0; i < N; ++i)
        blank<<<grid, threads>>>();

and direct use of the driver API:

    cuModuleLoad(&mod, "blank.cubin");   // module name is just an example
    cuModuleGetFunction(&func, mod, "blank");
    for (int i = 0; i < N; ++i)
        cuLaunchGrid(func, 1, 1);

Naturally, that cannot help with PCI-e, and hopefully the runtime is smart. (Is it?)

Machine: Sun Ultra 40 M2 / CentOS 5.1 / CUDA 1.1
I get 12.6us time per empty kernel call. That’s a little faster than your 30, but nothing to be excited about. I’m going to try CUDA 2.0 a little later and see if the situation is any different.
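For reference, a minimal way to measure this per-launch overhead yourself (a sketch; it averages over many empty launches, with a warm-up launch first so one-time initialization cost isn’t counted, and it needs a CUDA-capable machine to run):

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void blank() {}

int main()
{
    const int N = 100000;

    blank<<<1, 1>>>();          // warm-up: absorbs one-time context/init cost
    cudaThreadSynchronize();

    timeval t0, t1;
    gettimeofday(&t0, 0);
    for (int i = 0; i < N; ++i)
        blank<<<1, 1>>>();
    cudaThreadSynchronize();    // wait until all queued launches have drained
    gettimeofday(&t1, 0);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("%.1f us per empty kernel launch\n", us / N);
    return 0;
}
```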

The unfortunate truth of GPU computing is that they are ill-suited to “small” problems because of this launch overhead. Run a kernel that takes several milliseconds and the overhead is minuscule in comparison.

You mention that a complicated class needs to perform operations in between each iteration. Does that mean you also need to copy some data to/from the device each time? That will hurt even more than the kernel call overhead.

I’m on 2.0, so maybe it was better under 1.1.

The portion I need to move back is only 16 bytes. 12.6 would even give me hope that I could get a 25% speedup if I really optimized, and 25% off of 5 hours would be great. I hope they eventually move this into a high level API similar to OpenMP that can be used from C++.
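Since only 16 bytes come back each time, one small thing that may help is copying from page-locked (pinned) host memory, which has lower per-transfer overhead than pageable memory. A sketch, where `h_result`/`d_result` and the kernel are placeholders:

```cuda
#include <cuda_runtime.h>

float *h_result;   // 4 floats = 16 bytes
float *d_result;

// pinned host buffer instead of plain malloc:
cudaMallocHost((void**)&h_result, 4 * sizeof(float));
cudaMalloc((void**)&d_result, 4 * sizeof(float));

// kernel<<<grid, threads>>>(d_result);   // placeholder for the real kernel
cudaMemcpy(h_result, d_result, 4 * sizeof(float), cudaMemcpyDeviceToHost);
```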

On the same machine, CUDA 2.0 improves the time to ~11.9us (yes this was tested over multiple runs: fluctuations are +/- 0.2us).

Thanks for the measurement. Maybe it’s a BIOS issue.

It is Vista vs. Linux.

Thanks for answering. Is there any chance that it will improve to be similar, or is it an inherent Vista limitation?

Vista support is still in beta, so there is room for improvement. Some limitations come from the OS, but we are working to get the best possible performance.