Does kernel execution still block one CPU?

I have a dual-core CPU and would like to use OpenMP to run two CPU threads while a third thread runs the CUDA calls. My performance numbers aren’t clear about whether CUDA is letting me do this or not. So I have two questions:

  1. Does kernel execution on either of the recent releases (0.8 and 0.9) block a CPU? I realize that the calls are asynchronous as of v0.9, but I saw no change in my performance numbers between versions 0.8 and 0.9, so I have to ask.

  2. I know that my data transfers (the non-asynchronous part of my CUDA code) take an insignificant amount of time, so I am not sure what is causing the performance numbers that I see. I use “omp parallel sections” to split my serial code into two threads: one contains only GPU/CUDA subroutines, the other only CPU routines. The CPU thread splits again with an “omp parallel do,” and all threads rejoin at the end of the computation. Without OpenMP, the GPU part takes 24 seconds and the CPU part (1 thread) takes 58 seconds, running serially. With OpenMP, all 3 threads start simultaneously (I suspect), and the start-to-finish wall time is 24 seconds for the GPU section and 34 seconds for the CPU section. If the CPU isn’t burdened by CUDA at all, then the 58 seconds of CPU computation split across both cores should take 29 seconds, not 34. But if CUDA blocks a core, then the CPU section should run one thread for 24 seconds and then two threads for another 17, totalling 58 seconds of work in 41 seconds of wall-clock time. My measured 34 seconds falls between the two predictions.

Am I approaching this incorrectly? Have I missed something? Or does CUDA still require a large amount of time from the CPU during kernel calls?

I’ve seen the same issue with our kernels as well. My observations lead me to believe that while CUDA 0.9 now supports asynchronous kernels, making any CUDA API call immediately after launching a kernel causes the CUDA runtime to busy-wait until the currently-running kernel or API call is complete, resulting in high CPU load much as was seen with the older CUDA 0.8 release. For those lucky enough to have “fire and forget” type kernel arrangements, this may not be a problem, but for multi-pass or iterative timestep integration algorithms that run many kernels on the GPU back-to-back, this is still less than ideal.

I’ve suggested adding an API to allow polling the run status of the previous CUDA API call (e.g. cudaThreadBusy() or something akin to that), returning a simple true/false result indicating whether the previous call has completed yet. This would allow developers to avoid the busy-waits in cudaThreadSynchronize() and the implicit synchronizations that occur when calling multiple CUDA APIs or kernels back-to-back.

Another case where this shows up is in the situation when you want to drive multiple CUDA capable GPUs using several threads multiplexed onto a single CPU core. This occurs if you have a system that contains more GPUs than CPUs, for example. In such a situation, you don’t want any of the host threads managing GPUs to do busy-waiting as they will starve for CPU and the result will be poorer performance.

In the case that a thread has got nothing better to do than busy-wait, it would still be preferable to do something like call sched_yield() in the busy-wait loop so that any other runnable thread gets a shot at running before returning to the current (busy-waiting) thread. Another way to do this would be to use a “wakeup” type implementation rather than a busy-wait implementation, using condition variables or something of that sort.

I have no idea how the hardware indicates completion of a running CUDA kernel, so it may be that only one of these ideas would actually work well, but having any one of them would effectively solve the problem. Ideally I’d like to see a polling call implemented and also have an alternative “wakeup” based implementation of cudaThreadSynchronize(), but I’ll take what I can get. :-)

John Stone

I experienced similar problems with 0.9. In theory, you should be able to make some CUDA calls after a kernel without blocking, but in practice a memcpy between host and device is usually the only call you need to make after a kernel, since with the possibilities of CUDA almost all of the processing can be mapped onto the GPU itself. Unfortunately, the D2H copy is blocking. Interestingly, the D2D copy is asynchronous. So if you have spare device memory, you can collect kernel results using D2D copies and then download everything in one batch at the end, so you block only once. But if you cannot rearrange your algorithm like that, I agree with John that some thread/kernel state polling calls would be good to have.
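A rough sketch of that D2D batching idea, assuming the CUDA runtime API of this era; the buffer names, sizes, and kernel launch are placeholders, and this fragment needs a CUDA-capable system to build:

```cuda
#include <cuda_runtime.h>

#define NPASSES 8
#define N       1024

void collect_results(float *d_scratch, float *d_results, float *h_results)
{
    for (int pass = 0; pass < NPASSES; ++pass) {
        /* ... launch the kernel for this pass, writing into d_scratch ... */

        /* D2D copy: asynchronous, so the CPU is not held up per pass */
        cudaMemcpy(d_results + pass * N, d_scratch,
                   N * sizeof(float), cudaMemcpyDeviceToDevice);
    }

    /* One blocking D2H download at the end instead of one per pass */
    cudaMemcpy(h_results, d_results, NPASSES * N * sizeof(float),
               cudaMemcpyDeviceToHost);
}
```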


A polling call would also match well with the existing programming strategies for non-blocking message passing in APIs like MPI. Adding CUDA code to MPI programs that operate as event-handling loops would be much easier this way. Anywhere an MPI code uses API calls like MPI_Iprobe(), MPI_Testsome(), MPI_Testany(), or MPI_Testall() is the kind of code that would want something like a cudaThreadBusy() if it used CUDA. It then becomes easy to manage multiple GPUs within an event-processing loop that’s similar to the existing MPI code.


Similarly, you could wrap every CUDA call in a pthread with a shared bool that indicates whether or not the function has finished. Just have the CUDA function set the bool when it’s finished.

So you could do something like this:

bool finished = false;

/* worker thread: make the CUDA call(s), then: */
finished = true;

/* main thread: */
… do something …
… poll `finished` …

Or, instead of polling, sleep until the worker thread finishes and wakes me up.