CUDA 1.0 Asynchronous Launches

Does this mean that CUDA v1.0 supports parallel execution of several different kernels?

No, but the CPU can carry on with other work. If you need to run different CUDA kernels, you can combine them into a single kernel and branch on thread ID or block ID to execute different paths (recall that divergence carries a performance penalty only if threads within the same warp diverge).
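To make the warp-divergence point concrete, here is a small host-side model in plain C++ (not CUDA device code). It assumes a warp size of 32; the selector functions and `count_divergent_warps` are hypothetical names made up for this sketch. It counts how many warps would contain threads taking different paths in a fused kernel.

```cpp
#include <cassert>
#include <functional>

// Host-side model only: plain C++, not CUDA device code.
// Assumption: warps are 32 consecutive threads within a block.
constexpr int kWarpSize = 32;

// Hypothetical selector: which fused "sub-kernel" a given thread would run.
using Selector = std::function<int(int block, int thread)>;

// Counts the warps in which at least two threads choose different paths.
int count_divergent_warps(int blocks, int threads_per_block,
                          const Selector& select) {
    assert(threads_per_block > 0);
    int divergent = 0;
    for (int b = 0; b < blocks; ++b) {
        for (int w = 0; w < threads_per_block; w += kWarpSize) {
            const int first = select(b, w);
            for (int t = w + 1; t < w + kWarpSize && t < threads_per_block; ++t) {
                if (select(b, t) != first) {
                    ++divergent;  // this warp mixes paths
                    break;
                }
            }
        }
    }
    return divergent;
}

// Path chosen per block: all threads of a warp agree.
int select_by_block(int block, int /*thread*/) { return block % 2; }

// Path chosen by thread parity: every warp mixes both paths.
int select_by_thread(int /*block*/, int thread) { return thread % 2; }
```

With 4 blocks of 128 threads, `count_divergent_warps(4, 128, select_by_block)` returns 0, while `count_divergent_warps(4, 128, select_by_thread)` returns 16, that is, every warp diverges. This is why branching on block ID is the cheaper way to fuse kernels.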


I’ve seen a situation where performance dropped when extra code was added into a kernel. This extra code was NOT executed, but its presence caused the number of registers used by the kernel to increase. As a result, occupancy decreased and performance dropped.

Would combining multiple smaller kernels into a single large kernel like you suggest suffer from this problem too?

Ah, I’ve heard that suggestion already. I had just hoped that CUDA 1.0 would finally support parallel execution of several kernels.

I cannot easily combine different kernels into a single one. That works if the kernel code is known in advance, but that is not my case: I would like to use kernels whose code is not known a priori (it is compiled into a cubin file).

Why do the CUDA developers limit this functionality? There could be two operation modes: all power to a single kernel, or power spread between several kernels.

The answer is most likely yes. However, you can combat the increased register pressure with nvcc's -maxrregcount option (check the compiler documentation for details). You may be able to force register reuse and increase occupancy that way. As always, the optimal configuration should be found experimentally.



While a kernel is running, CPU usage is 100%. How can the host carry on with other tasks while the kernel runs? Or will the kernel be put on hold while the host does that work?


v1.0 has asynchronous kernel launches, so the CPU usage should no longer be 100% after a launch. Let me know if you’re experiencing something different.


I have attached a simple test. You can run it and check the load on the host: it stays at 100% for the whole time the kernel is working. The kernel runs for a long time (more than 200 ms) and is launched 100 times. (I doubt that any kernel exists that frees up the CPU in this version of CUDA.)


We’ve previously filed a bug on this. I don’t know precisely how the CUDA runtime is implemented, but I believe that it busy-waits when the user explicitly calls cudaThreadSynchronize(), or when the CUDA API itself needs to wait for an already-running async task to complete before beginning a subsequent operation (an implicit synchronization, if you will). I have observed this 100% CPU usage in several of our kernels that have to do multi-pass computations for this reason. My suggestion has been either to make the CUDA runtime implement the wait with a thread-wakeup strategy, or to add a new CUDA API that lets the host code poll the GPU. Polling would still let callers do their own busy-wait if they like, but it would also give them the opportunity to call usleep(), nanosleep(), sched_yield() or similar APIs that make the current thread yield to anything else that’s currently runnable. The polling approach also has the benefit of being well matched to the asynchronous message-passing APIs in MPI.
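The poll-and-yield pattern described above can be sketched on the host alone. The following is a minimal C++ illustration, not a CUDA API: `launch_fake_kernel` is a hypothetical stand-in for an asynchronous kernel launch (it just sleeps on another thread), and `wait_politely` is an assumed helper name. The point is only that the waiting thread sleeps between polls instead of spinning.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Polls a pending result and sleeps between polls, so the waiting thread
// yields the CPU instead of busy-waiting. Returns the number of polls that
// found the work unfinished.
// Note: the future must come from std::launch::async; a deferred future
// would never report "ready" here.
template <typename T>
int wait_politely(std::future<T>& pending, std::chrono::microseconds nap) {
    assert(pending.valid());
    int polls = 0;
    while (pending.wait_for(std::chrono::seconds(0)) !=
           std::future_status::ready) {
        ++polls;
        std::this_thread::sleep_for(nap);  // give up the CPU between polls
    }
    return polls;
}

// Hypothetical stand-in for a long-running asynchronous kernel:
// sleeps for `duration` on another thread, then produces a result.
std::future<int> launch_fake_kernel(std::chrono::milliseconds duration) {
    return std::async(std::launch::async, [duration] {
        std::this_thread::sleep_for(duration);
        return 42;
    });
}
```

A caller would do something like `auto k = launch_fake_kernel(std::chrono::milliseconds(200)); wait_politely(k, std::chrono::microseconds(500));` and could do useful work in place of the sleep. The nap length trades completion latency against CPU load, so it would need tuning per application.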


John Stone

I think you can delete cudaThreadSynchronize() from your code.

cudaThreadSynchronize() is not useful except for timing your kernel.


We’ve all been had here - all the talk about fixing this problem was pure marketing hype. All that was done was to move the spin wait in a kernel call from the end of the routine to the front, and likewise for the host/device transfer functions, when we had been promised concurrent host/device transfers.

Currently all you can do is guess how long your kernel is going to run and sleep for that long before doing anything else.

The correct solution is to have the card write to an open file descriptor in UNIX, so that select/poll can be used and the thread can do other things while a kernel is running. In Windows you need to create an I/O completion port so that one can wait on multiple events. Both of these require that the G80 card can issue a hardware interrupt upon completion. My question about this was never answered (back in April).
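As a rough illustration of the file-descriptor idea on a POSIX system, the sketch below simulates it with a pipe. Nothing here is a CUDA or driver interface: `launch_with_completion_fd` is a hypothetical function in which a worker thread plays the role of the device and writes one byte when it "finishes", and `wait_for_completion` is an assumed helper that blocks in poll() rather than spinning.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <poll.h>
#include <unistd.h>

// Hypothetical: simulates a device that signals completion by writing one
// byte to a file descriptor. Returns the read end of a pipe; a detached
// worker thread writes to the other end after `duration`.
int launch_with_completion_fd(std::chrono::milliseconds duration) {
    int fds[2];
    const int rc = pipe(fds);
    assert(rc == 0);
    (void)rc;
    const int wr = fds[1];
    std::thread([duration, wr] {
        std::this_thread::sleep_for(duration);  // the pretend kernel runs
        const char done = 1;
        const ssize_t n = write(wr, &done, 1);  // completion notification
        (void)n;
        close(wr);
    }).detach();
    return fds[0];
}

// Sleeps in the OS until the descriptor becomes readable or the timeout
// expires; returns true if completion was signalled.
bool wait_for_completion(int fd, int timeout_ms) {
    pollfd p{};
    p.fd = fd;
    p.events = POLLIN;
    const int ready = poll(&p, 1, timeout_ms);
    return ready == 1 && (p.revents & POLLIN) != 0;
}
```

Because the descriptor is an ordinary pollable fd, it could sit in the same poll() set as sockets or other event sources, which is the main attraction of this design over a dedicated blocking call.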

The API can just be extended in a backward compatible way to provide these features.


Edit: OK, it’s not quite that bad. The comments above were based on reports here, made before I ran my own tests. On my system up to 18 kernel launches are queued in the driver; any further calls cause a busy wait in user land. The only way to wait is the busy wait before a host/device transfer, which spin-waits in the kernel device driver until the queued operations complete. Launches queued in the device driver are fast now: I measure a dispatch time of 12-14 µs, which is a big improvement.