Running a kernel blocks the CPU? Is it possible to run it asynchronously?

Hey

I see that running a kernel uses 100% of a core on my CPU. Does this mean that a kernel launch completely blocks a CPU core until it's finished? None of the processing is done on the CPU, so I'm guessing it's some kind of spin lock that busy-waits until the kernel is finished. If so, that seems very inefficient, and I'd like to know whether anybody has a workaround.

Another equally important question: if the kernel does indeed block one CPU core while it runs, how does the OS scheduler interrupt that host-side wait when it suspends the CUDA CPU thread to let other threads execute? I have a program that runs several CPU-intensive threads and one CUDA-related CPU thread, and the CUDA code runs MUCH more slowly than in a single-threaded application. Please advise.

Kernel launches are asynchronous, but if you do certain CUDA operations after the launch (like a cudaMemcpy, which requires that the kernel finish to give you correct answers) then the CPU will block in a busy loop polling the card until it is ready. This minimizes latency, but as you observe, uses 100% of a CPU core.
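To make the distinction concrete, here is a minimal sketch (the kernel, names, and sizes are made up for illustration): the launch itself returns almost immediately, and it is the cudaMemcpy afterwards that spin-waits on the GPU.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i)
        h[i] = 1.0f;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // The launch is queued and control returns to the CPU right away.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

    // The CPU thread is free to do unrelated work at this point
    // while the GPU runs the kernel.

    // This copy cannot return until the kernel has finished, so the
    // runtime waits for the GPU here -- by default in a busy loop,
    // which is the 100% CPU usage you are seeing.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[0] = %f\n", h[0]);  // prints 2.000000
    cudaFree(d);
    free(h);
    return 0;
}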

If you can tolerate increased latency, then there are ways to change this behavior. CUDA 2.2 will have options to change the blocking mechanism so that the CPU thread waiting for the GPU will yield to other threads. See the man page for cudaSetContextFlags() if you have access to the beta.
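For what it's worth, here is roughly what that looks like with the flag names the released 2.2 runtime ended up using (cudaSetDeviceFlags() with cudaDeviceScheduleYield / cudaDeviceBlockingSync); if you are on the beta, the call mentioned above may be named differently:

#include <cuda_runtime.h>

int main()
{
    // Must be the first runtime call in the process (before any CUDA
    // context exists), otherwise it returns an error.
    //
    //   cudaDeviceScheduleYield -- the waiting CPU thread yields its
    //                              time slice to other runnable threads
    //   cudaDeviceBlockingSync  -- the waiting CPU thread sleeps on a
    //                              sync primitive until the GPU is done
    //
    // Both trade a little latency for a free CPU core.
    cudaSetDeviceFlags(cudaDeviceScheduleYield);

    // ... normal CUDA setup and kernel launches follow ...
    return 0;
}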

If you are using CUDA 2.1 or older, I think there is a way to do your own polling with streams, but I’m not entirely sure how that works.
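Something along these lines should work as a do-it-yourself version (a sketch, assuming POSIX usleep(); the idea is just to query the stream in a loop instead of synchronizing on it):

#include <unistd.h>
#include <cuda_runtime.h>

// Wait for all work previously issued to the default stream, but sleep
// between polls instead of letting the runtime spin for us.
static void waitForGpuPolitely(void)
{
    // cudaStreamQuery() returns cudaErrorNotReady while work is still
    // in flight, and cudaSuccess once the stream has drained.
    while (cudaStreamQuery(0) == cudaErrorNotReady)
        usleep(100);  // 0.1 ms nap; tune for your latency tolerance
}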

Very informative, thanks!

As a note to people reading this: I noticed that multiple calls to fast CUDA kernels from an overloaded CPU take an unacceptably long time to finish. After using the Visual Profiler, I saw that the time spent computing on the GPU was the same between the single-threaded and the overloaded multithreaded CPU apps. The problem is therefore that when the CPU is overloaded with computationally intensive threads, the CUDA thread does not get a proper chance to launch its kernels, and the latency between kernel launches skyrockets (for me, it was 30x slower on a loaded dual-core CPU). I solved the problem by adding a microsleep to my CPU-intensive threads, which effectively gives the CUDA thread some time to launch the kernel properly. A sketch of the workaround follows below.
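For anyone who wants to copy the workaround, this is roughly what my CPU-intensive threads look like now (a sketch; the dummy workload and the 50-microsecond sleep are placeholders you would tune for your own app):

#include <unistd.h>
#include <pthread.h>

// Stand-in for the real computation; 'volatile' keeps the compiler
// from optimizing the loop away.
static volatile double sink;

static void doSomeHeavyWork(void)
{
    double x = 0.0;
    for (int i = 0; i < 1000000; ++i)
        x += i * 0.5;
    sink = x;
}

static void *cpuIntensiveWorker(void *arg)
{
    (void)arg;
    for (;;) {
        doSomeHeavyWork();
        // The microsleep: briefly give up the core so the OS can
        // schedule the CUDA thread and let it issue its next launch.
        usleep(50);
    }
    return NULL;
}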