Running a kernel blocks the CPU? Is it possible to run it asynchronously?

Hey

I see that running a kernel uses 100% of a core on my CPU. Does this mean that a kernel launch completely blocks a CPU core until it's finished? None of the processing is done on the CPU, so I'm guessing it's some kind of spin lock that busy-waits until the kernel is finished. If so, that seems very inefficient, and I'd like to know whether anybody has a workaround.

Another equally important question: if the kernel does indeed block one CPU core while it runs, how does the OS scheduler interrupt that host-side wait when it suspends the CUDA CPU thread to let other threads execute? I have a program that runs several CPU-intensive threads and one CUDA-related CPU thread, and the CUDA code runs MUCH more slowly than in a single-threaded application. Please advise.

Kernel launches are asynchronous, but if you do certain CUDA operations after the launch (like a cudaMemcpy, which requires that the kernel finish to give you correct answers) then the CPU will block in a busy loop polling the card until it is ready. This minimizes latency, but as you observe, uses 100% of a CPU core.
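To make the distinction concrete, here is a minimal sketch (the kernel, names, and sizes are made up for illustration): the launch itself returns almost immediately, and it is the cudaMemcpy afterwards that spin-waits on the GPU.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i)
        h[i] = 1.0f;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // The launch is queued and control returns to the CPU right away.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

    // The CPU thread is free to do unrelated work at this point
    // while the GPU runs the kernel.

    // This copy cannot return until the kernel has finished, so the
    // runtime waits for the GPU here -- by default in a busy loop,
    // which is the 100% CPU usage you are seeing.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[0] = %f\n", h[0]);  // prints 2.000000
    cudaFree(d);
    free(h);
    return 0;
}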

If you can tolerate increased latency, then there are ways to change this behavior. CUDA 2.2 will have options to change the blocking mechanism so that the CPU thread waiting for the GPU will yield to other threads. See the man page for cudaSetContextFlags() if you have access to the beta.
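For what it's worth, here is roughly what that looks like with the flag names the released 2.2 runtime ended up using (cudaSetDeviceFlags() with cudaDeviceScheduleYield / cudaDeviceBlockingSync); if you are on the beta, the call mentioned above may be named differently:

#include <cuda_runtime.h>

int main()
{
    // Must be the first runtime call in the process (before any CUDA
    // context exists), otherwise it returns an error.
    //
    //   cudaDeviceScheduleYield -- the waiting CPU thread yields its
    //                              time slice to other runnable threads
    //   cudaDeviceBlockingSync  -- the waiting CPU thread sleeps on a
    //                              sync primitive until the GPU is done
    //
    // Both trade a little latency for a free CPU core.
    cudaSetDeviceFlags(cudaDeviceScheduleYield);

    // ... normal CUDA setup and kernel launches follow ...
    return 0;
}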

If you are using CUDA 2.1 or older, I think there is a way to do your own polling with streams, but I’m not entirely sure how that works.
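Something along these lines should work as a do-it-yourself version (a sketch, assuming POSIX usleep(); the idea is just to query the stream in a loop instead of synchronizing on it):

#include <unistd.h>
#include <cuda_runtime.h>

// Wait for all work previously issued to the default stream, but sleep
// between polls instead of letting the runtime spin for us.
static void waitForGpuPolitely(void)
{
    // cudaStreamQuery() returns cudaErrorNotReady while work is still
    // in flight, and cudaSuccess once the stream has drained.
    while (cudaStreamQuery(0) == cudaErrorNotReady)
        usleep(100);  // 0.1 ms nap; tune for your latency tolerance
}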

Very informative, thanks!

As a note to people reading this: I noticed that multiple calls to fast CUDA kernels from an overloaded CPU take an unacceptably long time to finish. After using the Visual Profiler, I saw that the time spent computing on the GPU was the same between the single-threaded and the overloaded multithreaded CPU apps. The problem is therefore that when the CPU is overloaded with computationally intensive threads, the CUDA thread does not get a proper chance to launch its kernels, and the latency between kernel launches skyrockets (for me, it was 30x slower on a loaded dual-core CPU). I solved the problem by adding a microsleep to my CPU-intensive threads, which effectively gives the CUDA thread some time to launch the kernel properly. A sketch of the workaround follows below.
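For anyone who wants to copy the workaround, this is roughly what my CPU-intensive threads look like now (a sketch; the dummy workload and the 50-microsecond sleep are placeholders you would tune for your own app):

#include <unistd.h>
#include <pthread.h>

// Stand-in for the real computation; 'volatile' keeps the compiler
// from optimizing the loop away.
static volatile double sink;

static void doSomeHeavyWork(void)
{
    double x = 0.0;
    for (int i = 0; i < 1000000; ++i)
        x += i * 0.5;
    sink = x;
}

static void *cpuIntensiveWorker(void *arg)
{
    (void)arg;
    for (;;) {
        doSomeHeavyWork();
        // The microsleep: briefly give up the core so the OS can
        // schedule the CUDA thread and let it issue its next launch.
        usleep(50);
    }
    return NULL;
}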