CPU core is busy while GPU runs its kernel

I’m trying to run my CPU and GPU in parallel. I have a compute-intensive GPU kernel, and I want the CPU to do other computations while waiting for the GPU results. My code has several CPU threads: one calls the GPU kernel, and the rest have their own compute-intensive tasks that run on the CPU itself. I call cudaSetDeviceFlags() and expected it to make the CPU thread that invokes the CUDA kernel yield while the kernel is running, so that the CPU core would be available for other threads. In practice, no matter what parameter I give cudaSetDeviceFlags(), the thread does not yield. Any ideas?

I’m attaching part of my code and a screenshot from Nsight that shows the CPU thread running and not yielding.

cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync); // I tried cudaDeviceScheduleSpin, cudaDeviceScheduleYield, cudaDeviceScheduleBlockingSync, and cudaDeviceScheduleAuto. All give the same results.

cudaStatus = cudaDeviceSynchronize();
bp_2xrts_kernel<<<blocks, threads>>>(outBuffer, inBuffer);
cudaStatus = cudaDeviceSynchronize();

cudaDeviceSynchronize() blocks the CPU thread until all GPU tasks are completed. If you want to do computations on this CPU thread, you should remove that call and put your code in its place.

Hi hadschi,
I realize cudaDeviceSynchronize() should block the CPU thread, and this is exactly what I would like to happen. I would like the CPU thread to be blocked and scheduled off the CPU by the operating system, so that other processes and other threads can run on my CPU while this thread is blocked.

The image I attached above shows that my thread takes 88.6% of the CPU. The timeline shows it as a green bar (indicating ‘running’ threads), and I expected it to be a brown bar (indicating threads that are ‘waiting’).

You suggest putting my code in place of the cudaDeviceSynchronize() call. This is probably not possible, since that code may just as well be a different application running in a different process.

Ah, now I understand your problem. I’ll check if I see the same on my system.


I agree with NadavSeg that the CPU is busy even with cudaDeviceScheduleBlockingSync set.


I did the following:
On Linux, I ran a simple OpenMP application using all CPUs. If I then start an additional CUDA application, it fully uses one CPU core.

(In NVVP the timeline shows a brown bar, but I could not find a hint in the docs that this indicates waiting. Is your image from the Visual Studio profiler?)

Hi SPWorley,
I tried all possible options for cudaSetDeviceFlags() and they all give the same results. It seems this function has no effect at all…

I’m using NVIDIA Nsight in Microsoft Visual Studio, running on Windows 7. The image I attached above is a screenshot of the ‘timeline’ view in Nsight. It cannot be seen in the image, but hovering with the mouse over the green and brown bars opens a pop-up with more details, including the thread state, such as ‘running’, ‘ready’, or ‘waiting’.

Ok. I am using the Eclipse-based profiler, which seems to differ in this view.

I did the same thing with the NVIDIA vectorAdd example that comes with the SDK. I changed ‘numElements’ in the code so the buffers total 300 MB and the kernel runs for a long time. I ran the code with the Nsight timeline, and I see the CPU is busy and marked as ‘Running’ the entire time the GPU runs the ‘vectorAdd’ kernel (instead of switching to ‘waiting’). Can anyone confirm that this is a real NVIDIA bug and that in practice the GPU cannot run in parallel with the CPU?

Maybe you should just file a bug report and see what they answer.

What is the return value of cudaSetDeviceFlags()? You must call cudaSetDeviceFlags() before any CUDA context has been created, that is, before most other CUDA function calls have been made. (Most CUDA functions will create a CUDA context if one does not yet exist.) Be on the lookout for CUDA calls in the constructors of global objects.

cudaSetDeviceFlags() doesn’t give you fine-grained control over blocking behavior (for instance, you cannot specify different behaviors for different GPUs in a multi-GPU system). CUDA events will give you more control. Create the event using cudaEventCreateWithFlags(), providing the flag cudaEventBlockingSync. Replace your calls to cudaDeviceSynchronize() with cudaEventRecord()/cudaEventSynchronize() pairs. You could abstract all of this within a custom function of your own (the event should probably be declared global or static). There are two drawbacks to this approach: (1) you cannot use it if you can’t modify the code in question; (2) it incurs slightly more overhead than cudaStreamSynchronize()/cudaDeviceSynchronize().

Finally, be careful with cudaDeviceScheduleYield. You may find that you make frequent calls into the OS scheduler, incurring scheduler overhead and losing cache affinity. You may actually get better performance (for both the non-GPU-using and GPU-using applications running concurrently) with cudaDeviceScheduleBlockingSync.

I found you have to call

cudaSetDevice()

before calling cudaSetDeviceFlags(…), otherwise it has no effect whatsoever.

See the documentation http://docs.nvidia.com/cuda/cuda-runtime-api/index.html :
“If no device has been made current to the calling thread, then flags will be applied to the initialization of any device initialized by the calling host thread, unless that device has had its initialization flags set explicitly by this or any host thread.”

Which essentially means that if no device has been made current to the calling thread before calling cudaSetDeviceFlags(), the behavior in a multi-threaded application is effectively unpredictable: whichever thread happens to initialize a device first determines which flags it gets.