CPU core is busy while GPU runs its kernel

I’m trying to run my CPU and GPU in parallel. I have a compute-intensive GPU kernel, and I want the CPU to do other computations while waiting for the GPU results. My code has several CPU threads: one calls the GPU kernel, and the rest have their own compute-intensive tasks that run on the CPU itself. I call cudaSetDeviceFlags() and expected it to make the CPU thread that invokes the CUDA kernel yield while the kernel is running, so that the CPU core would be available for other threads. In practice, no matter what parameter I give cudaSetDeviceFlags(), the thread does not yield. Any ideas?

I’m attaching part of my code and a screenshot from Nsight that shows the CPU thread running and not yielding.

cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync); // I tried cudaDeviceScheduleSpin, cudaDeviceScheduleYield, cudaDeviceScheduleBlockingSync, and cudaDeviceScheduleAuto. All give the same results.

cudaStatus = cudaDeviceSynchronize();
bp_2xrts_kernel<<<blocks, threads>>>(outBuffer, inBuffer);
cudaStatus = cudaDeviceSynchronize();

cudaDeviceSynchronize() blocks the CPU thread until all GPU tasks are completed. If you want to do computations on this CPU thread, you should remove that call and put your code in its place.

Hi hadschi,
I realize cudaDeviceSynchronize() should block the CPU thread, and this is exactly what I would like to happen. I would like the CPU thread to be blocked and scheduled off the CPU by the operating system, so that other processes and other threads can run on my CPU while this thread is blocked.

The image I attached above shows that my thread takes 88.6% of the CPU. The timeline shows it as a green bar (indicating ‘running’ threads), and I expected it to be a brown bar (indicating threads that are ‘waiting’).

You suggest putting my code in place of the cudaDeviceSynchronize() call. This is probably not possible, since that code may just as well be a different application running in a different process.

Ah, now I understand your problem. I’ll check if I see the same on my system.


I agree with NadavSeg that the CPU is busy even with cudaDeviceScheduleBlockingSync set.


I did the following:
On Linux, I ran a simple OpenMP application using all CPUs. If I then start an additional CUDA application, it fully uses one CPU core.

(In NVVP the timeline shows a brown bar, but I could not find a hint in the docs that this indicates waiting. Is your image from the Visual Studio profiler?)

Hi SPWorley,
I tried all possible options for cudaSetDeviceFlags() and they all give the same results. It seems this function has no effect at all…

I’m using NVIDIA Nsight in Microsoft Visual Studio, running on Windows 7. The image I attached above is a screenshot of the ‘timeline’ view in Nsight. It cannot be seen in the image, but hovering with the mouse over the green and brown bars opens a pop-up with more details, including the thread state, such as ‘running’, ‘ready’, or ‘waiting’.

Ok. I am using the Eclipse-based profiler, which seems to differ in this view.

I did the same thing with the NVIDIA vectorAdd example that comes with the SDK. I changed ‘numElements’ in the code so the buffers total 300 MB and the kernel runs for a long time. I ran the code with the Nsight timeline, and I see the CPU is busy and marked as ‘Running’ the entire time the GPU runs the ‘vectorAdd’ kernel (instead of switching to ‘waiting’). Can anyone confirm that this is a real NVIDIA bug and that in practice the GPU cannot run in parallel with the CPU?

Maybe you should just file a bug report and see what they answer.

What is the return value of cudaSetDeviceFlags()? You must call cudaSetDeviceFlags() before any CUDA context has been created, that is, before most other CUDA function calls have been made. (Most CUDA functions will create a CUDA context if one does not yet exist.) Be on the lookout for CUDA calls in the constructors of global objects.

cudaSetDeviceFlags() doesn’t give you fine-grained control over blocking behavior (for instance, you cannot specify different behaviors for different GPUs in a multi-GPU system). CUDA events will give you more control. Create the event using cudaEventCreateWithFlags(), providing the flag cudaEventBlockingSync. Replace your calls to cudaDeviceSynchronize() with cudaEventRecord()/cudaEventSynchronize() pairs. You could abstract all of this within a custom function of your own (the event should probably be declared global or static). There are two drawbacks to this approach: (1) you cannot use it if you can’t modify the code in question; (2) it incurs slightly more overhead than cudaStreamSynchronize()/cudaDeviceSynchronize().

Finally, be careful with cudaDeviceScheduleYield. You may find that you make frequent calls into the OS scheduler, incurring scheduler overhead and losing cache affinity. You may actually get better performance (for both the non-GPU-using and GPU-using applications running concurrently) with cudaDeviceScheduleBlockingSync.

I found you have to call

cudaSetDevice()

before calling cudaSetDeviceFlags(…), otherwise it has no effect whatsoever.

See the documentation http://docs.nvidia.com/cuda/cuda-runtime-api/index.html :
“If no device has been made current to the calling thread, then flags will be applied to the initialization of any device initialized by the calling host thread, unless that device has had its initialization flags set explicitly by this or any host thread.”

Which essentially means that if no device has been made current to the calling thread before calling cudaSetDeviceFlags(), the behavior in a multi-threaded application is effectively unpredictable: whichever thread happens to initialize a device first determines which flags it gets.