cpu usage while waiting for kernel

kernel code:

[codebox]global void long_run()


shared int temp[10];

for (int i =0; i < 999999; ++i)

 temp[0] = temp[1] + temp[2];



Host code:



long_run is a kernel function which costs a lot of time, so we will have enough time to record the CPU usage in condition that GPU is running and CPU is waiting. It is a single thread program, and main thread was binded to the first core(i7 CPU 920).

When the program was running, we noticed that CPU usage of core 1 kept 100%. So our question is why the CPU usage is so high? How can we reduce it while it is waiting for GPU kernel? What is CPU doing when it is just waiting??

Thanks a lot!

It is “high” because cudaThreadSynchronize() is effectively a spin lock which polls the GPU at rather high frequency, waiting until the GPU kernel is finished. Because the CPU thread is just sitting in a polling loop, it actually isn’t doing much work. Since CUDA 2.3, I understand you can control the frequency of polling if it really bothers you.

Thanks. However, can you tell me about the problem in more detail?

Check cudaSetDeviceFlags API

I use it this way: cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost);

The one that u should be interested in should be : cudaDeviceScheduleYield

Thank avidday and Sarnath, your suggestions are quite helpful.

We used cudaSetDeviceFlags with argument cudaDeviceBlockingSync, and alleviated CPU usage down to zero while waiting.

The detailed information of the function is posted below:
Records flags as the flags to use when the active host thread executes device code. If the host thread has already initialized the CUDA runtime by calling non-device management runtime functions, this call returns cudaErrorSetOnActiveProcess.

The two LSBs of the flags parameter can be used to control how the CPU thread interacts with the OS scheduler when waiting for results from the device.

cudaDeviceScheduleAuto: The default value if the flags parameter is zero, uses a heuristic based on the number of active CUDA contexts in the process C and the number of logical processors in the system P. If C > P, then CUDA will yield to other OS threads when waiting for the device, otherwise CUDA will not yield while waiting for results and actively spin on the processor.
cudaDeviceScheduleSpin: Instruct CUDA to actively spin when waiting for results from the device. This can decrease latency when waiting for the device, but may lower the performance of CPU threads if they are performing work in parallel with the CUDA thread.
cudaDeviceScheduleYield: Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device.
cudaDeviceBlockingSync: Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work.
cudaDeviceMapHost: This flag must be set in order to allocate pinned host memory that is accessible to the device. If this flag is not set, cudaHostGetDevicePointer() will always return a failure code.

flags - Parameters for device operation

cudaSuccess, cudaErrorInvalidDevice, cudaErrorSetOnActiveProcess
See also:
cudaGetDeviceCount, cudaGetDevice, cudaGetDeviceProperties, cudaSetDevice, cudaSetValidDevices, cudaChooseDevice