CPU usage while waiting for kernel

kernel code:

[codebox]__global__ void long_run()
{
    __shared__ int temp[10];

    // Long busy loop so the kernel runs long enough to observe host CPU usage.
    for (int i = 0; i < 999999; ++i)
        temp[0] = temp[1] + temp[2];

    __syncthreads();
}[/codebox]

Host code:

[codebox]long_run<<<2048,128>>>();

CUDA_SAFE_CALL(cudaThreadSynchronize());[/codebox]

long_run is a kernel that takes a long time to execute, so there is plenty of time to record the CPU usage while the GPU is running and the CPU is waiting. It is a single-threaded program, and the main thread was bound to the first core (Core i7 920).

While the program was running, we noticed that the CPU usage of core 1 stayed at 100%. So our questions are: why is the CPU usage so high? How can we reduce it while the CPU is waiting for the GPU kernel? And what is the CPU actually doing while it is just waiting?

Thanks a lot!

It is “high” because cudaThreadSynchronize() is effectively a spin lock which polls the GPU at rather high frequency, waiting until the GPU kernel is finished. Because the CPU thread is just sitting in a polling loop, it actually isn’t doing much work. Since CUDA 2.3, I understand you can control the frequency of polling if it really bothers you.
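If you want to see the spinning directly, here is a rough sketch (untested, Linux-only, reusing the long_run kernel from the first post) that prints how much user CPU time the host thread burns inside cudaThreadSynchronize(); with the default spin-wait it should come out close to the kernel's wall-clock run time:

[codebox]#include <cstdio>
#include <sys/resource.h>
#include <cuda_runtime.h>

__global__ void long_run();   // the kernel from the first post

// User-mode CPU time consumed by this process so far, in seconds.
static double user_cpu_seconds()
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec * 1e-6;
}

int main()
{
    long_run<<<2048, 128>>>();   // launch is asynchronous, returns immediately

    double before = user_cpu_seconds();
    cudaThreadSynchronize();     // the host thread polls the GPU in here
    double after  = user_cpu_seconds();

    printf("user CPU time spent waiting: %.3f s\n", after - before);
    return 0;
}[/codebox]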

Thanks. However, can you tell me about the problem in more detail?

Check the cudaSetDeviceFlags API.

I use it this way: cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost);

The flag you should be interested in is cudaDeviceScheduleYield.
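One thing to watch out for (a minimal sketch below, not tested): the flags have to be set before anything else initializes the CUDA runtime in that host thread, i.e. before the first cudaMalloc, kernel launch and so on, otherwise the call fails with cudaErrorSetOnActiveProcess.

[codebox]#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Must come before any runtime call that creates the context
    // (allocations, launches, ...), or it has no effect and errors out.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost);
    if (err == cudaErrorSetOnActiveProcess)
        printf("too late: the runtime was already initialized in this thread\n");

    // ... allocations and kernel launches go here ...
    return 0;
}[/codebox]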

Thanks, avidday and Sarnath, your suggestions were quite helpful.

We used cudaSetDeviceFlags with the cudaDeviceBlockingSync flag, which brought CPU usage down to zero while waiting.
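Roughly, our host code now looks like the sketch below (simplified, error checking omitted):

[codebox]#include <cuda_runtime.h>

__global__ void long_run();   // same kernel as in the first post

int main()
{
    // Set before any other runtime call so the flag takes effect for this thread.
    cudaSetDeviceFlags(cudaDeviceBlockingSync);

    long_run<<<2048, 128>>>();

    // The host thread now blocks on an OS synchronization primitive instead of
    // spinning, so core 1 stays near 0% while the kernel runs.
    cudaThreadSynchronize();
    return 0;
}[/codebox]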

The documentation for the function is quoted below:
Records flags as the flags to use when the active host thread executes device code. If the host thread has already initialized the CUDA runtime by calling non-device management runtime functions, this call returns cudaErrorSetOnActiveProcess.

The two LSBs of the flags parameter can be used to control how the CPU thread interacts with the OS scheduler when waiting for results from the device.

cudaDeviceScheduleAuto: The default value if the flags parameter is zero, uses a heuristic based on the number of active CUDA contexts in the process C and the number of logical processors in the system P. If C > P, then CUDA will yield to other OS threads when waiting for the device, otherwise CUDA will not yield while waiting for results and actively spin on the processor.
cudaDeviceScheduleSpin: Instruct CUDA to actively spin when waiting for results from the device. This can decrease latency when waiting for the device, but may lower the performance of CPU threads if they are performing work in parallel with the CUDA thread.
cudaDeviceScheduleYield: Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device.
cudaDeviceBlockingSync: Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work.
cudaDeviceMapHost: This flag must be set in order to allocate pinned host memory that is accessible to the device. If this flag is not set, cudaHostGetDevicePointer() will always return a failure code.

Parameters:
flags - Parameters for device operation

Returns:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorSetOnActiveProcess
See also:
cudaGetDeviceCount, cudaGetDevice, cudaGetDeviceProperties, cudaSetDevice, cudaSetValidDevices, cudaChooseDevice