High host CPU load

Hi All

I need some help as CUDA is not my best area of expertise.
I’m experiencing high CPU load that leads to an increasing load average on Linux, which in turn leads to reboots on some systems with insufficiently powerful CPUs.
The code uses cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync), however that does not help at all.
Any ideas, please? I can provide the code on request if needed.

Thanks a lot in advance!

What is the average time your kernels run for?

cudaDeviceScheduleBlockingSync probably only makes a difference if that time is the same as or larger than the Linux kernel scheduler’s Jiffy time. Depending on the Linux version you use, typical Jiffy times can be 10 ms, 2.5 ms or 1 ms.

Any blocking sync shorter than those times might not register as idle CPU time.

EDIT: checking your kernel’s setting can be done with

gunzip /proc/config.gz -c | grep CONFIG_HZ=

1000 divided by the result for CONFIG_HZ should give the Jiffy time in ms.

Christian

Thanks, Christian. But what can you suggest? How do I decrease the CPU load, please?

Have you tried moving around the location of cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)?

It may only have an effect before the CUDA context is created.

This Stack Overflow thread details the issue.
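
Something along these lines is what I mean (just a minimal sketch, not your code): make the flag call the very first CUDA runtime call, and check its return status so a rejected flag doesn’t go unnoticed:

// Minimal sketch: set the scheduling flag before any call that creates
// the CUDA context, and check the returned status.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void k(){}

int main(){
  // Must come before the first context-creating runtime call
  // (kernel launch, cudaMalloc, etc.).
  cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
  if (err != cudaSuccess)
    printf("cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));

  k<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}

If the flag call reports an error, the context was most likely already created earlier in the program (or by a library you use).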

Yes, I have tried that. I’ve also tried using cudaDeviceScheduleYield instead of cudaDeviceScheduleBlockingSync, with no success though.
Any other ideas, please?

another relevant thread

That sounds like a “nasty hack”… However, I have nothing better at the moment, so it’s worth a try. Did that hack really work for you?

And anyway… I’d like to find a proper solution for the issue…

When I create a simple application and test the difference between use of

cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

or not, at the point of cudaDeviceSynchronize(), what I observe is that when I don’t use it, top reports about 100% CPU usage by the process. When I do use it, top reports approximately 0% CPU usage by the process.

$ cat t5.cu
const unsigned long long delay = 30000000000ULL;

// Busy-wait on the GPU clock for roughly `delay` cycles so the kernel
// runs long enough to observe host CPU behavior during the sync.
__global__ void k(){

  unsigned long long start = clock64();
  while (clock64() < (start+delay));
}

int main(){
#ifdef USE_BLOCK
  // Must be set before the context is created (i.e. before the launch).
  cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
#endif
  k<<<1,1>>>();
  cudaDeviceSynchronize();  // host waits here; this is where CPU usage differs
}

$ nvcc -o t5 t5.cu
$ ./t5 &
[1] 9970
$ top -p 9970
top - 10:02:37 up 297 days, 13:36,  2 users,  load average: 0.27, 0.15, 0.09
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.8 us,  1.4 sy,  0.0 ni, 95.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13183105+total, 11599987+free,  1562592 used, 14268592 buff/cache
KiB Swap:  4194300 total,  3770732 free,   423568 used. 12943844+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9970 user2     20   0  0.157t 131180 122196 R  99.7  0.1   0:09.94 t5

[1]+  Done                    ./t5

$ nvcc -o t5 t5.cu -DUSE_BLOCK
$ ./t5 &
[1] 10008
$ top -p 10008
top - 10:04:18 up 297 days, 13:38,  2 users,  load average: 0.11, 0.14, 0.09
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13183105+total, 11600230+free,  1560092 used, 14268668 buff/cache
KiB Swap:  4194300 total,  3770732 free,   423568 used. 12944084+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10008 user2     20   0  0.157t 131088 122108 S   0.0  0.1   0:00.29 t5

Thanks, Robert
That’s expected behavior, as I understand it. But any idea why I’m getting different results with my code?

No, I can’t explain the behavior of code that I can’t inspect. And I’m unlikely to look at test cases that:

  • are not complete
  • are not short (say, less than 100 lines)
  • are not posted properly in this forum, using proper code formatting
  • are provided via an attachment
  • are provided via an offsite link

Do as you wish, of course.

Well, I’ve run your properly formatted code.
The result is that I can’t see any difference on my test system. Both variants give something like 0.3-3.5 us. Any idea why? Is the CUDA version important?

I have no idea what that means.

%Cpu(s): 2.8 us
In my case it’s not constant but changes within the range mentioned.

Those are not the numbers of interest. We’re interested in this:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9970 user2     20   0  0.157t 131180 122196 R  99.7  0.1   0:09.94 t5
                                                ^^^^
                                                ||||

vs. this:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10008 user2     20   0  0.157t 131088 122108 S   0.0  0.1   0:00.29 t5
                                                ^^^^
                                                ||||

Ok. Got it. Thanks.
However, the same approach does not work in the code I have, for some reason.

Make sure you haven’t set CUDA_LAUNCH_BLOCKING=1 somewhere in your environment. There might be Linux configuration switches that interfere as well, but I cannot think of any off the top of my head.
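
As a quick sanity check, something like this (just a sketch, separate from your code) will report whether the variable is actually visible to the process:

// Minimal sketch: report whether CUDA_LAUNCH_BLOCKING is set for this process.
#include <cstdio>
#include <cstdlib>

int main(){
  const char *v = std::getenv("CUDA_LAUNCH_BLOCKING");
  printf("CUDA_LAUNCH_BLOCKING = %s\n", v ? v : "(not set)");
  return 0;
}

If it turns out to be set to 1, every kernel launch becomes synchronous and the host may spin-wait on each one, regardless of the device flags you set.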