High host CPU load

Hi All

I need some help as CUDA is not my best area of expertise.
I’m experiencing high CPU load that leads to an increasing load average on Linux, which in turn leads to reboots on some systems with insufficiently powerful CPUs.
The code uses cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync), however that does not help at all.
Any ideas, please? I can provide the code on request if needed.

Thanks a lot in advance!

What is the average time your kernels run for?

cudaDeviceScheduleBlockingSync probably only makes a difference if that time is the same as or larger than the Linux kernel scheduler’s Jiffy time. Depending on the Linux version you use, typical Jiffy times can be 10 ms, 2.5 ms or 1 ms.

Any blocking sync shorter than those times might not register as idle CPU time.

EDIT: checking your kernel’s setting can be done with

gunzip /proc/config.gz -c | grep CONFIG_HZ=

1000 divided by the result for CONFIG_HZ should give the Jiffy time in ms.

Christian

Thanks, Christian. But what can you suggest? How do I decrease the CPU load, please?

Have you tried moving around the location of cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)?

It may only have an effect before the CUDA context is created.

This Stack Overflow thread details the issue.
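
Something along these lines is what I mean (just a minimal sketch, not your code): make the flag call the very first CUDA runtime call, and check its return status so a rejected flag doesn’t go unnoticed:

// Minimal sketch: set the scheduling flag before any call that creates
// the CUDA context, and check the returned status.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void k(){}

int main(){
  // Must come before the first context-creating runtime call
  // (kernel launch, cudaMalloc, etc.).
  cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
  if (err != cudaSuccess)
    printf("cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));

  k<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}

If the flag call reports an error, the context was most likely already created earlier in the program (or by a library you use).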

Yes, I have tried that. I’ve also tried using cudaDeviceScheduleYield instead of cudaDeviceScheduleBlockingSync, with no success though.
Any other ideas, please?

another relevant thread

That sounds like a “nasty hack”… However, I have nothing better at the moment, so it’s worth a try. Did that hack really work for you?

And anyway… I’d like to find a proper solution for the issue…

When I create a simple application and test the difference between use of

cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

or not, at the point of cudaDeviceSynchronize(), what I observe is that when I don’t use it, top reports about 100% CPU usage by the process. When I do use it, top reports approximately 0% CPU usage by the process.

$ cat t5.cu
const unsigned long long delay = 30000000000ULL;

// Busy-wait on the GPU clock for roughly `delay` cycles so the kernel
// runs long enough to observe host CPU behavior during the sync.
__global__ void k(){

  unsigned long long start = clock64();
  while (clock64() < (start+delay));
}

int main(){
#ifdef USE_BLOCK
  // Must be set before the context is created (i.e. before the launch).
  cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
#endif
  k<<<1,1>>>();
  cudaDeviceSynchronize();  // host waits here; this is where CPU usage differs
}

$ nvcc -o t5 t5.cu
$ ./t5 &
[1] 9970
$ top -p 9970
top - 10:02:37 up 297 days, 13:36,  2 users,  load average: 0.27, 0.15, 0.09
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.8 us,  1.4 sy,  0.0 ni, 95.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13183105+total, 11599987+free,  1562592 used, 14268592 buff/cache
KiB Swap:  4194300 total,  3770732 free,   423568 used. 12943844+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9970 user2     20   0  0.157t 131180 122196 R  99.7  0.1   0:09.94 t5

[1]+  Done                    ./t5

$ nvcc -o t5 t5.cu -DUSE_BLOCK
$ ./t5 &
[1] 10008
$ top -p 10008
top - 10:04:18 up 297 days, 13:38,  2 users,  load average: 0.11, 0.14, 0.09
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13183105+total, 11600230+free,  1560092 used, 14268668 buff/cache
KiB Swap:  4194300 total,  3770732 free,   423568 used. 12944084+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10008 user2     20   0  0.157t 131088 122108 S   0.0  0.1   0:00.29 t5

Thanks, Robert
That’s expected behavior, as I understand it. But any idea why I’m getting different results with my code?

No, I can’t explain the behavior of code that I can’t inspect. And I’m unlikely to look at test cases that:

  • are not complete
  • are not short (say, less than 100 lines)
  • are not posted properly in this forum, using proper code formatting
  • are provided via an attachment
  • are provided via an offsite link

Do as you wish, of course.

Well, I’ve run your properly formatted code.
The result is that I can’t see any difference on my test system. Both variants give something like 0.3-3.5 us. Any idea why? Is the CUDA version important?

I have no idea what that means.

%Cpu(s): 2.8 us
In my case it’s not constant but changes within the range mentioned.

Those are not the numbers of interest. We’re interested in this:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9970 user2     20   0  0.157t 131180 122196 R  99.7  0.1   0:09.94 t5
                                                ^^^^
                                                ||||

vs. this:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10008 user2     20   0  0.157t 131088 122108 S   0.0  0.1   0:00.29 t5
                                                ^^^^
                                                ||||

Ok. Got it. Thanks.
However, the same approach does not work in the code I have, for some reason.

Make sure you haven’t set CUDA_LAUNCH_BLOCKING=1 somewhere in your environment. There might be Linux configuration switches that interfere as well, but I cannot think of any off the top of my head.
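
As a quick sanity check, something like this (just a sketch, separate from your code) will report whether the variable is actually visible to the process:

// Minimal sketch: report whether CUDA_LAUNCH_BLOCKING is set for this process.
#include <cstdio>
#include <cstdlib>

int main(){
  const char *v = std::getenv("CUDA_LAUNCH_BLOCKING");
  printf("CUDA_LAUNCH_BLOCKING = %s\n", v ? v : "(not set)");
  return 0;
}

If it turns out to be set to 1, every kernel launch becomes synchronous and the host may spin-wait on each one, regardless of the device flags you set.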