Will the high-load CPU affect the performance of CUDA?

My colleague told me that the high CPU load of my program reduces the performance of his deep-learning program (on the GPU). This is counter-intuitive, because we use Ubuntu, whose kernel schedules CPU time fairly among processes. Is there a reasonable way to interpret this phenomenon? And how can I locate the bottleneck?

It’s a matter of latency.

The Linux kernel has a “jiffy” interval in its process/thread scheduler. Processes competing for the same CPU core are switched round-robin style after this interval has passed. It is typically 10 ms on stock Linux distributions; some real-time kernel variants use shorter intervals, such as 1 ms.

Your high-load program might not be preempted for up to 10 ms before control is handed over to the CUDA program, even if a blocking CUDA operation has just finished.

This will mostly affect the performance of CUDA programs that perform many short blocking operations (short kernel launches, or short device-to-host and host-to-device transfers), because each blocking wait can be stretched by that scheduling latency.
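To illustrate the pattern that is sensitive to this (a minimal sketch, not anyone’s actual workload): a loop of tiny kernels, each followed by a blocking synchronize, makes the host wait once per iteration, so every one of those waits can be lengthened by scheduler latency when the CPU is loaded.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Tiny kernel: it finishes in microseconds, so total run time is
// dominated by how quickly the host thread gets scheduled again
// after each blocking wait.
__global__ void tinyKernel(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] = x[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 256;
    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    // 1000 launch+sync pairs: each cudaDeviceSynchronize() blocks the
    // host thread, and under heavy CPU load the host may not run again
    // for up to a scheduler interval after the GPU has finished.
    for (int iter = 0; iter < 1000; ++iter) {
        tinyKernel<<<1, n>>>(d_x);
        cudaDeviceSynchronize();
    }

    cudaFree(d_x);
    printf("done\n");
    return 0;
}
```

On an idle machine the loop is bounded by launch overhead; on a heavily loaded machine the same loop can take substantially longer, which is one way to reproduce the effect your colleague reports.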

It might help to assign a higher priority to the CUDA program, and possibly to experiment with different CUDA device settings concerning the mechanism for handling wait operations on the GPU:

cudaSetDeviceFlags(cudaDeviceScheduleAuto)
cudaSetDeviceFlags(cudaDeviceScheduleSpin)
cudaSetDeviceFlags(cudaDeviceScheduleYield)
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)

If you give a high or real-time process priority to the CUDA process, you would want to use either yield or blocking sync instead of spinning (polling), so the CUDA process does not hog a CPU core while it waits.
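A minimal sketch of setting one of these flags. Note that cudaSetDeviceFlags() must be called before the CUDA context is created, i.e. before the first runtime call that touches the device:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Must be called before the context is created (before the first
    // kernel launch, cudaMalloc, etc.). With BlockingSync the host
    // thread sleeps inside cudaDeviceSynchronize() and friends instead
    // of spin-polling, so it does not burn a CPU core while waiting,
    // at the cost of some extra wake-up latency.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // First runtime call that touches the device; the context is now
    // created with the blocking-sync wait behavior.
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, 1024 * sizeof(float));
    cudaDeviceSynchronize();
    cudaFree(d_buf);

    printf("context created with blocking sync\n");
    return 0;
}
```

Swapping in cudaDeviceScheduleSpin or cudaDeviceScheduleYield lets you compare the wait mechanisms directly under your colleague's CPU load.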

Christian

In addition to the issue with small CUDA kernels that cbuchner1 pointed out, there is also the issue that most CUDA-accelerated applications still spend some percentage of their time in host code, and the run time of that code can be negatively affected by high CPU load. For example, most CPUs use dynamic clocking these days, and CPU clock frequencies drop under heavier load.

Under Windows, I observe about 10% slowdown in one particular CUDA-accelerated application under heavy CPU load. It would be interesting to learn what kind of slowdown your colleague observes for his CUDA-accelerated deep-learning application under heavy CPU load.