Helping Cuda on the Host to monopolize a core

I have a multi-threaded Java app that uses Cuda via JNI.

The system it is running on is RHEL 7 with Cuda 10.2, 2 CPUs with 8 cores each, and 4 GPUs. Hyper-threading is turned off.

I need to do some matrix multiplication. I have 8 Java threads. The first 4 threads each get one GPU to use; the last 4 threads use Intel MKL to do the matrix multiplication.

For the Cuda calls, the host code fires off a matrix multiplication request, then calls cudaDeviceSynchronize(). When this returns, it calls code to check whether the multiplication results have converged, and makes another call to cudaDeviceSynchronize() while it waits for an answer.

Before I run my program, I set MKL_NUM_THREADS to a value ranging from 4 to 16.

The bigger the number I assign to MKL_NUM_THREADS, the longer the Cuda code takes to run. (Yes, the MKL code goes faster. But the speedup in the MKL code is far and away overwhelmed by the slowdown in the Cuda code.)

Is there any way I can keep the MKL code from grabbing CPU cycles from the cores that are running the host Cuda code? Is there a better routine to call than cudaDeviceSynchronize? (I’m assuming that the slowdown comes from MKL’s use of those cores, which means cudaDeviceSynchronize doesn’t return as soon as the Cuda code on the GPU is done.)

Any thoughts or suggestions greatly appreciated.

The CUDA driver is host code. The CUDA runtime is host code. Your own CUDA-accelerated app contains host code. All this host code competes with other host code for resources (CPU cores, access to system memory), and is also impacted by dynamic CPU downclocking due to higher load on the CPU or use of AVX2.

Reserving CPU cores for CUDA’s host code would therefore only partially address the effects you observe, but it is worth a try. For long-running applications I sometimes do this manually using system-level affinity facilities, with a positive impact in the single-digit percent range. No two applications are alike, so your results could easily differ.
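To make "system-level affinity facilities" a bit more concrete, here is a minimal sketch of pinning the calling thread to one core on Linux with sched_setaffinity. It assumes glibc on Linux, and the function names pin_current_thread_to_core and first_allowed_core are hypothetical helpers made up for illustration; they are not part of CUDA, MKL, or JNI. I have not tried this from a JNI library, so treat it as a starting point rather than a tested recipe.

```c
#define _GNU_SOURCE          /* needed for cpu_set_t and the CPU_* macros */
#include <sched.h>

/* Hypothetical helper: return the lowest-numbered core the calling
 * thread is currently allowed to run on, or -1 on error. */
int first_allowed_core(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int i = 0; i < CPU_SETSIZE; i++) {
        if (CPU_ISSET(i, &allowed))
            return i;
    }
    return -1;
}

/* Hypothetical helper: restrict the calling thread to a single core.
 * On Linux, a pid of 0 means "the calling thread".
 * Returns 0 on success, -1 on error (errno is set by the kernel). */
int pin_current_thread_to_core(int core_id)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core_id, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```

In your setup, each of the 4 GPU-driving threads would presumably call something like pin_current_thread_to_core from its native code before issuing CUDA work, each with a different core. Note that this only fixes where the CUDA host threads run; it does not by itself keep the MKL threads off those cores, so the MKL threads would need to be restricted to the remaining cores as well.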

Hand-wavy pointer: Intel’s runtime library provides a thread-affinity interface which you might want to explore.

I haven’t really played with this and I have never used Java; maybe other forum participants can provide deeper insights.