I have a multi-threaded Java app that uses CUDA via JNI.
The system it is running on is RHEL 7 with CUDA 10.2, two CPUs with 8 cores each (hyper-threading turned off), and 4 GPUs.
I need to do some matrix multiplication, and I have 8 Java threads. The first 4 threads each get one GPU to use; the last 4 threads use Intel MKL to do their matrix multiplications.
For the CUDA calls, the host code fires off a matrix-multiplication request, then calls cudaDeviceSynchronize(). When that returns, it calls code to check whether the multiplication results have converged, then makes another call to cudaDeviceSynchronize() while it waits for an answer.
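To make the pattern concrete, here is a stripped-down sketch of what the host side is doing (the real code goes through JNI, and multiplyKernel / hasConverged are made-up stand-ins for my actual kernel and convergence check):

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholders for the real JNI-backed code.
__global__ void multiplyKernel(const float *a, const float *b, float *c, int n);
bool hasConverged(const float *c, int n);

void runUntilConverged(const float *dA, const float *dB, float *dC,
                       float *hC, int n)
{
    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);
    do {
        // Fire off the matrix-multiplication request...
        multiplyKernel<<<grid, block>>>(dA, dB, dC, n);
        // ...then block the host until the GPU is done. My suspicion is
        // that this wait is burning CPU while MKL threads want the cores.
        cudaDeviceSynchronize();
        cudaMemcpy(hC, dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    } while (!hasConverged(hC, n));
}
```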
Before I run my program, I set

export MKL_NUM_THREADS=<n>

with values of <n> ranging from 4 to 16.
The bigger the value I assign to MKL_NUM_THREADS, the longer the CUDA code takes to run. (Yes, the MKL code goes faster, but that speedup is far and away overwhelmed by the slowdown in the CUDA code.)
Is there any way I can keep the MKL code from grabbing CPU cycles from the cores that are running the host-side CUDA code? Is there a better routine to call than cudaDeviceSynchronize()? (I'm assuming the slowdown comes from MKL's use of the cores, meaning that cudaDeviceSynchronize() doesn't return as soon as the CUDA code on the GPU is actually done.)
Any thoughts or suggestions greatly appreciated.