OpenMP slows down CUDA?

Hello. I have MVAPICH2 2.3 (built with --enable-cuda), CUDA 9.0, and a GTX 680.

My code:

***

double time = MPI_Wtime();
cudaMemcpy(*****)
cout << " time = " << MPI_Wtime() - time << endl;

I compile with nvcc + mpicxx using the flags -fopenmp -O3, and I set the environment variables OMP_NUM_THREADS=K and MV2_USE_CUDA=1.

Results:
K = 1 => have time(1)
K = 2 => have time(2) > time(1)

K = 8 => have time(8) > time(7) > … > time(1)

  • time(8) is approximately equal to 2*time(1)

I also got a similar result when using cudaMemcpyAsync.

Why is this happening? How can I avoid it?

That might depend on ***, *****, the hardware you are running this on, and how you launch it / what else is running on the node.
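
To narrow that down, it may help to time the copy more than once. A single timed call can include one-time setup cost (the first CUDA runtime call in a process typically triggers context initialization) and noise from whatever else is running on the node. Below is a minimal sketch of a timing harness in plain C++. The helper names time_repeated and median are hypothetical and not part of the original code, and std::chrono stands in for MPI_Wtime so that the sketch has no MPI or CUDA dependency.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not from the original post): calls `op` `warmup` times
// without timing it, then `reps` more times with timing, and returns the
// duration of each timed call in seconds.
std::vector<double> time_repeated(const std::function<void()>& op,
                                  std::size_t warmup, std::size_t reps) {
    // Untimed calls absorb one-time setup cost.
    for (std::size_t i = 0; i < warmup; ++i) op();

    std::vector<double> samples;
    samples.reserve(reps);
    for (std::size_t i = 0; i < reps; ++i) {
        const auto start = std::chrono::steady_clock::now();
        op();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return samples;
}

// Median of the samples; less sensitive than the mean to an occasional
// slow call. Takes the vector by value because it sorts its own copy.
double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 == 1 ? samples[mid]
                                   : 0.5 * (samples[mid - 1] + samples[mid]);
}
```

In the code from the question, op would be a lambda wrapping the cudaMemcpy call. For the cudaMemcpyAsync variant it would also need a cudaDeviceSynchronize() after the copy, so that the transfer has finished before the timer stops. Comparing median(time_repeated(op, warmup, reps)) for each value of OMP_NUM_THREADS should show whether the slowdown is still there once warm-up and outliers are excluded.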