Hello. I have MVAPICH2 2.3 (built with --enable-cuda), CUDA 9.0, and a GTX 680.
My code:
***
double time = MPI_Wtime();
cudaMemcpy(*****);
cout << " time = " << MPI_Wtime() - time << endl;
I compile with nvcc + mpicxx using the flags -fopenmp -O3, and I set the environment variables OMP_NUM_THREADS=K and MV2_USE_CUDA=1.
Results (time(K) is the time printed with OMP_NUM_THREADS=K):
K = 1 => time(1)
K = 2 => time(2) > time(1)
…
K = 8 => time(8) > time(7) > … > time(1)
time(8) is approximately equal to 2*time(1).
I also get a similar result when using cudaMemcpyAsync.
Why is this happening, and how can I avoid it?
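
A minimal reproducer of the measurement above might look like the sketch below. The buffer size (64 MiB), the pinned host buffer allocated with cudaMallocHost, the warm-up copy, the cudaDeviceSynchronize calls, and the CUDA_CHECK macro are all assumptions made for illustration; the elided parts of the original code are not reproduced.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>
#include <iostream>

// Hypothetical helper: abort the MPI job if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            MPI_Abort(MPI_COMM_WORLD, 1);                                 \
        }                                                                 \
    } while (0)

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    const size_t bytes = 64u << 20;  // assumed transfer size: 64 MiB
    void* host = nullptr;
    void* dev = nullptr;
    CUDA_CHECK(cudaMallocHost(&host, bytes));  // pinned host memory (assumption)
    CUDA_CHECK(cudaMalloc(&dev, bytes));
    std::memset(host, 0, bytes);               // touch the host buffer once

    // Warm-up copy so one-time setup costs stay out of the measurement.
    CUDA_CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaDeviceSynchronize());

    // Timed copy, measured the same way as in the post (MPI_Wtime).
    double t0 = MPI_Wtime();
    CUDA_CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaDeviceSynchronize());
    double elapsed = MPI_Wtime() - t0;

    std::cout << " time = " << elapsed << " s" << std::endl;

    CUDA_CHECK(cudaFree(dev));
    CUDA_CHECK(cudaFreeHost(host));
    MPI_Finalize();
    return 0;
}
```

Running this once per value of OMP_NUM_THREADS, with MV2_USE_CUDA=1 set as before, should show whether the slowdown appears even when the program contains no OpenMP regions of its own.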