MPI + CUDA Fortran

Hi, all!
I am trying to use MPI to manage code running on multiple GPUs. When exchanging information between GPUs, I first use cudaMemcpy to transfer arrays from the GPU to the CPU, then use MPI_Send and MPI_Recv to transfer them to the other process, and finally use cudaMemcpy to transfer the arrays from the CPU back to the GPU.
I find that cudaMemcpy costs a lot of time compared to the computation, so I am wondering how to improve the efficiency of data transfer between different GPUs. Should I follow this path (GPU-CPU-GPU) when transferring data? Is there a better way to handle the array transfers?
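For illustration, here is a minimal CUDA Fortran sketch of the GPU-CPU-GPU staging path I described. The names (a_dev, send_host, recv_host, n) and the ring-style neighbour exchange are just placeholders, not my actual code, and error checking is omitted. I used MPI_Sendrecv here instead of separate MPI_Send/MPI_Recv only to keep the example deadlock-free.

```fortran
! Minimal sketch of the GPU-CPU-GPU staging path (placeholder names,
! no error checking). Each rank sends its array to the next rank and
! receives from the previous rank, in a ring.
program staged_exchange
  use cudafor
  use mpi
  implicit none
  integer, parameter :: n = 1024
  real, device, allocatable :: a_dev(:)
  real, allocatable :: send_host(:), recv_host(:)
  integer :: rank, nprocs, next, prev, ierr, istat
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  next = mod(rank + 1, nprocs)            ! rank we send to
  prev = mod(rank - 1 + nprocs, nprocs)   ! rank we receive from

  allocate(a_dev(n), send_host(n), recv_host(n))
  a_dev = real(rank)                      ! stand-in for the computed data

  ! Step 1: device -> host (count is in elements in CUDA Fortran)
  istat = cudaMemcpy(send_host, a_dev, n)

  ! Step 2: host -> host across processes; MPI_Sendrecv avoids the
  ! deadlock that paired blocking MPI_Send/MPI_Recv calls can cause
  call MPI_Sendrecv(send_host, n, MPI_REAL, next, 0, &
                    recv_host, n, MPI_REAL, prev, 0, &
                    MPI_COMM_WORLD, status, ierr)

  ! Step 3: host -> device
  istat = cudaMemcpy(a_dev, recv_host, n)

  call MPI_Finalize(ierr)
end program staged_exchange
```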

Thanks.

Sorry, I made a mistake in my use of cudaMemcpy. This problem has been solved.