Multi-GPU Memory Communication

Dear all,

I have a problem with my multi-GPU program: I need to combine the partial vectors computed on each device into one complete vector on every device.

Here is my part of the code:

for(devID = 0; devID < N_GPU; devID++) {        // axpy on each device's partial vector
  cudaSetDevice(devID);
  cublasSaxpy(handle_M[devID], N_M, &plus, d_R_M[devID], 1, d_P_M[devID], 1);
}
for(devID = 0; devID < N_GPU; devID++) {        // gather the partial results to the host
  cudaSetDevice(devID);
  cudaMemcpyAsync(h_P_M[devID], d_P_M[devID], vectSize_M, cudaMemcpyDefault); // use async and pinned memory
}
for(devID = 0; devID < N_GPU; devID++) {
  cudaSetDevice(devID);
  cudaDeviceSynchronize();
}
for(devID = 0; devID < N_GPU; devID++) {        // broadcast the combined vector back to every device
  cudaSetDevice(devID);
  cudaMemcpyAsync(d_P[devID], h_P, vectSize, cudaMemcpyDefault);
}
for(devID = 0; devID < N_GPU; devID++) {
  cudaSetDevice(devID);
  cudaDeviceSynchronize();
}

As you can see in the code, the axpy is distributed across the devices, each one generating its part of the P vector (d_P_M[devID]). After it finishes, each d_P_M is copied to the host, where the parts are combined into h_P. The host pointers are set up as follows:

for(devID = 0; devID < N_GPU; devID++) {
  h_P_M[devID] = h_P + (devID * mtxSize / N_GPU);
}

The combined h_P is then distributed back to each device as d_P[devID]. The problem is that this process (copying from every device to the host to combine the parts, then copying the combined data back to every device) takes most of the run time. Is it possible to combine the d_P_M[devID] parts into d_P[devID] directly, without going through the host? Or is there any other way to optimize this operation?
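For example, is something like the sketch below possible? This is only a rough illustration of what I have in mind, not tested code; it assumes the GPUs can do peer-to-peer copies (e.g. cudaDeviceCanAccessPeer / cudaDeviceEnablePeerAccess succeed) and that d_P and d_P_M hold float pointers, using the same names as my code above.

// Rough sketch only: copy each device's partial result directly into every
// device's full vector with peer-to-peer copies, skipping the host entirely.
// Assumes peer access between the GPUs and float pointers in d_P / d_P_M.
for(int src = 0; src < N_GPU; src++) {
  for(int dst = 0; dst < N_GPU; dst++) {
    // each source slice lands at the same offset it occupies in h_P
    float *dstPtr = d_P[dst] + (src * mtxSize / N_GPU);
    cudaMemcpyPeerAsync(dstPtr, dst, d_P_M[src], src, vectSize_M);
  }
}
for(int devID = 0; devID < N_GPU; devID++) {
  cudaSetDevice(devID);
  cudaDeviceSynchronize();
}

Would that be a reasonable approach, or would the peer copies just be staged through the host internally anyway?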

Thank you,