CUDA-aware MPI on 1 GPU transferring data to host?


I am using CUDA-aware MPI with PGI 17.7 with the OpenMPI that came with the compiler.

The code works, but when I profile it, the MPI routines take longer than they did in the CPU-only code.

Using pgprof, I can see several async memory transfers happening from the device to host and host to device around the MPI calls.

I am running the code on only 1 GPU (a GeForce 970) so I do not understand why the code is making these transfers.

(I have learned that GeForce does not support GPUDirect RDMA, but even so, if the MPI destination is the same card, shouldn’t the compiler/library use a device-to-device copy instead of the host transfers?)


Hi Sumseq,

Yes, CUDA-aware MPI should make device-to-device memory transfers even on a GeForce card. RDMA buffers aren’t necessary since the transfer should be done over IPC.

While not a GeForce 970, I was able to test an MPI code on a GTX 690. In my example, I see no extra HtoD or DtoH transfers, only a DtoD.

With your code, do you use the OpenACC “host_data” construct around your MPI calls so that the device data is passed to MPI? Or if using CUDA Fortran, are you passing in device arrays?



I am using the “host_data” as follows:

!$acc host_data use_device(v%r)
      call MPI_Irecv (v%r(:,:,  1),lbuf3r,ntype_real,iproc_pm,tagr,
     &                comm_all,req(1),ierr)
      call MPI_Irecv (v%r(:,:,n3r),lbuf3r,ntype_real,iproc_pp,tagr,
     &                comm_all,req(2),ierr)
      call MPI_Isend (v%r(:,:,n3r-1),lbuf3r,ntype_real,iproc_pp,tagr,
     &                comm_all,req(3),ierr)
      call MPI_Isend (v%r(:,:,    2),lbuf3r,ntype_real,iproc_pm,tagr,
     &                comm_all,req(4),ierr)

      call MPI_Waitall (4,req,MPI_STATUSES_IGNORE,ierr)
!$acc end host_data

where v is a derived type that I copied to the device using deepcopy:

!$acc enter data copyin(v)


I also have “managed” turned on as I have been having problems with some arrays with it turned off.

Hi sumseq,

By “deepcopy”, do you mean the new PGI 17.7 implicit deep copy beta feature, i.e. “-ta=tesla:deepcopy”, or are you manually deep copying the structure? For example:

!$acc enter data copyin(v)
!$acc enter data copyin(v%r)

In my example I wasn’t using a user-defined type, but I’ll try to make a reproducing example similar to yours. If you’re using implicit deep copy, it’s possible that this is a bug in the beta feature when it interacts with the host_data construct. I’ll investigate.
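
For reference, a manual deep copy combined with host_data could be sketched roughly as below. This is only an illustration, not the test program mentioned in this thread: the type name field, the array sizes, and the ring exchange are hypothetical, and MPI_Sendrecv is used in place of the MPI_Irecv/MPI_Isend calls from the question to keep the sketch short. It assumes a CUDA-aware MPI build and a compiler that accepts a derived-type member in use_device, as PGI 17.7 appears to do for the code above.

```fortran
! Sketch only: type name, sizes and the ring exchange are hypothetical.
program manual_deepcopy_mpi
  use mpi
  implicit none

  type field
    real(8), allocatable :: r(:,:,:)
  end type field

  type(field) :: v
  integer, parameter :: n1 = 16, n2 = 16, n3 = 8
  integer :: ierr, rank, nproc, up, down, lbuf
  integer :: stat(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  up   = mod(rank + 1, nproc)           ! neighbour that receives our slab
  down = mod(rank - 1 + nproc, nproc)   ! neighbour that sends to us
  lbuf = n1*n2                          ! one contiguous (:,:,k) slab

  allocate(v%r(n1, n2, n3))
  v%r = real(rank, 8)

  ! Manual deep copy: the parent first, then the member.
  !$acc enter data copyin(v)
  !$acc enter data copyin(v%r)

  ! Inside host_data, v%r refers to the device copy, so a
  ! CUDA-aware MPI is handed device addresses for both buffers.
  !$acc host_data use_device(v%r)
  call MPI_Sendrecv(v%r(:,:,n3-1), lbuf, MPI_DOUBLE_PRECISION, up,   0, &
                    v%r(:,:,1),    lbuf, MPI_DOUBLE_PRECISION, down, 0, &
                    MPI_COMM_WORLD, stat, ierr)
  !$acc end host_data

  ! Bring the data back to the host only to inspect it; slab 1
  ! should now hold the rank number of the "down" neighbour.
  !$acc update self(v%r)
  print *, 'rank', rank, 'received', v%r(1,1,1)

  ! Release the member before the parent.
  !$acc exit data delete(v%r)
  !$acc exit data delete(v)
  deallocate(v%r)
  call MPI_Finalize(ierr)
end program manual_deepcopy_mpi
```

In a profile of a program like this, the update self would show up as a DtoH copy, so it is worth separating such explicit updates from any transfers attributed to the MPI calls themselves.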



Yes, I mean the beta deepcopy feature in 17.7.

Hi sumseq,

I updated my OpenMPI+OpenACC test program to pass allocatable array data members of a UDT to MPI_SEND/MPI_RECV wrapped in a host_data directive. As before, I only see the DtoD transfers associated with the MPI calls.

The only HtoD transfer I have is for a Fortran descriptor being passed to the compute region just after the MPI calls.

Does your profile show any DtoD transfers? Could the HtoD and/or DtoH transfers be accounted for by some other data movement?

If you want, you can send a copy of the code to PGI Customer Service and I can take a look and try to determine where the extra copies are coming from.



Could it have to do with my arrays being declared as pointers instead of allocatable?

I highly doubt it. Pointers are treated the same as allocatables: in both cases, “host_data” passes the device pointer to your MPI call.
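
One way to check this on a given setup is to print the address seen inside and outside a host_data region for both kinds of array. The following is a minimal sketch with hypothetical names (a, p, and the program name); it assumes the compiler substitutes the device address when c_loc is applied to an array listed in use_device. With separate host and device copies the two addresses should differ for both arrays; with managed memory they may well be identical, since host and device then share one address.

```fortran
! Sketch only: array names and sizes are hypothetical.
program host_data_address
  use iso_c_binding
  implicit none

  real(8), allocatable, target :: a(:)   ! allocatable array
  real(8), pointer             :: p(:)   ! pointer array
  integer(c_intptr_t) :: host_a, dev_a, host_p, dev_p

  allocate(a(1024), p(1024))
  a = 0.0d0
  p = 0.0d0

  !$acc enter data copyin(a, p)

  ! Addresses as seen by ordinary host code.
  host_a = transfer(c_loc(a), host_a)
  host_p = transfer(c_loc(p), host_p)

  ! Addresses as seen inside host_data, i.e. what a library
  ! call such as MPI_Isend would be handed.
  !$acc host_data use_device(a, p)
  dev_a = transfer(c_loc(a), dev_a)
  dev_p = transfer(c_loc(p), dev_p)
  !$acc end host_data

  print '(a,2z18)', 'allocatable host/device: ', host_a, dev_a
  print '(a,2z18)', 'pointer     host/device: ', host_p, dev_p

  !$acc exit data delete(a, p)
  deallocate(a, p)
end program host_data_address
```

If the program is built without OpenACC enabled, the directives are ignored and both columns show the same host address, which gives a simple baseline for comparison.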