cudaMemcpy fails copying ACC variable to CUF variable

This sample code is to expose a problem when I am trying to make cudamemcpy between the device copy of an ACC variable and a CUF device variable:

program cuf2acc

use cudafor
use openacc

implicit none

integer, parameter :: DA=8

real, allocatable, dimension(:) :: sendbuf, recvbuf
real, device, allocatable, dimension(:) :: sendbuf_d, recvbuf_d

integer :: i

call acc_init( acc_device_default )

allocate(sendbuf(DA),recvbuf(DA))
allocate(sendbuf_d(DA),recvbuf_d(DA))

sendbuf=-2. ; recvbuf=-2.
sendbuf_d=1 ; recvbuf_d=-1.

!$acc data copy( sendbuf, recvbuf )
i = cudaDeviceSynchronize()
!$acc host_data use_device( sendbuf, recvbuf )
i = cudaMemcpy(sendbuf,sendbuf_d,DA,cudaMemcpyDeviceToDevice)
i = cudaMemcpy(recvbuf,recvbuf_d,DA,cudaMemcpyDeviceToDevice)
!$acc end host_data
!$acc end data

write(*,'("S: "8F8.2)') (sendbuf(i),i=1,DA)
write(*,'("R: "8F8.2)') (recvbuf(i),i=1,DA)

end program cuf2acc

It’s compiled by

$ pgf90 -o cuf2acc -Mcuda -acc -ta=nvidia,cc20 cuf2acc.f90

When it runs it always gives an error:

0: copyover Memcpy (dst=0xa15d80, src=0xb00200000, size=32) FAILED: 11(invalid argument)

It seems that the dst is a host address. So I try removing the “host_data” construct around the memcpy, and the error is the same. It seems to me that the host_data construct never exposes the device address of the ACC variables. Why is it so and how can I get around this?

Hi rikisyo,

Are you trying to test the “host_data” directive or just need to copy the data between the device and host? The “host_data” directive isn’t supported yet in Fortran and is why this example is failing.

If you just need to copy the data, then the simple method is to just assign the host to device array. No need for OpenACC at all.

program cuf2acc

use cudafor
use openacc

implicit none

integer, parameter :: DA=8

real, allocatable, dimension(:) :: sendbuf, recvbuf
real, device, allocatable, dimension(:) :: sendbuf_d, recvbuf_d

integer :: i

call acc_init( acc_device_default )

allocate(sendbuf(DA),recvbuf(DA))
allocate(sendbuf_d(DA),recvbuf_d(DA))

sendbuf=-2. ; recvbuf=-2.
sendbuf_d=1 ; recvbuf_d=-1.

sendbuf=sendbuf_d
recvbuf=recvbuf_d

write(*,'("S: "8F8.2)') (sendbuf(i),i=1,DA)
write(*,'("R: "8F8.2)') (recvbuf(i),i=1,DA)

end program cuf2acc
  • Mat

Thanks for the reply!

My original purpose is not just copying data back-and-forth between the host and the device, but to use CUDA-aware MPI calls to do face data exchange of my multiblock CFD code. Particularly, I would be extremely happy if there is a supported way to do “CUDA-AWARE MPI AND OPENACC” as claimed in the following webpage:

https://developer.nvidia.com/content/benchmarking-cuda-aware-mpi

If host_data is not supported then there is no way these MPI API can work with ACC variables, much for the same reason as the cudaMemcpy failure. Is there a way to safely expose the device address so that cudaMemcpy can work?

Hi rikisyo,

First let me clarify that the host_data directive is supported in C and under development in Fortran.

In reading the article, it seems to me that you would simply pass in the CUDA Fortran device array. Depending on the MPI implementation, you may need to write an interface to the MPI_SENDRECV buffer.

  • Mat