update host async

Hi all!

Here is my test code:

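subroutine zz(a, b)
INTEGER, PARAMETER :: Nvec = 10000, Nchunks = 10000
REAL*8 :: a(*), b(*)
INTEGER :: i, j, k, l
!$acc data create(a(1:Nvec*Nchunks),b(1:Nvec*Nchunks))
DO j = 0,Nchunks-1
  ! Chunk bounds; k and l are not defined in the post as it appears,
  ! so these are inferred from the indexing below.
  k = j*Nvec + 1
  l = k + Nvec - 1
  !$acc update device(a(k:l)) async(j)
  !$acc parallel loop async(j)
  DO i = 1,Nvec
    b(k+i-1) = SQRT(a(k+i-1)*2d0)
  END DO
  !$acc update host(b(k:l)) async(j)
END DO
!$acc wait
!$acc end data
end subroutine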

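Program main
INTEGER, PARAMETER :: Nvec = 10000, Nchunks = 10000
REAL*8 :: a(1:Nvec*Nchunks), b(1:Nvec*Nchunks)
INTEGER :: j
DO j = 0,Nchunks*Nvec-1
  ! Assumed initialization; the loop body is missing from the post.
  a(j+1) = j
END DO
call zz(a,b)
write(*,*) "sum = ",SUM(b)
end program main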

The profiler shows that the “update host async” directive produces a synchronous call:

method=[ memcpyHtoDasync ] gputime=[ 52.672 ] cputime=[ 7.000 ]
method=[ zz_10_gpu ] gputime=[ 5.856 ] cputime=[ 7.000 ] occupancy=[ 0.667 ]
method=[ memcpyDtoH ] gputime=[ 49.184 ] cputime=[ 120.000 ]
method=[ memcpyHtoDasync ] gputime=[ 53.856 ] cputime=[ 6.000 ]
method=[ zz_10_gpu ] gputime=[ 5.248 ] cputime=[ 7.000 ] occupancy=[ 0.667 ]
method=[ memcpyDtoH ] gputime=[ 49.143 ] cputime=[ 121.000 ]

Is this my error, or has “async” not been implemented yet?

Hi Alexey,

What’s happening is that a helper thread is being spawned to handle the device to host transfer. So while we don’t call CUDA’s async memory routine, the thread, and thus the copy, is being run asynchronously to the main host thread.
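The helper-thread idea can be sketched in plain C with POSIX threads. This is only an illustration of the general mechanism, not the PGI runtime's actual code: `copy_job`, `copy_async` and `copy_wait` are hypothetical names, and `memcpy` stands in for the blocking device-to-host copy.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical descriptor for one pending copy (not a real runtime type). */
typedef struct {
    void       *dst;
    const void *src;
    size_t      bytes;
    pthread_t   thread;
} copy_job;

/* Runs on the helper thread: an ordinary blocking copy.
   memcpy is a stand-in for a synchronous device-to-host transfer. */
static void *copy_worker(void *arg)
{
    copy_job *job = arg;
    memcpy(job->dst, job->src, job->bytes);
    return NULL;
}

/* Start the copy on a helper thread and return immediately.
   Returns 0 on success, like pthread_create. */
static int copy_async(copy_job *job, void *dst, const void *src, size_t bytes)
{
    job->dst = dst;
    job->src = src;
    job->bytes = bytes;
    return pthread_create(&job->thread, NULL, copy_worker, job);
}

/* Plays the role of a wait: block until the helper thread has finished. */
static int copy_wait(copy_job *job)
{
    return pthread_join(job->thread, NULL);
}
```

The calling thread is free to do other work between `copy_async` and `copy_wait`, even though the copy itself is a blocking call. That would be consistent with a profile that lists a synchronous memcpyDtoH while the host thread is not actually stalled on it.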

Hope this helps,

Thanks Mat!

Are you going to implement it in the next release?



No. The problem is that there are no callbacks from the device, so when doing an async device-to-host transfer there isn’t a way to know that the data transfer is complete. Hence the use of the helper thread.

- Mat


You have CUDA at the lower layer, so you could use:

cudaError_t 	cudaStreamCreate (cudaStream_t *pStream)
cudaError_t 	cudaStreamDestroy (cudaStream_t stream)
cudaError_t 	cudaStreamQuery (cudaStream_t stream)
cudaError_t 	cudaStreamSynchronize (cudaStream_t stream)

In the first place, we have to link against the driver API, not the runtime API.
In the second place, these aren’t callbacks, these would require polling or waiting for the event to finish, neither of which is asynchronous from the host code.
We are looking at reimplementing async and perhaps getting rid of the auxiliary thread, but we’re not sure about the effect on performance. The real problem is that true async data copies require pinned host memory, and the interface for pinning host data is not composable: for example, both the user program and the OpenACC runtime may try to pin and unpin the same memory.
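To make the composability concern concrete, here is a toy model in C. It does not call or imitate the CUDA API; `buffer`, `pin` and `unpin` are hypothetical, and the single flag is an assumption chosen only to show how two independent owners of the same memory can interfere when pinning carries no ownership information.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a host buffer with one "pinned" flag and no record of
   who pinned it. Not a real CUDA or OpenACC type. */
typedef struct {
    bool pinned;
} buffer;

static void pin(buffer *b)   { b->pinned = true;  }
static void unpin(buffer *b) { b->pinned = false; }

/* Returns whether the buffer is still pinned at the moment the
   runtime's (imagined) async copy would still be in flight. */
static bool runtime_copy_sees_pinned_memory(void)
{
    buffer b = { false };

    pin(&b);    /* the user program pins the buffer for its own use   */
    pin(&b);    /* the runtime pins the same buffer for an async copy */
    unpin(&b);  /* the user program finishes first and unpins         */

    return b.pinned;  /* false: the runtime's pin was silently undone */
}
```

In this model the runtime cannot rely on its own pin, because the user's unpin cancels it. A real implementation would need some form of shared bookkeeping between both parties, which is what a non-composable interface does not provide.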

Hi guys,

I’m now working with 13.1 and 13.2. It looks like the situation with overlapping data transfers and kernel execution has not changed significantly. Could you please comment on the future of this feature?


Hi Alexey,

Sorry, nothing has changed with regard to the synchronization behaviour. The challenges outlined by Michael still exist, and they are still investigating solutions. There is no time frame for when this can be improved.

- Mat