What’s happening is that a helper thread is spawned to handle the device-to-host transfer. So while we don’t call CUDA’s asynchronous memory routines, the thread, and hence the copy, runs asynchronously with respect to the main host thread.
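The helper-thread scheme can be sketched as follows (a conceptual illustration in Python, not the actual runtime code; a plain in-memory copy stands in for the real blocking device-to-host transfer):

```python
import threading

def helper_thread_copy(dst, src):
    """Hypothetical sketch: spawn a helper thread that performs a
    blocking copy, so the main host thread is not held up by it.
    In the real runtime, the helper thread would issue a synchronous
    device-to-host transfer instead of this list copy."""
    def worker():
        dst[:] = src[:]          # stands in for the blocking D2H copy
    t = threading.Thread(target=worker)
    t.start()
    return t                     # caller joins later, at the "wait" point

src = list(range(1024))
dst = [0] * 1024
t = helper_thread_copy(dst, src)
# ... the main host thread is free to do other work here ...
t.join()                         # an OpenACC "wait" would complete the transfer
print(dst[-1])                   # -> 1023
```

The main thread only pays for the copy when it reaches the join, which is why the copy appears asynchronous even though no CUDA async routine was called.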
Are you going to implement it in the next release?
No. The problem is that there are no callbacks from the device, so for an asynchronous device-to-host transfer there is no way to know when the data transfer has completed. Hence the use of the helper thread.
In the first place, we would have to link against the driver API rather than the runtime API.
In the second place, these are not true callbacks: they would require polling or waiting for an event to complete, neither of which is asynchronous with respect to the host code.
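To illustrate the point: without a completion callback, the host thread's only options are to poll for completion or to block until the transfer finishes, and either way the host thread is occupied (a conceptual sketch; the `done` event stands in for a CUDA event recorded after the copy):

```python
import threading
import time

done = threading.Event()         # stands in for a CUDA event on the stream

def transfer():
    time.sleep(0.05)             # simulated device-to-host transfer
    done.set()                   # "event recorded" when the copy finishes

threading.Thread(target=transfer).start()

# Option 1: poll -- the host thread spins checking the event
# instead of doing useful work.
polls = 0
while not done.is_set():
    polls += 1

# Option 2: block -- equivalent to waiting on the event; the host
# thread is stalled until completion.
done.wait()

print("transfer complete")
```

Either option ties up the host thread for the duration of the transfer, which is exactly the problem the helper thread is meant to avoid.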
We are looking at reimplementing the async support and perhaps getting rid of the auxiliary thread, but we’re not sure what the effect on performance would be. The real problem is that true asynchronous data copies require pinned host memory, and the interface for pinning host memory is not composable: for example, the user program and the OpenACC runtime may both try to pin and unpin the same memory.
I’m working now with 13.1 and 13.2. It looks like the situation with overlapping data transfers and kernel execution has not changed significantly. Could you please comment on the future of this feature?
Sorry, nothing has changed with regard to the synchronization behaviour. The same challenges Michael outlined still exist, and they are still investigating solutions. There is no time frame for when this will be improved.