cudaStreamAttachMemAsync behavior questions

I’m in the process of adding GPU support to a deep learning library. I’m still at the design/experimentation stage of figuring out how best to use CUDA/cuDNN to maximize performance on both Maxwell- and Pascal-based cards, particularly with regard to overlapping host-GPU copies and kernel execution.

On Pascal it seems that unified memory is the way to go, and that cudaMemPrefetchAsync() can be used to achieve copy-compute overlap (by the way, have any CUDA 8.0 API docs been released yet?).
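For concreteness, here is roughly the Pascal pattern I have in mind - just a minimal sketch, assuming a single Pascal GPU (device 0), two user-created streams, and a placeholder kernel; none of these names come from any real library:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel that reads/writes the managed buffer.
__global__ void consume_buffer(float* buf, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

void prefetch_overlap_sketch(float* managed_buf, size_t bytes,
                             cudaStream_t copy_stream, cudaStream_t compute_stream)
{
    // Queue a host-to-device migration of the managed buffer on copy_stream.
    // The idea is that this overlaps with kernels already queued on compute_stream.
    cudaMemPrefetchAsync(managed_buf, bytes, /*dstDevice=*/0, copy_stream);

    // ... kernels that do not touch managed_buf keep running on compute_stream ...

    // Make compute_stream wait for the prefetch before launching a kernel
    // that reads managed_buf.
    cudaEvent_t prefetch_done;
    cudaEventCreateWithFlags(&prefetch_done, cudaEventDisableTiming);
    cudaEventRecord(prefetch_done, copy_stream);
    cudaStreamWaitEvent(compute_stream, prefetch_done, 0);
    cudaEventDestroy(prefetch_done);

    size_t n = bytes / sizeof(float);
    consume_buffer<<<(unsigned)((n + 255) / 256), 256, 0, compute_stream>>>(managed_buf, n);
}
```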

For Maxwell, I’m tending towards also using unified memory for a forward-looking design, and it seems that cudaStreamAttachMemAsync(…, cudaMemAttachSingle) can be used to achieve copy-compute overlap (in addition to its primary purpose of limiting memory visibility to the given stream):

http://stackoverflow.com/questions/23518299/unified-memory-and-streams-in-c
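For reference, this is roughly the Maxwell pattern I’m experimenting with, based on my reading of that answer - a minimal sketch with made-up buffer/kernel names, not code from any particular library:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel.
__global__ void scale(float* buf, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

void attach_single_sketch(size_t n)
{
    float* buf = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMallocManaged(&buf, n * sizeof(float));

    // Restrict visibility of buf to this stream, so (pre-Pascal) kernels in
    // other streams are not fenced against this managed allocation.
    cudaStreamAttachMemAsync(stream, buf, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(stream);   // host access is safe while the stream is idle

    for (size_t i = 0; i < n; ++i) buf[i] = float(i);

    scale<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(buf, n);
    cudaStreamSynchronize(stream);

    cudaFree(buf);
    cudaStreamDestroy(stream);
}
```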

I’ve got some questions about this sort of use of cudaStreamAttachMemAsync(cudaMemAttachSingle):

  1. Are there any caveats as to when it will or won’t cause memory copies to be overlapped with kernels in other streams? Is this behavior documented anywhere?

  2. The recommended way to optimize copy-compute overlap appears to be to do it yourself with cudaMemcpyAsync rather than cudaStreamAttachMemAsync(cudaMemAttachSingle). Is there any reason, when using unified memory, to prefer cudaMemcpyAsync? (The first sketch after this list shows the DIY pattern I mean.)

  3. For a modular design pre-Pascal (on Pascal, on-demand paging takes care of this), I’d like to release unified memory from the GPU stream after using it, since I don’t know where the memory will next be used (host or some GPU stream). I’d like to know the exact effect of cudaStreamAttachMemAsync(cudaMemAttachHost): does it just release the unified memory from the GPU so that the host can access/copy it back on demand, or does the call cause the device-to-host copy to occur regardless of whether the host subsequently accesses the memory or not? (The second sketch after this list shows the call sequence I have in mind.)

  4. Is cudaStreamAttachMemAsync smart enough to not copy memory unless it has changed? E.g. if my library (due to the user’s calling pattern) made back-to-back cudaStreamAttachMemAsync(cudaMemAttachSingle) calls, would the second call copy memory a second time even though it hadn’t changed? What if there was an intervening cudaStreamAttachMemAsync(cudaMemAttachHost) call - but no memory modification by the host - would the second “memAttachSingle” call re-copy memory to the GPU?

  5. Finally, for which CUDA versions and device architectures (Maxwell, Pascal) does cudaMallocManaged() cause a device sync? The CUDA 7.5 documentation says it (always) does, but presumably this isn’t the case on Pascal, where cudaMallocManaged is essentially allocating virtual memory in the host address space, and NVIDIA has indicated future support for the C++ “new” operator allocating unified memory.
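To make question 2 concrete, this is the DIY pattern I mean: explicit device memory plus pinned host staging and cudaMemcpyAsync instead of managed memory (again just a sketch with placeholder kernel/buffer names):

```cpp
#include <cuda_runtime.h>
#include <cstring>

// Placeholder kernel.
__global__ void consume(float* d_buf, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_buf[i] += 1.0f;
}

void diy_overlap_sketch(const float* h_src, size_t n, cudaStream_t stream)
{
    float* h_pinned = nullptr;
    float* d_buf = nullptr;
    cudaMallocHost(&h_pinned, n * sizeof(float));  // pinned, so the async copy can overlap
    cudaMalloc(&d_buf, n * sizeof(float));

    std::memcpy(h_pinned, h_src, n * sizeof(float));

    // The H2D copy is queued on 'stream' and can overlap with kernels
    // running in other streams.
    cudaMemcpyAsync(d_buf, h_pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    consume<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(d_buf, n);

    cudaStreamSynchronize(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
}
```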
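And for questions 3 and 4, this is the kind of call sequence I’m asking about: a managed buffer handed back to the host with cudaMemAttachHost, never touched by the host, and then re-attached (even twice in a row) with cudaMemAttachSingle. Kernel and buffer names are placeholders:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for one layer's forward pass.
__global__ void layer_forward(float* buf, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

void attach_sequence_sketch(float* managed_buf, size_t n, cudaStream_t s)
{
    unsigned blocks = (unsigned)((n + 255) / 256);

    // Use the buffer on the GPU in stream s.
    cudaStreamAttachMemAsync(s, managed_buf, 0, cudaMemAttachSingle);
    layer_forward<<<blocks, 256, 0, s>>>(managed_buf, n);

    // "Release" it so whoever uses it next (host or another stream) can do so.
    // Question 3: does this trigger a device-to-host copy by itself?
    cudaStreamAttachMemAsync(s, managed_buf, 0, cudaMemAttachHost);
    cudaStreamSynchronize(s);

    // The host does NOT touch managed_buf here.

    // Question 4: do these re-attachments re-copy the (unchanged) data?
    cudaStreamAttachMemAsync(s, managed_buf, 0, cudaMemAttachSingle);
    cudaStreamAttachMemAsync(s, managed_buf, 0, cudaMemAttachSingle);
    layer_forward<<<blocks, 256, 0, s>>>(managed_buf, n);
    cudaStreamSynchronize(s);
}
```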

Sorry for so many questions!