CUDA 8: Uniform-memory overlapped host-device copies for Maxwell?

I’m in the process of writing a deep learning library with cuDNN support, and would like to know the most performant way to overlap host-device transfers with kernel execution in CUDA 8 on a Maxwell-based card…

I gather that for Pascal-based cards there’ll be cudaMemPrefetchAsync, but what about Maxwell? Do those cards still have to use page-locked host memory and cudaMemcpyAsync, or will there be any support for overlapped copies using uniform memory instead?

What is uniform memory? Do you mean unified memory?

Oops! Yes.

Today’s implementation of UM (Unified Memory) transfers managed data at one of two points: at kernel launch, and at the cudaDeviceSynchronize() call after a kernel launch. Since the runtime handles the transfers, it’s harder for you as a programmer to precisely control the overlap of copy and compute.
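To make those two points concrete, here is a minimal sketch of the managed-memory pattern. The kernel name (increment) and the buffer size are made up for illustration, and error checking is left out for brevity. The comments mark where, per the description above, the runtime moves the data on your behalf.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only for illustration.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;  // illustrative size
    int *data = nullptr;

    // One pointer usable from both host and device.
    cudaMallocManaged(&data, n * sizeof(int));

    // Host writes the managed buffer directly.
    for (int i = 0; i < n; ++i) data[i] = i;

    // Point 1: managed data is made available to the GPU at kernel launch.
    // You do not issue the copy, so you cannot place it in a stream of your choice.
    increment<<<(n + 255) / 256, 256>>>(data, n);

    // Point 2: the synchronize after the launch. On pre-Pascal GPUs the host
    // must not touch managed data until this call has returned.
    cudaDeviceSynchronize();

    printf("data[0] = %d, data[n-1] = %d\n", data[0], data[n - 1]);

    cudaFree(data);
    return 0;
}
```

Nothing in this code names a copy or the stream it runs in, which is why the overlap is hard to steer by hand.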

Doing things manually using cudaMemcpyAsync and traditional methods still gives you the most control.
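For reference, a rough sketch of that manual approach: page-locked host memory, one stream per chunk, and cudaMemcpyAsync on either side of the kernel. The scale kernel, the chunk count, and the chunk size are placeholders I’ve assumed for the example, not anything from your library. In real code the kernel would be replaced by whatever work (e.g. cuDNN calls) you enqueue on the same stream.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real per-chunk work.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kChunks = 4;            // illustrative values
    const int kChunkElems = 1 << 20;
    const size_t chunkBytes = kChunkElems * sizeof(float);

    float *hostBuf = nullptr, *devBuf = nullptr;
    // Page-locked (pinned) host memory, so the async copies can overlap with kernels.
    CHECK(cudaMallocHost(&hostBuf, kChunks * chunkBytes));
    CHECK(cudaMalloc(&devBuf, kChunks * chunkBytes));
    for (int i = 0; i < kChunks * kChunkElems; ++i) hostBuf[i] = 1.0f;

    cudaStream_t streams[kChunks];
    for (int c = 0; c < kChunks; ++c) CHECK(cudaStreamCreate(&streams[c]));

    // Per chunk: copy in, compute, copy out, all queued on that chunk's stream.
    // Work in different streams is allowed to overlap; work within one stream is ordered.
    for (int c = 0; c < kChunks; ++c) {
        float *h = hostBuf + (size_t)c * kChunkElems;
        float *d = devBuf + (size_t)c * kChunkElems;
        CHECK(cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[c]));
        scale<<<(kChunkElems + 255) / 256, 256, 0, streams[c]>>>(d, kChunkElems, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[c]));
    }
    CHECK(cudaDeviceSynchronize());

    printf("hostBuf[0] = %.1f\n", hostBuf[0]);  // each element was 1.0 and is scaled by 2.0

    for (int c = 0; c < kChunks; ++c) CHECK(cudaStreamDestroy(streams[c]));
    CHECK(cudaFree(devBuf));
    CHECK(cudaFreeHost(hostBuf));
    return 0;
}
```

Whether the copies and kernels actually overlap depends on the device (number of copy engines) and on how much work each chunk carries, so it’s worth confirming with a profiler timeline rather than assuming it.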

OK, thanks!

On a separate note, I’ve been experimenting with the copy-related parts of the CUDA API, and have found that the cudaStreamAttachMemAsync(stream, …cudaMemAttachSingle) call works when stream is cudaStreamLegacy, but silently fails (the memory is not attached) when passed cudaStreamDefault or cudaStreamPerThread. I’m not sure if this is a bug or intended… I couldn’t find any mention of the intended behavior in the documentation. I would have expected it to work, with the effective stream being the appropriate default stream (i.e. the current thread’s default stream when passing cudaStreamPerThread).

It could be a CUDA bug, or maybe you’ve made a mistake. Off the top of my head, I don’t know why cudaStreamAttachMemAsync semantics would vary based on the default stream behavior (after all, you are specifying a stream…) but I haven’t investigated it, and there are any number of things that might impact it that don’t immediately occur to me. I don’t generally spend any time on reported issues of this nature unless the OP provides a suitable short, complete reproducer. Even then, no guarantees (see below).

If you’re convinced that something is a defect in CUDA, and can generate a short, complete demonstration of it, the usual advice is to file a bug at

You’re welcome to discuss it here, of course, but in a community situation there are no guarantees that:

  1. anyone will read it
  2. anyone will think about it
  3. anyone will try to do something about it
  4. anyone will file a bug on your behalf