asynchronous memory transfer

I am trying to use the asynchronous memory transfer feature to prefetch data for a kernel while a previous kernel is executing. The release notes for the Toolkit state the following:

Current hardware limits the number of asynchronous memcopies that can be overlapped with kernel execution. Overlap is also limited to kernels executing for less than 1 second. These limitations are expected to improve on future hardware.

I would like to understand the asynchronous copy mechanism better. Could someone clarify the statement in the release notes? More specifically, how many memcopies can be overlapped? Where does the 1 second limitation come from? Are there further caveats to using asynchronous memcopies? (For example, I noticed that an asynchronous memcopy running in its own stream does not finish until previously running kernels finish executing.)

You can overlap multiple memcopies in one stream while a kernel runs in another. What makes you think a memcopy doesn’t finish until a previously issued kernel in another stream finishes? Is one of the streams stream 0?
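
For reference, here is a minimal sketch of the kind of setup I mean: a kernel in one non-default stream and an asynchronous host-to-device copy in another. The names `busyKernel`, `overlapDemo`, `computeStream` and `copyStream`, the buffer size and the iteration count are all made up for illustration. It assumes page-locked host memory, which asynchronous copies need. Whether the copy and the kernel actually run concurrently still depends on the hardware.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Sketch-level error handling: bail out on the first CUDA error.
// (Resources are not released on the error path.)
#define CUDA_TRY(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Hypothetical kernel that just keeps the GPU busy on its own buffer.
__global__ void busyKernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Returns 0 on success, 1 on a CUDA error, 2 if the copied data is wrong.
int overlapDemo() {
    const int n = 1 << 20;                  // assumed problem size
    const size_t bytes = n * sizeof(float);
    float *hNext = nullptr, *dCur = nullptr, *dNext = nullptr;

    // Page-locked (pinned) host buffer for the asynchronous copy.
    CUDA_TRY(cudaMallocHost((void **)&hNext, bytes));
    CUDA_TRY(cudaMalloc((void **)&dCur, bytes));
    CUDA_TRY(cudaMalloc((void **)&dNext, bytes));
    for (int i = 0; i < n; ++i) hNext[i] = (float)i;
    CUDA_TRY(cudaMemset(dCur, 0, bytes));
    CUDA_TRY(cudaDeviceSynchronize());

    // Two streams, neither of them stream 0.
    cudaStream_t computeStream, copyStream;
    CUDA_TRY(cudaStreamCreate(&computeStream));
    CUDA_TRY(cudaStreamCreate(&copyStream));

    // The kernel works on dCur while the next input is copied into dNext.
    busyKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(dCur, n, 1 << 12);
    CUDA_TRY(cudaGetLastError());
    CUDA_TRY(cudaMemcpyAsync(dNext, hNext, bytes,
                             cudaMemcpyHostToDevice, copyStream));

    // Wait for each stream separately.
    CUDA_TRY(cudaStreamSynchronize(copyStream));
    CUDA_TRY(cudaStreamSynchronize(computeStream));

    // Copy the prefetched buffer back and verify its contents.
    std::vector<float> check(n);
    CUDA_TRY(cudaMemcpy(check.data(), dNext, bytes, cudaMemcpyDeviceToHost));
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (check[i] != (float)i) ++mismatches;

    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(copyStream);
    cudaFree(dCur);
    cudaFree(dNext);
    cudaFreeHost(hNext);
    return mismatches == 0 ? 0 : 2;
}
```

Calling `overlapDemo()` from `main` and checking for a zero return value is enough to exercise it. The kernel and the copy touch different device buffers, so there is no data dependency between the two streams.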


A few months ago I wrote a piece of code with two streams, one running a kernel and the other running an asynchronous memory transfer during the kernel execution. When I synchronized on the stream with the memcopy, the program did not proceed until the kernel had finished executing. As far as I remember, neither of the streams was stream 0.
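
I no longer have that code, so the following is only a rough reconstruction of how the behaviour could be checked, not the original program. The names `spinKernel` and `kernelDoneWhenCopySyncReturns`, the sizes and the iteration count are invented for this sketch. It records an event after the kernel and queries it right after the synchronize on the copy stream returns.

```cuda
#include <cuda_runtime.h>

// Sketch-level error handling: return -1 on the first CUDA error.
#define CUDA_CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernel that is meant to outlast the memory copy.
__global__ void spinKernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Returns 0 if the kernel was still running when the copy stream's
// synchronize returned, 1 if the kernel had already finished by then,
// and -1 on a CUDA error.
int kernelDoneWhenCopySyncReturns() {
    const int n = 1 << 20;                  // assumed problem size
    const size_t bytes = n * sizeof(float);
    float *hBuf = nullptr, *dWork = nullptr, *dNext = nullptr;

    CUDA_CHECK(cudaMallocHost((void **)&hBuf, bytes));  // pinned host memory
    CUDA_CHECK(cudaMalloc((void **)&dWork, bytes));
    CUDA_CHECK(cudaMalloc((void **)&dNext, bytes));
    for (int i = 0; i < n; ++i) hBuf[i] = 1.0f;
    CUDA_CHECK(cudaMemset(dWork, 0, bytes));
    CUDA_CHECK(cudaDeviceSynchronize());

    cudaStream_t computeStream, copyStream;             // neither is stream 0
    CUDA_CHECK(cudaStreamCreate(&computeStream));
    CUDA_CHECK(cudaStreamCreate(&copyStream));
    cudaEvent_t kernelDone;
    CUDA_CHECK(cudaEventCreateWithFlags(&kernelDone, cudaEventDisableTiming));

    // Kernel in one stream, followed by an event marking its completion.
    spinKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(dWork, n, 1 << 14);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(kernelDone, computeStream));

    // Asynchronous copy in the other stream, then wait on that stream only.
    CUDA_CHECK(cudaMemcpyAsync(dNext, hBuf, bytes,
                               cudaMemcpyHostToDevice, copyStream));
    CUDA_CHECK(cudaStreamSynchronize(copyStream));

    // cudaErrorNotReady here means the kernel has not finished yet.
    cudaError_t q = cudaEventQuery(kernelDone);
    if (q != cudaSuccess && q != cudaErrorNotReady) return -1;

    CUDA_CHECK(cudaDeviceSynchronize());
    cudaEventDestroy(kernelDone);
    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(copyStream);
    cudaFree(dWork);
    cudaFree(dNext);
    cudaFreeHost(hBuf);
    return q == cudaSuccess ? 1 : 0;
}
```

A result of 0 would mean the synchronize on the copy stream returned while the kernel was still running, which is the opposite of what I saw. A result of 1 is ambiguous: either the synchronize waited for the kernel, or the kernel was simply too short on that machine, in which case the iteration count would need to be raised.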