I am trying to use the asynchronous memory transfer feature to prefetch data for a kernel while a previous kernel is executing. The release notes for the Toolkit state the following:
Current hardware limits the number of asynchronous memcopies that can be overlapped with kernel execution. Overlap is also limited to kernels executing for less than 1 second. These limitations are expected to improve on future hardware.
I would like to understand the asynchronous copy mechanism better. Could someone clarify the statement in the release notes? More specifically: how many memcopies can be overlapped with kernel execution? Where does the 1-second limitation come from? Are there further caveats to using asynchronous memcopies? For example, I noticed that an asynchronous memcopy issued in its own stream does not finish until previously running kernels finish executing.
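To make the question concrete, here is a minimal sketch of the kind of pattern I mean, not my actual code. The kernel `scale`, the function `run_overlap_demo`, the buffer names, and the sizes are all hypothetical, and it assumes a reasonably recent CUDA runtime (for `cudaDeviceSynchronize` and the `asyncEngineCount` device property). One kernel runs in a compute stream while the input for the next kernel is copied from pinned host memory in a separate stream.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the "previous" kernel: scales data in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns 0 on success, 1 on any CUDA error.
int run_overlap_demo() {
    const int n = 1 << 20;                 // illustrative size only
    const size_t bytes = n * sizeof(float);

    // Report how many copy engines the device exposes.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    float *h_next = nullptr;               // input for the next kernel (host)
    float *d_current = nullptr;            // data the current kernel works on
    float *d_next = nullptr;               // prefetch target on the device

    // Asynchronous copies need page-locked (pinned) host memory.
    if (cudaMallocHost(&h_next, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&d_current, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&d_next, bytes) != cudaSuccess) return 1;
    cudaMemset(d_current, 0, bytes);
    for (int i = 0; i < n; ++i) h_next[i] = 1.0f;

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    // Launch the current kernel in one stream ...
    scale<<<(n + 255) / 256, 256, 0, compute>>>(d_current, n, 2.0f);
    // ... and issue the prefetch for the next kernel in another stream.
    cudaMemcpyAsync(d_next, h_next, bytes, cudaMemcpyHostToDevice, copy);

    // Wait for both streams; this also surfaces any launch or copy error.
    cudaError_t err = cudaDeviceSynchronize();

    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    cudaFree(d_current);
    cudaFree(d_next);
    cudaFreeHost(h_next);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    return run_overlap_demo();
}
```

Whether the copy actually overlaps with the kernel in a setup like this is exactly what I am unsure about; the sketch only shows how the two operations are issued.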