cudaMemcpyAsync with pageable memory overlap with kernel

If the transfer size (streamBytes) is small enough, the pageable transfer can proceed roughly as if the memory were pinned.

More specifically, with pageable memory, the first step in the process is to copy the data to a runtime-maintained pinned (staging) buffer. A DMA transfer from that buffer to device memory is then initiated. Once the data is in the runtime-maintained pinned buffer, the transfer proceeds as if the source were pinned:

The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

As a programmer, you will find this behavior difficult to rely on. The programming advice is therefore to use a pinned buffer when you want a copy operation to overlap with a compute operation.
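For reference, here is a minimal sketch of the pinned-buffer approach. The kernel `addOne`, the function `runOverlapDemo`, the stream count, and the sizes are all made up for illustration; only `streamBytes` corresponds to the name used above. Whether the copies and kernels actually overlap still depends on the GPU (for example, how many copy engines it has), so treat this as a pattern and confirm the overlap with a profiler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: adds 1.0f to each element.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Returns 0 on success, 1 on a CUDA error or a wrong result.
int runOverlapDemo()
{
    const int nStreams = 4;                      // assumed stream count
    const int streamElems = 1 << 20;             // assumed elements per stream
    const int n = nStreams * streamElems;
    const size_t streamBytes = streamElems * sizeof(float);

    float *h_data = nullptr;
    float *d_data = nullptr;

    // Pinned (page-locked) host allocation, so the async copies need no
    // runtime-maintained staging buffer.
    if (cudaMallocHost(&h_data, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h_data);
        return 1;
    }
    for (int i = 0; i < n; ++i) h_data[i] = 0.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream gets its own chunk: copy in, compute, copy out.
    // Work issued to different streams is allowed to overlap.
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * streamElems;
        cudaMemcpyAsync(d_data + offset, h_data + offset, streamBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        addOne<<<(streamElems + 255) / 256, 256, 0, streams[s]>>>(
            d_data + offset, streamElems);
        cudaMemcpyAsync(h_data + offset, d_data + offset, streamBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams, then check that every element was incremented.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (h_data[i] == 1.0f);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return ok ? 0 : 1;
}
```

Calling `runOverlapDemo()` from `main` and viewing the timeline in a profiler such as Nsight Systems should show whether the transfers and kernels overlap on your device. Replacing `cudaMallocHost` with `malloc` (and `cudaFreeHost` with `free`) gives the pageable case discussed above for comparison.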
