Performance of cudaMemcpyAsync

I found this statement in the CUDA documentation for the async copy: "For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread." I then ran some tests with two streams, one for a host-to-device cudaMemcpyAsync and another for kernel computation, with no dependency between the two streams. The memcpy appears to be synchronous rather than asynchronous, and I don't know why. Thanks.

cudaMemcpyAsync will be synchronous if the transfer is to or from pageable memory. See here:

Async memory copies will also be synchronous if they involve host memory that is not page-locked.
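
To illustrate the point about page-locked memory, here is a minimal sketch of the two-stream setup from the question, with the host buffer allocated through cudaMallocHost instead of malloc. The kernel `scale`, the function `run_overlap_demo`, the `CHECK` macro, and the buffer names are hypothetical and not taken from the original test. Whether the copy and the kernel actually overlap also depends on the device's copy engines, so this is only a starting point to check in a profiler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the "kernel computing" stream.
__global__ void scale(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Sketch-level error handling: report and bail out (resources are not freed on error).
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s failed: %s\n", #call,                  \
                    cudaGetErrorString(err_));                         \
            return 1;                                                  \
        }                                                              \
    } while (0)

// Returns 0 on success, 1 on any CUDA error.
int run_overlap_demo(size_t n) {
    const size_t bytes = n * sizeof(float);
    float *h_pinned = nullptr, *d_in = nullptr, *d_work = nullptr;
    cudaStream_t copyStream, computeStream;

    // Page-locked (pinned) host memory instead of malloc'd pageable memory.
    CHECK(cudaMallocHost(&h_pinned, bytes));
    CHECK(cudaMalloc(&d_in, bytes));
    CHECK(cudaMalloc(&d_work, bytes));
    CHECK(cudaMemset(d_work, 0, bytes));
    for (size_t i = 0; i < n; ++i) h_pinned[i] = 1.0f;

    CHECK(cudaStreamCreate(&copyStream));
    CHECK(cudaStreamCreate(&computeStream));

    // The copy and the kernel touch different device buffers and run in
    // different non-default streams, so there is no dependency between them.
    CHECK(cudaMemcpyAsync(d_in, h_pinned, bytes,
                          cudaMemcpyHostToDevice, copyStream));
    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    scale<<<blocks, threads, 0, computeStream>>>(d_work, n, 2.0f);
    CHECK(cudaGetLastError());

    CHECK(cudaStreamSynchronize(copyStream));
    CHECK(cudaStreamSynchronize(computeStream));

    CHECK(cudaStreamDestroy(copyStream));
    CHECK(cudaStreamDestroy(computeStream));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_work));
    CHECK(cudaFreeHost(h_pinned));
    return 0;
}

int main() {
    return run_overlap_demo(1 << 20);
}
```

If the host buffer already exists as pageable memory, cudaHostRegister can pin it in place rather than reallocating it.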

Thanks. According to the CUDA Programming Guide, when the data size is less than 64 KB, cudaMemcpyAsync is asynchronous for pageable memory. For other sizes it is synchronous.
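
One way to check the size dependence on a particular machine is to time how long the cudaMemcpyAsync call itself keeps the host busy, for pageable and pinned buffers at several sizes. The sketch below assumes a single GPU with at least 64 MB free; the helper names `enqueue_time_us` and `compare_enqueue_times` and the chosen sizes are hypothetical. A call that returns quickly regardless of size suggests the copy was only enqueued, while a time that grows with the size suggests the host was blocked. The actual numbers will vary with the GPU, driver, and system load.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Host-side time (microseconds) spent inside the cudaMemcpyAsync call itself,
// or -1 on error. The stream is drained afterwards so runs do not interfere.
double enqueue_time_us(void *dst, const void *src, size_t bytes, cudaStream_t s) {
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, s);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) return -1.0;
    if (cudaStreamSynchronize(s) != cudaSuccess) return -1.0;
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

// Returns 0 on success, 1 on any allocation or CUDA error.
int compare_enqueue_times() {
    const size_t sizes[] = {16u << 10, 64u << 10, 1u << 20, 64u << 20};
    const size_t maxBytes = 64u << 20;

    char *pageable = static_cast<char *>(malloc(maxBytes));  // pageable host memory
    char *pinned = nullptr, *dev = nullptr;
    cudaStream_t stream;
    if (!pageable) return 1;
    if (cudaMallocHost(&pinned, maxBytes) != cudaSuccess ||  // page-locked host memory
        cudaMalloc(&dev, maxBytes) != cudaSuccess ||
        cudaStreamCreate(&stream) != cudaSuccess) {
        free(pageable);
        return 1;
    }
    memset(pageable, 1, maxBytes);
    memset(pinned, 1, maxBytes);

    // Warm-up copy so one-time initialization cost is not measured.
    enqueue_time_us(dev, pinned, sizes[0], stream);

    int status = 0;
    for (size_t bytes : sizes) {
        double tPageable = enqueue_time_us(dev, pageable, bytes, stream);
        double tPinned = enqueue_time_us(dev, pinned, bytes, stream);
        if (tPageable < 0.0 || tPinned < 0.0) { status = 1; break; }
        printf("%10zu bytes: pageable %.1f us, pinned %.1f us\n",
               bytes, tPageable, tPinned);
    }

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(pinned);
    free(pageable);
    return status;
}

int main() {
    return compare_enqueue_times();
}
```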