Does cudaMemcpyAsync require host memory to be pinned?

While reading the guide about overlapping data transfers, I noticed this:

The host memory involved in the data transfer must be pinned memory.

However, this is not mentioned at all in the NVIDIA documentation:

My program has multiple independent threads, one thread per job, so each thread uses a separate cudaStream_t. In this case, do I need pinned memory for every host transfer that uses cudaMemcpyAsync?

cudaMemcpyAsync also works with ordinary pageable memory, but in that case it will probably behave as a blocking call. The call can be truly asynchronous only when the host memory is pinned.

See CUDA Runtime API :: CUDA Toolkit Documentation
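
For the one-stream-per-thread setup described in the question, a minimal sketch might look like the following. The names `run_job` and `CHECK` are hypothetical and not from the original post. The job is assumed to be a simple round trip of `n` floats through the device, and cleanup on the error path is left out for brevity. The host buffers come from `cudaMallocHost`, so they are pinned and the `cudaMemcpyAsync` calls can return before the copies finish.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: report the first CUDA error and fail the job.
// (Resources are not released on this path; a real program should do so.)
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            return false;                                          \
        }                                                          \
    } while (0)

// Hypothetical per-thread job: copies n floats to the device and back
// on its own stream, using pinned host buffers for both directions.
bool run_job(std::size_t n) {
    const std::size_t bytes = n * sizeof(float);
    cudaStream_t stream;
    float *h_in = nullptr, *h_out = nullptr, *d_buf = nullptr;

    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMallocHost((void**)&h_in, bytes));   // pinned host memory
    CHECK(cudaMallocHost((void**)&h_out, bytes));  // pinned host memory
    CHECK(cudaMalloc((void**)&d_buf, bytes));

    for (std::size_t i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    // Both copies are queued on this thread's stream and run in order.
    CHECK(cudaMemcpyAsync(d_buf, h_in, bytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaMemcpyAsync(h_out, d_buf, bytes, cudaMemcpyDeviceToHost, stream));

    // The host thread could do unrelated work here before waiting.
    CHECK(cudaStreamSynchronize(stream));

    // h_out is only safe to read after the stream has been synchronized.
    bool ok = true;
    for (std::size_t i = 0; i < n; ++i) {
        if (h_out[i] != h_in[i]) { ok = false; break; }
    }

    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_out));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaStreamDestroy(stream));
    return ok;
}
```

Each worker thread would call something like `run_job(1 << 20)` independently. If the host buffer already exists as ordinary pageable memory, `cudaHostRegister` can pin it in place instead of allocating a new buffer with `cudaMallocHost`.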
