While reading the guide about overlapping data transfers, I noticed this statement:
"The host memory involved in the data transfer must be pinned memory."
However, this requirement is not mentioned anywhere in the NVIDIA runtime API documentation: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html
My program has multiple independent threads (1 thread = 1 job), so each thread uses its own cudaStream_t. In this case, do I need pinned memory for every host buffer that is transferred with cudaMemcpyAsync?
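For context, here is a minimal sketch of what each of my threads does. The kernel, the buffer size, and the 4-thread count are just placeholders, not my real workload:

```cpp
// Sketch of the per-thread pattern: one thread = one job = one stream.
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <vector>

__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void jobWorker(int id)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Each thread owns its own stream.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pinned (page-locked) host buffer -- this is the part my question is about.
    // Would a plain malloc() buffer be acceptable here instead?
    float* hostBuf = nullptr;
    cudaMallocHost(&hostBuf, bytes);
    for (int i = 0; i < n; ++i) hostBuf[i] = static_cast<float>(i);

    float* devBuf = nullptr;
    cudaMalloc(&devBuf, bytes);

    // H2D copy, kernel, D2H copy, all queued on this thread's stream.
    cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(devBuf, n);
    cudaMemcpyAsync(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);
    printf("thread %d done, hostBuf[1] = %f\n", id, hostBuf[1]);

    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    cudaStreamDestroy(stream);
}

int main()
{
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) workers.emplace_back(jobWorker, t);
    for (auto& w : workers) w.join();
    return 0;
}
```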