Does cudaMemcpyAsync require pinned memory?

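Yes, I believe so, at least if you want the copy to overlap with other work. The call still accepts pageable memory, but the transfer is then not fully asynchronous. See this page: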

http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/

“The host memory involved in the data transfer must be pinned memory.”

I have always used pinned memory with cudaMemcpyAsync and do see overlapping behavior.
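For illustration, here is a minimal sketch of the pattern I mean: a pinned host buffer from cudaMallocHost, split across a few streams, with cudaMemcpyAsync on either side of a kernel. The kernel `scale`, the function name `run_overlap_demo`, and the stream count and chunk size are all made up for this example. Whether the copies actually overlap depends on the device's copy engines, so check a profiler timeline rather than taking this code as proof.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: doubles each element.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: returns true if every element came back doubled.
bool run_overlap_demo()
{
    const int nStreams = 4;                 // assumed stream count
    const int chunk = 1 << 20;              // assumed elements per stream
    const int n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h = nullptr;
    float *d = nullptr;

    // Pinned (page-locked) host allocation instead of malloc.
    if (cudaMallocHost((void **)&h, n * sizeof(float)) != cudaSuccess)
        return false;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }

    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, runs the kernel, and copies it back.
    // Work queued in different streams is allowed to overlap.
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams before reading the results on the host.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 2.0f);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);   // pinned memory is released with cudaFreeHost, not free
    return ok;
}
```

Call run_overlap_demo() from your own main and build with nvcc. Swapping cudaMallocHost/cudaFreeHost for malloc/free should still give correct results, but I would not expect the transfers to overlap in that case.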

Using 4 GB out of 64 GB of host memory should not degrade CPU performance. There is some additional overhead in the initial pinned memory allocation (it takes longer than a regular host malloc).