Yes, I believe so, according to this page:
http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
“The host memory involved in the data transfer must be pinned memory.”
I have always used pinned memory with cudaMemcpyAsync and do see overlapping behavior.
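Below is a minimal sketch of that pattern. It assumes a device with at least one copy engine (see `asyncEngineCount` in `cudaDeviceProp`). The host buffer is allocated with `cudaMallocHost`, so it is pinned. It is then split into equal chunks, and each chunk's host-to-device copy, kernel, and device-to-host copy are issued into their own stream, so copies in one stream can overlap the kernel running in another. The names `scale`, `scale_with_overlap` and `check` are hypothetical, made up for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Print a CUDA error (if any) and report whether the call succeeded.
static bool check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical example kernel: scales each element in place.
__global__ void scale(float* data, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= a;
}

// Hypothetical helper: processes n floats in nStreams equal chunks, one stream
// per chunk, so copies in one stream can overlap the kernel in another.
// Returns the number of wrong results (0 = correct) or -1 on error.
int scale_with_overlap(int n, int nStreams, float a) {
    if (n <= 0 || nStreams <= 0 || n % nStreams != 0) return -1;  // equal chunks only
    const int chunk = n / nStreams;
    const size_t chunkBytes = chunk * sizeof(float);

    float* h = nullptr;  // pinned (page-locked) host memory
    float* d = nullptr;  // device memory
    if (!check(cudaMallocHost(&h, n * sizeof(float)), "cudaMallocHost")) return -1;
    if (!check(cudaMalloc(&d, n * sizeof(float)), "cudaMalloc")) {
        cudaFreeHost(h);
        return -1;
    }
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i);

    std::vector<cudaStream_t> streams(nStreams);
    for (auto& s : streams) cudaStreamCreate(&s);

    // Issue copy-in, kernel, copy-out per chunk. Each stream runs its three
    // operations in order, but different streams may run concurrently.
    for (int s = 0; s < nStreams; ++s) {
        const int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk, a);
        cudaMemcpyAsync(h + off, d + off, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    bool ok = check(cudaGetLastError(), "kernel launch") &&
              check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    int bad = 0;
    for (int i = 0; ok && i < n; ++i)
        if (h[i] != a * static_cast<float>(i)) ++bad;

    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return ok ? bad : -1;
}
```

Called as `scale_with_overlap(1 << 20, 4, 2.0f)`, it should return 0. The function only checks that the results are correct. Whether the copies actually overlap with kernel execution is best confirmed in a profiler timeline (e.g. Nsight Systems). Consistent with the quote above, replacing the pinned buffer with pageable `malloc` memory is expected to prevent this overlap.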
Pinning 4 GB out of 64 GB of host memory will not degrade CPU performance. There is some additional overhead for the initial pinned allocation, which is more expensive than a regular host malloc.
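To get a rough idea of that extra allocation cost on a given machine, a sketch like the following times a pageable `malloc` against `cudaMallocHost` of the same size. `time_ms` and `compare_alloc_cost` are hypothetical helpers, and the numbers will depend on the OS, driver, and allocation size. Note that `malloc` typically only reserves address space, with pages committed on first touch. A pinned allocation, by contrast, has to commit and page-lock the whole range up front, so the gap tends to grow with size.

```cuda
#include <chrono>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time, in milliseconds, of running f once.
template <typename F>
static double time_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical helper: times a pageable malloc and a pinned cudaMallocHost of
// the same size. Returns false if the pinned allocation fails.
bool compare_alloc_cost(size_t bytes, double* pageable_ms, double* pinned_ms) {
    void* pageable = nullptr;
    void* pinned = nullptr;
    cudaFree(0);  // create the CUDA context first so it isn't charged to cudaMallocHost

    *pageable_ms = time_ms([&] { pageable = std::malloc(bytes); });

    cudaError_t err = cudaSuccess;
    *pinned_ms = time_ms([&] { err = cudaMallocHost(&pinned, bytes); });

    std::free(pageable);
    if (err != cudaSuccess) return false;
    cudaFreeHost(pinned);
    return true;
}
```

For example, `compare_alloc_cost(size_t(4) << 30, &pageable_ms, &pinned_ms)` approximates the 4 GB case above. Because this is a one-time cost, allocating the pinned buffer once and reusing it across transfers amortizes it.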