cudaMemcpy to non-pinned memory

I was wondering whether modern GPUs allocate an internal pinned buffer on the CPU when transferring data from GPU memory to non-pinned CPU memory.

In this blog post from 2012, it is stated that when you copy data from GPU memory to non-pinned CPU memory, pinned memory is implicitly allocated and used as an intermediate buffer in the transfer.

The reason given was that the GPU cannot directly access pageable memory because of possible page faults.

However, with UVM it seems that the GPU can now handle these faults, so this allocation may be unnecessary.
Do modern GPUs now directly transfer the data, or do they still allocate intermediate pinned buffers?
If they still allocate intermediate buffers, why are they needed?

I ran a small test measuring the resident set size (RSS) during a cudaMemcpy, once with the host memory allocated via malloc and once via cudaHostAlloc.

The RSS showed no considerable difference, but the execution times differed significantly.
Instead of allocating a separate buffer, couldn't the host memory be pinned for the duration of the transfer and unpinned once it completes? That way the only extra cost would be the time taken to pin the memory.
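For reference, the timing comparison I ran looked roughly like the sketch below, using cudaEvent timing around a device-to-host copy into a pageable buffer versus a pinned one (error checking omitted, sizes arbitrary):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;        // arbitrary transfer size (256 MiB)
    char *dev = nullptr;
    cudaMalloc(&dev, bytes);

    char *pageable = (char *)malloc(bytes);               // ordinary pageable memory
    char *pinned = nullptr;
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);  // page-locked memory

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Device-to-host copy into pageable memory (may be staged by the driver).
    cudaEventRecord(start);
    cudaMemcpy(pageable, dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable: %.2f ms\n", ms);

    // Same copy into pinned memory (direct DMA, no staging buffer).
    cudaEventRecord(start);
    cudaMemcpy(pinned, dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned:   %.2f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    return 0;
}
```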

Pinning host memory is expensive, which makes sense: the OS has to lock the pages into physical RAM so they can never be paged out. It is therefore highly advisable to pin host memory once and then re-use that pinned allocation as many times as is feasible.
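The pin-and-unpin-per-transfer idea you describe is actually exposed by the runtime as cudaHostRegister / cudaHostUnregister, but because registration itself is slow, the usual pattern is to register (or cudaHostAlloc) once up front and amortize the cost over many transfers. A minimal sketch (error checking omitted, loop count arbitrary):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;        // arbitrary buffer size (64 MiB)
    char *host = (char *)malloc(bytes);   // ordinary pageable allocation
    char *dev = nullptr;
    cudaMalloc(&dev, bytes);

    // Pin the existing allocation in place, once, up front.
    cudaHostRegister(host, bytes, cudaHostRegisterDefault);

    // Every subsequent cudaMemcpy to/from this buffer is a pinned
    // transfer; no intermediate staging buffer is needed.
    for (int i = 0; i < 100; ++i) {
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        // ... kernel work ...
        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    }

    cudaHostUnregister(host);  // unpin once, at the end
    cudaFree(dev);
    free(host);
    return 0;
}
```

Registering per-transfer would pay the pinning cost on every copy, which typically exceeds the cost of the driver's staged copy through its own pinned buffer.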

For host systems with high-throughput system memory, the advantage of using pinned allocations may not be very pronounced, as host-to-host copies are fast and add little overhead.

I understand why pinning is expensive.

I’m just curious, if the host memory is left unpinned, and you perform a cudaMemcpy to it, is an extra pinned buffer allocated?
The 2012 blog post indicates so, but I thought because of UVM that may be a bit outdated.

As best I know, a pinned staging buffer is still allocated by the driver for such transfers, though that allocation happens only once, not per transfer. Note that UVM stands for "unified virtual memory": it just means that all CPU and GPU memory is mapped into a single virtual address space, with no specific guarantees about physical pages that I am aware of.

If you need an authoritative answer, you will need to ask an NVIDIA employee who can confirm with the CUDA driver team.