Zero Copy Access CUDA Pipeline

Hello all,

I have some code which executes the same function on the host, and then on the device, and outputs each to a file (it’s just some array arithmetic, nothing exciting).

I was using the standard method of allocating memory on the device and then copying the array contents from the host (see “Standard CUDA pipeline”).
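For readers without the original listing, here is a minimal sketch of what a standard allocate-and-copy pipeline usually looks like. It is not the poster's code: the kernel `scaleAdd`, the array size, and the `CHECK` macro are hypothetical stand-ins for the "array arithmetic" mentioned above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical kernel standing in for the array arithmetic in the post.
__global__ void scaleAdd(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(float);

    // Ordinary pageable host arrays.
    float *hIn  = (float *)malloc(bytes);
    float *hOut = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) hIn[i] = (float)i;

    // Separate device arrays.
    float *dIn = nullptr, *dOut = nullptr;
    CHECK(cudaMalloc((void **)&dIn, bytes));
    CHECK(cudaMalloc((void **)&dOut, bytes));

    // Explicit copy in, kernel launch, explicit copy out.
    CHECK(cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice));
    scaleAdd<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost));

    // Compare against the same arithmetic done on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != 2.0f * hIn[i] + 1.0f) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    CHECK(cudaFree(dIn));
    CHECK(cudaFree(dOut));
    free(hIn);
    free(hOut);
    return 0;
}
```

With a kernel this light, the two cudaMemcpy calls tend to dominate the run time, which would be consistent with the GPU version being barely faster than the CPU.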

I was unhappy with the results, as it was not much faster than the CPU, so I implemented the “Zero Copy Access” method and got a tremendous improvement in speed (and verified that the two outputs, from the CPU and the GPU, were the same).
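A minimal sketch of the zero-copy approach, for comparison. Again this is an assumption about the general shape, not the poster's code: the kernel `scaleAdd`, the array size, and the `CHECK` macro are hypothetical. The host arrays are allocated as pinned, mapped memory with cudaHostAlloc, and the kernel reads and writes them directly through device pointers obtained from cudaHostGetDevicePointer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical kernel standing in for the array arithmetic in the post.
__global__ void scaleAdd(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(float);

    // Request host-memory mapping before the CUDA context is created.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    if (!prop.canMapHostMemory) {
        fprintf(stderr, "device 0 cannot map host memory\n");
        return EXIT_FAILURE;
    }

    // Pinned, mapped host allocations. Only cudaHostAllocMapped is used here;
    // adding cudaHostAllocWriteCombined is documented to make CPU reads of
    // the buffer slow, so it is deliberately left out of this sketch.
    float *hIn = nullptr, *hOut = nullptr;
    CHECK(cudaHostAlloc((void **)&hIn, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void **)&hOut, bytes, cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) hIn[i] = (float)i;

    // Device-side aliases of the same physical host memory.
    float *dIn = nullptr, *dOut = nullptr;
    CHECK(cudaHostGetDevicePointer((void **)&dIn, hIn, 0));
    CHECK(cudaHostGetDevicePointer((void **)&dOut, hOut, 0));

    // No cudaMemcpy: the kernel accesses host memory directly.
    scaleAdd<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());        // results are visible to the host after this

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != 2.0f * hIn[i] + 1.0f) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    return 0;
}
```

Note that the CPU code now runs on the pinned arrays rather than on ordinary malloc memory, so any difference in how the platform treats pinned memory on the host side shows up in the CPU timings.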

HOWEVER: This DRASTICALLY slowed down the HOST (CPU) execution speed… by an order of magnitude!!!

Why is CPU execution on the host arrays so slow when using this zero-copy access method?

Thanks for the help


cudaHostAlloc doesn’t guarantee fast access; check this:

Unified Memory offers a “single-pointer-to-data” model that is conceptually similar to CUDA’s zero-copy memory. One key difference between the two is that with zero-copy allocations the physical location of memory is pinned in CPU system memory such that a program may have fast or slow access to it depending on where it is being accessed from. Unified Memory, on the other hand, decouples memory and execution spaces so that all data accesses are fast.
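A minimal sketch of the Unified Memory alternative described in the quote, using cudaMallocManaged. The kernel `scaleAdd`, the array size, and the `CHECK` macro are hypothetical, and whether this actually restores host speed for the code in question is an assumption that would need to be measured.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical kernel standing in for the array arithmetic in the post.
__global__ void scaleAdd(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(float);

    // One pointer per array, usable from both host and device code.
    float *in = nullptr, *out = nullptr;
    CHECK(cudaMallocManaged((void **)&in, bytes));
    CHECK(cudaMallocManaged((void **)&out, bytes));

    // Host writes through the same pointer the kernel will read.
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    scaleAdd<<<(n + 255) / 256, 256>>>(in, out, n);
    CHECK(cudaGetLastError());

    // The host must not touch managed data again until the kernel is done.
    CHECK(cudaDeviceSynchronize());

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (out[i] != 2.0f * in[i] + 1.0f) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    CHECK(cudaFree(in));
    CHECK(cudaFree(out));
    return 0;
}
```

Unlike the zero-copy version, the data is not fixed in CPU system memory here, so the runtime is free to place it near whichever processor is using it.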

extremely helpful, thank you!!