Can someone explain to me what the difference is between
[*]page-locked memory that is copied using cudaMemcpy() and
[*]zero-copy page-locked memory?
I do understand that the first variant uses DMA to transfer the memory without a staging buffer, which is faster than copying from ordinary pageable host memory (plain malloc()) with cudaMemcpy().
And zero-copy memory is accessed through a device pointer, which does not involve an explicit copy at all.
I have only tried the second variant so far, but it seems about 3 times slower, at least when my kernels do their calculations directly from zero-copy page-locked memory.
Is the slowdown caused by going over the PCIe bus on every access with zero-copy memory? PCIe bandwidth is about 16 GB/s, while peak device memory bandwidth, depending on the card, is about 85 GB/s.
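For reference, this is roughly what I mean by the two variants (a minimal sketch with illustrative names and sizes, error checking omitted, not my actual code):

[code]
// Minimal sketch of the two variants (illustrative sizes, error checking omitted).
#include <cuda_runtime.h>

__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Needed for variant 2; must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Variant 1: page-locked host buffer + explicit DMA copy to/from device memory.
    float *h_pinned, *d_buf;
    cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);
    cudaMalloc((void**)&d_buf, bytes);
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d_buf, n);        // kernel works on fast device memory
    cudaMemcpy(h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost);

    // Variant 2: zero-copy, the kernel dereferences the pinned host buffer over PCIe.
    float *h_mapped, *d_mapped;
    cudaHostAlloc((void**)&h_mapped, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_mapped, h_mapped, 0);
    scale<<<(n + 255) / 256, 256>>>(d_mapped, n);     // every access crosses the bus
    cudaDeviceSynchronize();

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    cudaFreeHost(h_mapped);
    return 0;
}
[/code]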
In case anyone wondered, the answer to the slowdown is that pinned (page-locked) host memory is not cached on the GPU.
So repeated reads/writes from a kernel are slower.
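To illustrate with a hypothetical kernel (not my real code): each element of in below is loaded by three different threads, so with a zero-copy pointer the same data crosses the PCIe bus three times, while from device memory the re-reads are served by GPU DRAM.

[code]
// Illustrative kernel only: each element of 'in' is loaded by three different
// threads. From device memory those re-reads are served by GPU DRAM (or caches);
// from a mapped zero-copy pointer every single load crosses the PCIe bus again,
// because the pinned host memory is not cached on the GPU.
__global__ void stencil3(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1) return;

    out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}
[/code]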
You’re conflating two concepts: pinned memory and zero-copy access to pinned memory. Pinned memory is faster for CPU-GPU copies with cudaMemcpy() because the GPU can DMA to/from the memory directly, whereas pageable memory requires an intermediate CPU-side memcpy (which is also why you can’t do async memcpys with pageable memory). Zero-copy means using pinned memory directly from a kernel, which is slower in latency and bandwidth than accessing GPU memory, but since you don’t have to do a memcpy to or from the buffer around the kernel launch, it may make your overall application faster if the buffer is only used as a one-time input or output.
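As a rough sketch of that one-time-output case (hypothetical names, error checking omitted): the kernel writes each element exactly once, straight into a mapped pinned buffer, so no separate device-to-host cudaMemcpy is needed afterwards. If the kernel instead re-read or re-wrote the data, you’d want it in device memory and pay the one explicit copy.

[code]
// Sketch: zero-copy as a write-once output buffer. The kernel streams its results
// directly into mapped pinned host memory, so no cudaMemcpy(DeviceToHost) is needed.
#include <cuda_runtime.h>

__global__ void produce(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i * 0.5f;                 // each element written exactly once
}

int main()
{
    const int n = 1 << 20;

    cudaSetDeviceFlags(cudaDeviceMapHost);        // enable mapped pinned allocations

    float *h_out, *d_out;
    cudaHostAlloc((void**)&h_out, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_out, h_out, 0);

    produce<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaDeviceSynchronize();                      // results are already sitting in h_out

    cudaFreeHost(h_out);
    return 0;
}
[/code]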