Can someone explain the difference between
page-locked memory that is copied using cudaMemcpy() and
zero-copy page-locked memory?
I understand that the first uses only a DMA transfer, without a “staging” buffer, which is faster than ordinary pageable host memory (from malloc()) paired with cudaMemcpy().
Zero-copy memory, in contrast, is accessed by the kernel through a device pointer, so no explicit copy is invoked at all.
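To make sure we are talking about the same two variants, here is a minimal sketch of how I understand them. The kernel `scale`, the buffer names, and the sizes are made up for illustration and are not my real code; error checking is left out for brevity, and I am assuming a device that supports mapped host memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: scales each element in place.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;                    // assumed problem size
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Request support for mapped (zero-copy) host memory before any
    // other CUDA call creates the context.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Variant 1: page-locked host buffer, explicit cudaMemcpy() to and from
    // device memory. The kernel reads and writes device memory only.
    float *h_pinned = NULL, *d_buf = NULL;
    cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, bytes);
    for (int i = 0; i < n; ++i) h_pinned[i] = 1.0f;
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
    scale<<<blocks, threads>>>(d_buf, n, 2.0f);
    cudaMemcpy(h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost);

    // Variant 2: zero-copy. The host buffer is mapped into the device
    // address space, so every kernel access goes to host memory over PCIe.
    float *h_mapped = NULL, *d_mapped = NULL;
    cudaHostAlloc((void **)&h_mapped, bytes, cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_mapped[i] = 1.0f;
    cudaHostGetDevicePointer((void **)&d_mapped, h_mapped, 0);
    scale<<<blocks, threads>>>(d_mapped, n, 2.0f);
    cudaDeviceSynchronize();                  // wait before reading on the host

    printf("pinned + memcpy: %f, zero-copy: %f\n", h_pinned[0], h_mapped[0]);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    cudaFreeHost(h_mapped);
    return 0;
}
```

In variant 1 the data crosses the bus once in each direction, while in variant 2 every read and write issued by the kernel has to cross it, which is what makes me suspect the bus.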
I have only tried the second variant so far, but it seems to be about 3 times slower in all my calculations, at least when the kernels compute directly on zero-copy page-locked memory.
Is the slowdown caused by going over the PCIe bus on every access to zero-copy memory? PCIe bandwidth is about 16 GB/s, whereas peak device memory bandwidth is, depending on the card, about 85 GB/s.
Thanks and best regards, tdhd