Zero Copy performance problem


I have modified the matrixMul sample to use zero copy.
The original timing did not include the time spent in data transfer, so I changed that as well.
For 1 iteration of the kernel (after the warm-up, which is not included in the timing), the cudaMemcpy() version takes 1.9 ms but the zero-copy version takes 2.7 ms.
For zero-copy I am using:
cudaHostAlloc((void **)&h_A, mem_size_A, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_A, (void *)h_A, 0);

Is there a crucial step I am missing? Am I bypassing the GPU’s L2 cache? Does zero-copy use coalescing when accessing CPU RAM?
My motivation for using zero-copy is to free myself from the GPU RAM size limitation, but the penalty I am seeing seems excessive.
Many thanks in advance.

Zero-copy isn’t free. If it were, there’d be no reason to have cudaMemcpy or any memory on the GPU board.

In general, access to zero-copy memory is slower than access to GPU onboard memory.

It’s not clear what performance you were expecting, but I would expect any code to slow down when switching from GPU onboard memory to zero-copy. The fact that you only saw a slowdown of roughly 40% (1.9 ms to 2.7 ms) is impressive to me. The best possible zero-copy access bandwidth is the PCIE transfer speed, so ~6GB/s on a Gen2 link. GPU onboard memory can easily be in the 60-120GB/s range depending on your GPU, so easily 10-20 times faster. If your code accesses memory repeatedly, you’re going to pay a huge price if those repeated accesses go to zero-copy memory instead of onboard memory.
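
To make the comparison concrete, here is a minimal sketch (not the matrixMul sample itself) that times both paths with CUDA events. The kernel name `scale`, the buffer size and the `CHECK` macro are made up for illustration. The cudaSetDeviceFlags(cudaDeviceMapHost) call is, as far as I know, only required on older setups and is included to be safe.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking helper (not part of the CUDA API).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in kernel: reads each input element exactly once.
__global__ void scale(const float *in, float *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main()
{
    const size_t n = 1 << 24;                 // 16M floats = 64 MiB
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);

    // Must come before any call that creates the context.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    float *h_in, *d_in_zc, *d_in, *d_out;
    CHECK(cudaHostAlloc((void **)&h_in, bytes, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&d_in_zc, (void *)h_in, 0));
    CHECK(cudaMalloc((void **)&d_in, bytes));
    CHECK(cudaMalloc((void **)&d_out, bytes));
    for (size_t i = 0; i < n; ++i) h_in[i] = 1.0f;

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    float msCopy = 0.0f, msZeroCopy = 0.0f;

    // Warm-up launch, excluded from the timing.
    scale<<<blocks, threads>>>(d_in_zc, d_out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Path 1: explicit copy to device memory, then the kernel.
    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice));
    scale<<<blocks, threads>>>(d_in, d_out, n);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&msCopy, start, stop));

    // Path 2: the kernel reads the mapped host buffer directly.
    CHECK(cudaEventRecord(start));
    scale<<<blocks, threads>>>(d_in_zc, d_out, n);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&msZeroCopy, start, stop));

    // bytes / ms / 1e6 gives GB/s of input data moved.
    printf("cudaMemcpy + kernel: %.3f ms (%.2f GB/s)\n",
           msCopy, bytes / msCopy / 1e6);
    printf("zero-copy kernel:    %.3f ms (%.2f GB/s)\n",
           msZeroCopy, bytes / msZeroCopy / 1e6);

    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_in));
    return 0;
}
```

Because this kernel reads every input element exactly once, both paths move the data across PCIE once, so I would expect the gap to be modest. A kernel like matrixMul, which reads the same elements many times, should show a larger penalty with zero-copy, since every re-read that is not served from on-chip memory goes back across the bus.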

Hi Robert! Could you advise on the minimum zero-copy mapping “packet size”? I mean a situation where the kernel reads small data blocks at random locations in the pinned host memory area. The literature advises using coalesced reads, although I think the real reason is that mapping happens in “data packets” of a defined size. (Similarly to global memory access, where the cache line size defines how large a block is optimal for reads/writes.) Is this true? Can you advise on the minimum size of these packets? Thanks in advance!

As far as I know, this is not specified anywhere. When something isn’t specified, that has a few ramifications:

  • They might be not specified for a reason
  • Not specified sometimes means that CUDA architects don’t want to pin this down for some reason (e.g. for flexibility in designing future compatible architectures)
  • I generally don’t have permission to release non-public information, except as can be derived from inspection that anyone could do.

From what I have seen of NVIDIA GPU architectures, relating to global memory space accesses (system memory is part of the logical global space when using zero-copy), a 32-byte minimum “packet size” is probably a good guess. This applies to L2 and modern L1, based on what I have been able to test, and I suspect it might also be the “granularity of request” when the SM is issuing PCIE (host) cycles for zero-copy. Naturally these 32 bytes are adjacent, and probably aligned. There probably is also a mechanism to limit the request when hitting the end of an allocation.

Perhaps importantly, my tests suggest these cycles are not cached in L2, so the “mentality” for optimizing traffic would/should be very similar to the “mentality” you would have as a GPU programmer on cc1.x GPUs, which did not have L2 cache. Try to organize cycles/requests across the warp that are:

  • aligned
  • adjacent
  • in units of at least 32 bytes, and ideally with 32-byte granularity (so 32 bytes or 64 bytes or 96 bytes or …)
  • use all the bytes you request
  • requesting more at once (per warp-request) is better
  • spread the requests across many warps and SMs (don’t assume you can saturate PCIE from a single warp or a single SM)
  • try not to request the same data twice (perhaps use shared memory for example)

Those would be my general suggestions, and they should be roughly consistent with optimal device memory usage in a cc1.x environment (which has been obsolete for years, so it’s just a historical footnote at this point).
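
As an illustration of the list above, here is a minimal sketch of a kernel that reads mapped host memory in aligned, adjacent chunks and stages each chunk in shared memory so that a neighbouring element is not fetched across PCIE a second time. The kernel name `neighborSum`, the sizes and the `CHECK` macro are hypothetical, and the sketch assumes the element count is a multiple of the block size.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking helper (not part of the CUDA API).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Each thread loads one float4 (16 bytes), so a warp requests
// 32 * 16 = 512 contiguous, aligned bytes of mapped host memory.
// Assumption: nVec is a multiple of blockDim.x, so no bounds check is needed.
__global__ void neighborSum(const float4 *in, float *out, size_t nVec)
{
    extern __shared__ float4 tile[];          // blockDim.x elements
    const unsigned t = threadIdx.x;
    for (size_t base = (size_t)blockIdx.x * blockDim.x; base < nVec;
         base += (size_t)gridDim.x * blockDim.x) {
        tile[t] = in[base + t];               // single coalesced read over PCIE
        __syncthreads();
        const float4 a = tile[t];
        const float4 b = tile[(t + 1) % blockDim.x];   // neighbour from shared memory
        out[base + t] = a.x + a.y + a.z + a.w + b.x + b.y + b.z + b.w;
        __syncthreads();                      // tile is overwritten next iteration
    }
}

int main()
{
    const size_t nVec = 1 << 20;              // 1M float4 = 16 MiB
    const int threads = 256, blocks = 64;     // many warps on many SMs

    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    float4 *h_in, *d_in;
    float *d_out;
    CHECK(cudaHostAlloc((void **)&h_in, nVec * sizeof(float4),
                        cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&d_in, (void *)h_in, 0));
    CHECK(cudaMalloc((void **)&d_out, nVec * sizeof(float)));
    for (size_t i = 0; i < nVec; ++i)
        h_in[i] = make_float4(1.0f, 1.0f, 1.0f, 1.0f);

    neighborSum<<<blocks, threads, threads * sizeof(float4)>>>(d_in, d_out, nVec);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    float first = 0.0f;
    CHECK(cudaMemcpy(&first, d_out, sizeof(float), cudaMemcpyDeviceToHost));
    printf("out[0] = %.1f\n", first);         // two all-ones float4 values sum to 8

    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_in));
    return 0;
}
```

Without the shared-memory tile, the read of the neighbouring element would be a second request to host memory for data that a different thread in the block had already fetched.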

Things that I have said which are not specified might be incorrect or might change in the future. YMMV.

Also note that this is a pretty old thread. Numbers I mentioned years ago are no longer typical (like PCIE Gen2, device memory speeds, etc.)

caveat: these are just my suggestions, not any statements of specification

suggestion: perhaps consider managed memory to handle these cases, if you would like GPU L2 caching of system memory.
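
For reference, a minimal sketch of the managed memory alternative. The kernel name `addOne`, the sizes and the `CHECK` macro are hypothetical. Whether pages migrate on demand, and whether an allocation may exceed GPU memory, depends on the GPU generation and operating system, so treat this as a starting point rather than a statement about any particular system.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking helper (not part of the CUDA API).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel operating directly on the managed buffer.
__global__ void addOne(float *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const size_t n = 1 << 22;                 // 4M floats = 16 MiB
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);

    // One pointer, valid on both host and device.
    float *data;
    CHECK(cudaMallocManaged((void **)&data, n * sizeof(float)));
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // written by the CPU

    // The runtime moves the data to the GPU; once resident in device
    // memory, it is accessed like any other device allocation.
    addOne<<<blocks, threads>>>(data, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());           // required before the CPU reads it

    printf("data[0] = %.1f\n", data[0]);      // 1.0 + 1.0

    CHECK(cudaFree(data));
    return 0;
}
```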

Thanks, Robert, for the very detailed answer. It helped a lot, above all in understanding the underlying mechanism! My plan is exactly to spread the requests across SMs and warps. Some alignment is possible, but that requirement is not fully met, because the kernel decides at runtime what data to read (from a video stream). Still, knowing the PCIE transaction granularity, I think it will work fine.

I understand this is not an official specification, but some tests can refine the optimal granularity, if that is required at all.

I suspect the kernel calculations will hide the mapping latency, but maybe other memory transaction types must be checked too. Thanks a lot!
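
One possible shape for such a test, as a rough sketch only: each warp issues requests of 4 to 128 bytes at pseudo-random 128-byte-aligned positions in a mapped host buffer, and the time per chunk size is compared. The kernel name `probe`, all sizes, and the `CHECK` macro are assumptions for illustration, and the numbers it prints will depend on the GPU, the PCIE link and the driver.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking helper (not part of the CUDA API).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Each warp reads chunkWords adjacent 4-byte words per request from a
// pseudo-random 128-byte-aligned slot. volatile keeps every load in place.
__global__ void probe(const volatile unsigned *buf, size_t nWords,
                      int chunkWords, int requestsPerWarp, unsigned *sink)
{
    const unsigned lane = threadIdx.x % 32;
    const size_t warpId = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) / 32;
    const size_t nSlots = nWords / 32;
    unsigned long long state = warpId * 2654435761ULL + 1ULL;  // per-warp seed
    unsigned acc = 0;
    for (int r = 0; r < requestsPerWarp; ++r) {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        const size_t slot = (size_t)((state >> 33) % nSlots);
        if (lane < (unsigned)chunkWords) acc += buf[slot * 32 + lane];
    }
    if (acc == 0xFFFFFFFFu) *sink = acc;      // never true here; keeps acc live
}

int main()
{
    const size_t bytes = (size_t)64 << 20;    // 64 MiB mapped buffer
    const size_t nWords = bytes / sizeof(unsigned);
    const int blocks = 256, threads = 256, requestsPerWarp = 256;

    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    unsigned *h_buf, *d_buf, *d_sink;
    CHECK(cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&d_buf, (void *)h_buf, 0));
    CHECK(cudaMalloc((void **)&d_sink, sizeof(unsigned)));
    for (size_t i = 0; i < nWords; ++i) h_buf[i] = 1u;

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    const double requests = (double)blocks * (threads / 32) * requestsPerWarp;

    for (int chunkWords = 1; chunkWords <= 32; chunkWords *= 2) {
        // Warm-up launch, excluded from the timing.
        probe<<<blocks, threads>>>(d_buf, nWords, chunkWords, requestsPerWarp, d_sink);
        CHECK(cudaDeviceSynchronize());

        float ms = 0.0f;
        CHECK(cudaEventRecord(start));
        probe<<<blocks, threads>>>(d_buf, nWords, chunkWords, requestsPerWarp, d_sink);
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        CHECK(cudaEventElapsedTime(&ms, start, stop));

        // Useful bytes only: bytes / ms / 1e3 gives MB/s.
        printf("%3d bytes/request: %8.3f ms, %10.1f MB/s useful\n",
               chunkWords * 4, ms, requests * chunkWords * 4 / ms / 1e3);
    }

    CHECK(cudaFree(d_sink));
    CHECK(cudaFreeHost(h_buf));
    return 0;
}
```

If the guess of a 32-byte request granularity holds, the elapsed time should stay roughly flat for requests of up to 32 bytes while the useful bandwidth grows with the chunk size. That is a hypothesis to check, not an expected result.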