Zero Copy performance problem

Greetings,

I have modified the matrixMul sample to use zero copy.
The original code did not include the time spent in data transfer in its measurement, so I modified that as well.
For 1 iteration of the kernel (after a warm-up run that is not included in the timing), the cudaMemcpy() version takes 1.9 ms but the zero-copy version takes 2.7 ms.
For 0-copy I am using
cudaHostAlloc((void **)&h_A, mem_size_A, cudaHostAllocMapped)
and
cudaHostGetDevicePointer((void **)&d_A, (void *)h_A, 0)
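
Put together, the relevant setup looks roughly like this (error checking omitted; cudaSetDeviceFlags(cudaDeviceMapHost) is the documented prerequisite for mapped allocations):

    // enable mapping of pinned host allocations into the device address space
    // (done before any other CUDA call that creates the context)
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // allocate pinned, mapped host memory instead of plain malloc()
    float *h_A = NULL;
    cudaHostAlloc((void **)&h_A, mem_size_A, cudaHostAllocMapped);

    // get the device-side pointer that aliases the same physical pages
    float *d_A = NULL;
    cudaHostGetDevicePointer((void **)&d_A, (void *)h_A, 0);

    // the kernel is then launched with d_A directly; no cudaMemcpy of A is performed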

Is there a crucial step I am missing? Am I bypassing the GPU’s L2 cache? Does zero-copy use coalescing when accessing CPU RAM?
My motivation for using zero-copy is to free myself from the GPU RAM size limitation, but the penalty I am seeing seems excessive.
Many thanks in advance.

Zero-copy isn’t free. If it were, there’d be no reason to have cudaMemcpy or any memory on the GPU board.

In general, access to zero-copy memory is slower than access to GPU onboard memory.

It’s not clear what performance expectations you have, but I would expect any code to slow down when switching from using GPU onboard memory to zero-copy. The fact that you only experienced a 50% slowdown is impressive to me. Zero-copy best possible access bandwidth would be the PCIE transfer speed, so ~6GB/s on a Gen2 link. GPU onboard memory can easily be in the 60-120GB/s range depending on your GPU, so easily 10-20 times faster. If your code makes repeated access to memory, you’re going to pay a huge price if that repeated access is going to zero-copy instead of onboard memory.

Hi Robert! Could you advise regarding the minimum zero-copy mapping “packet size”? I mean a situation where the kernel reads small data blocks randomly from the pinned host memory area. The literature advises the use of coalesced reads, although I think the real reason is that mapping happens in “data packets” of a defined size. (Similar to global memory access, where the cache line size determines which read/write sizes are optimal.) Is this true? Can you advise the minimum size of these packets? Thanks in advance!

As far as I know this is not specified anywhere. When things aren’t specified, that has a few ramifications:

  • They might not be specified for a reason
  • Not specified sometimes means that CUDA architects don’t want to pin this down for some reason (e.g. for flexibility in designing future compatible architectures)
  • I generally don’t have permission to release non-public information, except as can be derived from inspection that anyone could do.

From what I have seen of NVIDIA GPU architectures with respect to global memory space accesses (system memory is part of the logical global space when using zero-copy), a 32-byte minimum “packet size” is probably a good guess. This applies to L2 and modern L1, based on what I have been able to test, and I suspect it might also be the “granularity of request” when the SM issues PCIE (host) cycles for zero-copy. Naturally these 32 bytes are adjacent, and probably aligned. There is probably also a mechanism to limit the request when hitting the end of an allocation.

Perhaps importantly, my tests suggest these cycles are not cached in L2, so the “mentality” for optimizing this traffic should be very similar to the “mentality” you would have as a GPU programmer on cc1.x GPUs, which did not have an L2 cache. Try to organize cycles/requests across the warp along these lines:

  • aligned
  • adjacent
  • in units of at least 32 bytes, and ideally with 32-byte granularity (so 32 bytes or 64 bytes or 96 bytes or …)
  • use all the bytes you request
  • requesting more at once (per warp-request) is better
  • spread the requests across many warps and SMs (don’t assume you can saturate PCIE from a single warp or a single SM)
  • try not to request the same data twice (perhaps use shared memory, for example)

Those would be my general suggestions, and they should be roughly consistent with optimal device memory usage in a cc1.x environment (which has been obsolete for years, so it’s just a historical footnote at this point).
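
As a rough, untested sketch of the kind of access pattern I mean (a hypothetical kernel, not taken from the matrixMul sample):

    // illustration only: each block reads one contiguous, aligned tile of mapped
    // (zero-copy) host memory exactly once, with adjacent threads reading adjacent
    // 4-byte words, then reuses the data from shared memory
    __global__ void zc_tile_kernel(const float * __restrict__ zc_in,  // pointer from cudaHostGetDevicePointer
                                   float *out, int n)
    {
        __shared__ float tile[256];                    // assumes blockDim.x == 256

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            tile[threadIdx.x] = zc_in[i];              // each warp reads 32 adjacent floats = 128 bytes,
                                                       // aligned and fully used (see the bullets above)
        __syncthreads();

        if (i < n)
            out[i] = tile[threadIdx.x] * 2.0f;         // placeholder work; any further reuse of the data
                                                       // comes from shared memory, not from PCIE
    }

    // launched with many blocks so the requests are spread across warps and SMs:
    // zc_tile_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);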

Things that I have said which are not specified might be incorrect or might change in the future. YMMV.

Also note that this is a pretty old thread. Numbers I mentioned years ago are no longer typical (like PCIE Gen2, device memory speeds, etc.)

caveat: these are just my suggestions, not any statements of specification

suggestion: perhaps consider managed memory to handle these cases, if you would like GPU L2 caching of system memory.
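
For example, something along these lines (a sketch; my_kernel, grid, block, n, and bytes are placeholders, and oversubscribing device memory this way needs a Pascal or newer GPU):

    // a managed allocation is accessible from both host and device, can be larger
    // than device memory on Pascal and later, and pages are migrated on demand so
    // device-side accesses can be cached normally
    float *a = NULL;
    cudaMallocManaged(&a, bytes);
    // ... fill a[] from host code ...
    my_kernel<<<grid, block>>>(a, n);
    cudaDeviceSynchronize();
    cudaFree(a);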

Thanks, Robert, for the very detailed answer; it helped a lot, first of all in understanding the mechanism behind it! My plan is exactly to spread requests across SMs and warps. Some alignment is possible, but that requirement is not fully met, since at runtime the kernel decides what data to read (from a video stream). Still, by knowing the PCIE transaction granularity, I think it will work fine.

I understand this is not an official specification, but some tests can refine the optimal granularity size, if that is required at all.

I suspect the kernel calculations will hide the mapping latency, but maybe other memory transaction types must be checked too. Thanks a lot!