Can zero-copy access potentially save GPU memory?

_device · August 21, 2009, 3:17am

I have a kernel that needs to store ~1.5GB in GPU memory, which then needs to be copied to the host. Since my cards have less than 1GB, I’m obviously getting the ‘out of memory’ error. I have rewritten the code to use zero-copy in the hope that less memory would be used on the GPU, but I’m still getting the same error…

Is zero-copy the wrong way to go in such situations?

There is an obvious solution: split the one kernel call into several calls and move the data in chunks. But I’m wondering whether there are other workarounds.
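
For reference, a minimal sketch of that chunked fallback. It assumes the output can be partitioned into contiguous chunks that are computed independently; the kernel `computeChunk`, the `CHECK` macro, and the sizes are hypothetical stand-ins, not the real code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Hypothetical kernel: computes `count` results starting at global index
// `offset` and stores them at the start of the per-chunk device buffer.
__global__ void computeChunk(float *out, size_t offset, size_t count)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count)
        out[i] = 2.0f * (float)(offset + i);  // stand-in for the real work
}

int main()
{
    const size_t total = (size_t)1 << 24;  // illustrative total element count
    const size_t chunk = (size_t)1 << 20;  // elements per chunk (assumption)

    float *host = (float *)malloc(total * sizeof(float));
    if (host == NULL)
        return EXIT_FAILURE;

    // Only one chunk's worth of device memory is ever allocated.
    float *dev = NULL;
    CHECK(cudaMalloc((void **)&dev, chunk * sizeof(float)));

    for (size_t offset = 0; offset < total; offset += chunk) {
        const size_t count = (total - offset < chunk) ? (total - offset) : chunk;
        const int threads = 256;
        const int blocks = (int)((count + threads - 1) / threads);

        computeChunk<<<blocks, threads>>>(dev, offset, count);
        CHECK(cudaGetLastError());

        // A blocking cudaMemcpy on the default stream waits for the kernel
        // launched above before copying this chunk back to the host.
        CHECK(cudaMemcpy(host + offset, dev, count * sizeof(float),
                         cudaMemcpyDeviceToHost));
    }

    printf("last element = %f\n", host[total - 1]);

    CHECK(cudaFree(dev));
    free(host);
    return 0;
}
```

With scattered writes like mine this split is not as straightforward as the sketch suggests, which is part of why I am asking about alternatives.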

SPWorley · August 21, 2009, 4:50am

Yes, zero copy will indeed give you “more memory”. It will work.

But remember that zero copy memory behaves differently than global memory, especially in latency. The query and response have to be sent down this narrow little PCIe bus, and even if you don’t use much of the bandwidth, the latency is much, much greater than for normal global memory. A good GPU has over 100GB/sec of memory bandwidth… the PCIe transfer will give you at best 5GB/sec, likely 3GB/sec. Device memory latency is 400-1000 clocks. PCIe latency… geez, I’m scared to even estimate it, but it’s likely more than an order of magnitude higher.

Still, this is not to trivialize the wonderful flexibility of zero-copy. It would be especially good for data that you don’t need to access in a fat stream, but just have occasional scattered queries from a giant dataset or something.

It’s especially useful for writing results, though, since those are fire-and-forget (no latency to worry about), and you even have the nice convenience that your answers are already back home on the host without you needing to do post-kernel copies or anything.
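
A minimal sketch of the write-results-to-host pattern described above. The kernel `writeResults`, the `CHECK` macro, and the buffer size are illustrative assumptions; `cudaDeviceSynchronize` is the current name for what the CUDA 2.x runtime called `cudaThreadSynchronize`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Hypothetical kernel: each thread writes one result through the mapped
// pointer, so the store goes over PCIe into host memory.
__global__ void writeResults(float *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * (float)i;  // stand-in for the real computation
}

int main()
{
    const size_t n = (size_t)1 << 20;  // illustrative element count

    // Request host-memory mapping before the CUDA context is created.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    if (!prop.canMapHostMemory) {
        fprintf(stderr, "device 0 cannot map host memory\n");
        return EXIT_FAILURE;
    }

    // Pinned host allocation that is also mapped into the device address space.
    float *hostBuf = NULL;
    CHECK(cudaHostAlloc((void **)&hostBuf, n * sizeof(float),
                        cudaHostAllocMapped));

    // Device-side alias of the same host buffer.
    float *devAlias = NULL;
    CHECK(cudaHostGetDevicePointer((void **)&devAlias, hostBuf, 0));

    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    writeResults<<<blocks, threads>>>(devAlias, n);
    CHECK(cudaGetLastError());

    // The results are only guaranteed visible on the host after the kernel
    // has finished.
    CHECK(cudaDeviceSynchronize());

    printf("hostBuf[10] = %f\n", hostBuf[10]);

    CHECK(cudaFreeHost(hostBuf));
    return 0;
}
```

Note that nothing here calls `cudaMalloc` for the result buffer, and no `cudaMemcpy` is needed after the kernel: the only allocation is the mapped host one.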

_device · August 21, 2009, 5:33pm

Hi SPWorley,

Thanks for the reply. Indeed, I’m a bit lucky because my kernel spends most of its time calculating, and writing the final result takes next to nothing relative to the total execution time (I don’t even have to bother about coalescing when writing the result). I tried zero-copy for smaller systems that fit into GPU memory, and the kernel took only about 1% longer. I’m pretty sure the zero-copy latencies won’t hurt the performance of the code in my case.

But I also thought zero copy would save me some memory, and unfortunately it didn’t work out. Can the memory write pattern affect this? I.e. scattered writes over the whole array (as in my case) vs. contiguous writes?

_Big_Mac · August 21, 2009, 6:58pm

AFAIK with zero copy, the data goes directly into registers, bypassing device memory, so technically you should be able to address 4GB of CPU memory from your kernel (since today’s GPUs use 32-bit pointers).

It’s a good question how it works when accesses are scattered. The PCI-E logic should sort of “coalesce” data into bursts, but it would be great to hear how we should understand it. Is there any locality to it? Some caching thing? Any guidelines for scattered vs. coalesced zero-copy accesses?

I get the impression that zero-copy memory is still in its infancy, judging by the available documentation, and it shouldn’t be so, as it’s very cool. We need more info on this :)

MMB · September 15, 2009, 1:19pm

Hi SPWorley. I don’t understand your response on this issue. How else can the GPU communicate other than via the PCIe bus? Would you elaborate, please?

MMB

Sarnath · September 16, 2009, 8:50am

Zero-copy may eat your device-address space…

Let us say the size of Global memory is G
Let us say the size of zero-copy memory is Z

If (G + Z <= 4GB): zero-copy will NOT eat your address space.
If (G + Z > 4GB): zero-copy will eat into your G address space (i.e. the cudaMalloc() subsystem will now have less memory to manage).

Disclaimer:
I disclaim all that I said above, because NVIDIA’s implementation could be different.
