I have a kernel that needs to store ~1.5GB in the GPU memory, which then needs to be copied to the host. Since my cards have less than 1GB, I’m obviously getting the ‘out of memory’ error. Have rewritten the stuff to use zero-copy, in hope that less memory will be used on GPU, but still getting the same error…
Is zero-copying a wrong way to go in such situations?
There is an obvious solution to split one kernel call into several calls and move the data by chunks but wondering if possibly there are other workarounds
Yes, zero copy will indeed give you “more memory”. It will work.
But remember that zero copy memory behaves differently than global memory, especially in latency. The query and response have to be sent down this narrow little PCIe bus, and even if you don’t use much of the bandwidth, the latency is much much greater than normal global memory. A good GPU has over 100GB/sec memory bandwidth… the PCIe transfer will give you at best 5, likely 3. Device memory latency is 400-1000 clocks. PCIe latency… geez, I’m scared to even estimate it, but it’s likely more than an order of magnitude slower.
Still, this is not to trivialize the wonderful flexibility of zero-copy. It would be especially good for data that you don’t need to access in a fat stream, but just have occasional scattered queries from a giant dataset or something.
It’s especially useful for writing results, though, since those are fire-and-forget (no latency to worry about), and you even have the nice convenience that your answers are already back home on the host without you needing to do post-kernel copies or anything.
Thanks for the reply. Indeed, I’m a bit lucky because my kernel spends most of the time calculating, and writing the final result takes actually nothing with respect to the total execution time (I even don’t have to bother about coalescing when writing the result). I tried zero-copy for smaller-size systems that could fit into the GPU memory, and the kernel took only somewhat 1% longer. I’m pretty sure the zero copy latencies won’t hit performance of the code in my case.
But I also thought zero copy would save me some memory but unfortunately it didn’t work out. Can memory write pattern affect this? I.e. if writes are scattered over the whole array (as they’re in my case) vs contiguous writes?
AFAIK with zero copy, the data gets directly into registers bypassing device memory so technically you should be able to address 4GB of CPU memory from your kernel (since todays GPUs use 32bit pointers).
It’s a good question about how it works when accesses are scattered. PCI-E logic should sort of “coalesce” data into bursts but it would be great to hear how we should understand it - is there any locality to it? Some caching thing? Any guidelines for scattered vs coalesced zero-copy accesses?
I get the impression that zero-copy memory is still in infancy, judging by available documentation, and it shouldn’t be so as it’s very cool. We need more info on this :)