CUDA 2.2 / Zero-copy access

The CUDA 2.2 release notes describe the new zero-copy feature:

  • Zero-copy access to pinned system memory
    • Allows MCP7x and GT200 (and later) GPUs to use system memory without
      copying it to dedicated (video) memory, for a significant perf improvement.

In what fashion does that memory have to be accessed to achieve good performance?

Current scenario: I have a 3D volume from which I need to read/write slices (sent to/from the host). I first do an H->D transfer (a 2D buffer, from pinned memory), and then use a kernel to write that slice properly into the 3D volume.

With zero-copy: I just need to use the one kernel, which reads from pinned host memory directly?
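For concreteness, a minimal sketch of what that single-kernel path could look like with the CUDA 2.2 mapped-pinned-memory API (the kernel, names, and dimensions are hypothetical; `cudaSetDeviceFlags`, `cudaHostAlloc`, and `cudaHostGetDevicePointer` are the actual 2.2 calls):

```cuda
// Sketch: a kernel reads a 2D slice directly from mapped (zero-copy) host
// memory and writes it into a 3D volume in device memory.
__global__ void writeSlice(const float *slice, float *volume,
                           int width, int height, int z)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        volume[(z * height + y) * width + x] = slice[y * width + x];
}

void uploadSlice(float *d_volume, int width, int height, int z)
{
    // Must be called before the context is created (i.e. before any
    // other CUDA call) to enable mapped pinned allocations.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_slice, *d_slice;
    cudaHostAlloc((void **)&h_slice, width * height * sizeof(float),
                  cudaHostAllocMapped);
    // ... fill h_slice on the CPU ...

    // Get the device-side pointer aliasing the pinned host buffer.
    cudaHostGetDevicePointer((void **)&d_slice, h_slice, 0);

    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    writeSlice<<<grid, block>>>(d_slice, d_volume, width, height, z);
    cudaThreadSynchronize();   // the 2.2-era sync call
    cudaFreeHost(h_slice);
}
```

No `cudaMemcpy` appears anywhere: the kernel's loads pull the slice across PCIe (or straight from sysmem on MCP79) as it executes.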

This precludes overlapping… right? If we can hide the H->D latency, then the non-zero-copy approach will still be faster, I believe? (since we’re using full device bandwidth in the kernel, rather than PCIe bandwidth).

Please correct me if I’m wrong.

Thanks :)


Did I miss something? When was 2.2 released?

No, it wasn’t publicly released yet, but it is available for registered developers.

Are we allowed to talk about the 2.2 release in public forums?

Um, I don’t know. AFAIR 1.1 was discussed here until its public release… but maybe it’s a good idea to create a restricted-access forum for such topics.

Anyway, it’s better to wait until Tim or Simon or someone else from NVIDIA makes things clear…

Official answer: yes, you can talk about the 2.2 beta. If you have bug reports, please make sure to file a bug in the registered developer site (in addition to any prodding you want to do on the forums).


Zero-copy is somewhat confusing when you first look at it, but it might be the most powerful thing we’ve exposed in CUDA. Zero-copy plus pinned memory shared across contexts (another magical 2.2 feature) is a giant cannon that somebody is going to use for some ridiculous application.

First, the caveat. CUDA is currently limited to a 32-bit address space, and zero-copy is enabled per-context, not per-allocation, so any pinned memory allocation will also be a zero-copy allocation (which uses address space) when the appropriate context flag is set. We’re looking at removing this per-context limitation in the future.

Let’s split zero-copy discussion into two separate buckets: MCP79, the easy case, and GT200, the more complicated case.

MCP79: Zero-copy here implies two things: zero-copy access and copy elimination. MCP79 will use any memory on the host directly, so this is really good for low-latency applications. There’s no PCIe traffic or anything like that; system memory is used directly, because MCP79 is the chipset. It’s absolutely ridiculous. The only reason not to use zero-copy on an MCP79 is the 32-bit address-space limitation, so in reality you will pretty much always use zero-copy on MCP79. If you are an audio guy, please write something using zero-copy on MCP79; I’ve really wanted to do this, but I haven’t had time. I expect its perf compared to other things in this segment to be mind-blowing.

GT200: The big complicated case.

When you use zero-copy on GT200, the SM will perform a memory fetch across PCIe directly. The accessed area will not touch global memory or anything like that; it goes straight from PCIe into the SM. If you remember TurboCache from the GeForce 6 timeframe, this is a lot like that. Bandwidth between DRAM and PCIe is additive: now you’ve got ~80 GB/s of DRAM bandwidth plus ~6 GB/s of PCIe bandwidth to play with on a GT200.
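As an illustration of the additive-bandwidth point (a hypothetical sketch, not anything from NVIDIA): a bandwidth-bound kernel could stream one read-only operand over PCIe via zero-copy while keeping the hot operand in DRAM, so the two paths are used concurrently:

```cuda
// Hypothetical: b_zeroCopy is a device pointer obtained with
// cudaHostGetDevicePointer (streams over PCIe, ~6 GB/s), while c_dram
// and out live in device memory (~80 GB/s). The SM issues loads on
// both paths at once, so their bandwidths add.
__global__ void axpyMixed(float a, const float *b_zeroCopy,
                          const float *c_dram, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * b_zeroCopy[i] + c_dram[i];
}
```

Whether this wins depends on which operand is coldest; putting the least-reused data on the PCIe path is the obvious heuristic.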

However, there’s another side to zero-copy–latency. To answer the OP’s question, you can never totally hide PCIe latency. Even if you’ve got perfect overlap and all of your cudaMemcpyAsyncs are hidden by kernel executions, you still have the initial memcpys to the device before you can start executing (plus the last memcpy you have to do). Zero-copy may be faster for these things–depends on your access pattern and any number of variables. Our internal tests have shown that while kernel execution time certainly does increase versus accessing everything in DRAM, the fact that you are doing this in the SM, which is a device whose fundamental task is to hide memory latency while doing computation, can give you really effective latency hiding, so it can offer surprising performance advantages. I’ve been trying to get a week free to bang on it and figure out when exactly it’s useful (e.g., I imagine it’s quite useful in some BLAS calls when you’re limited by memory bandwidth to begin with), so I’m very interested in what people discover with it.
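For reference, the overlap pipeline being compared against might look like this sketch (the `process` kernel and chunk sizes are made up; `cudaThreadSynchronize` is the 2.2-era sync call). Note that chunk 0’s copy can never be hidden, which is exactly the residual latency described above:

```cuda
// Stand-in for real per-chunk work.
__global__ void process(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;
}

// Classic two-stream pipeline: while chunk k's kernel runs in one
// stream, chunk k+1's H->D copy proceeds in the other. h_in must be
// pinned (cudaMallocHost/cudaHostAlloc) for the copies to be async.
void pipelined(const float *h_in, float *d_in, float *d_out,
               int nChunks, int chunkElems)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    for (int k = 0; k < nChunks; ++k) {
        cudaStream_t st = s[k & 1];
        cudaMemcpyAsync(d_in + k * chunkElems, h_in + k * chunkElems,
                        chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<chunkElems / 256, 256, 0, st>>>(d_in + k * chunkElems,
                                                  d_out + k * chunkElems);
    }
    cudaThreadSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```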

PS: you can do cudaMemcpyAsync and zero-copy at the same time. They will slow each other down since you’ve only got so much PCIe bandwidth to play with in the first place, but something to keep in mind…

PPS: also keep in mind that there are all sorts of read-after-write hazards associated with zero-copy. If you write to the region on the CPU and expect it to be immediately visible to the GPU, this is probably PCIe controller dependent. Same going in the other direction. The only thing we guarantee is that if you write to a PCIe location in one thread and read it later from that same thread, you’ll see the updated value.

Oh yes. I’m already rethinking through every host->device interaction in my codes trying to decide whether zero-copy might be useful. Too bad I’m stuck writing my thesis for the next couple weeks and won’t be able to play for a while :(

The 2D texture fetching straight from device memory is also mind-blowing, but that is a topic for another thread.

On topic, I think the only thing you left out in your detailed post, Tim, is what kind of access patterns are ideal. I.e., is it best for a warp to access zero-copy memory with any kind of locality? Or does it not really matter?

Locality is very important, as there’s some minimum PCIe burst length (but I don’t know what it is). I think it will end up looking a lot like GT200’s coalescing, since the PCIe controller should combine a lot of different transactions into a single burst.
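To make the locality point concrete, a sketch (the coalescing analogy is from the post above; the kernels themselves are illustrative): consecutive threads in a warp should touch consecutive addresses, exactly as with GT200 global-memory coalescing, so the PCIe controller can merge the warp’s loads into a few large bursts.

```cuda
// Good: warp reads a contiguous run of the mapped host buffer, so the
// PCIe controller can combine the warp's 32 loads into few bursts.
__global__ void coalescedRead(const float *zeroCopy, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = zeroCopy[i];          // thread i -> element i: contiguous
}

// Bad: a large stride scatters each warp's loads across the buffer, so
// every element likely pays for its own PCIe transaction.
__global__ void stridedRead(const float *zeroCopy, float *out,
                            int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = zeroCopy[((size_t)i * stride) % n];
}
```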

Wow. Once CUDA 2.2 for OS X appears, it might be time to upgrade my MacBook…

CUDA 2.2 will come out for OSX. I think it might even appear simultaneously with 2.2 final for other OSes…

When is CUDA 2.2 going to be released? In a week, in three weeks, in a month? I read somewhere that CUDA 2.2 was/is scheduled for Q1 2009, and I desperately need it because a bug related to cuda-gdb will supposedly be fixed.

BTW, how can I become a registered developer?…er_program.html

It seems that now that the basic stuff is ironed out, the more advanced and powerful features are being introduced into CUDA; this is great. I just have to get a GT200 system now :)

By the way, are there plans to support other chipsets? I have an MCP55, and it seems that MCP73 is Intel-only…

Just read that NVIDIA is coming out later this month with a chipset for Phenom II that can also have an IGP.

As far as I understand, when allocating device memory, CUDA does not guarantee that the memory will be zeroed.
In some cases, before using that memory, I must call cudaMemset() to zero it.
In the CUDA 2.2 beta, with “Zero-copy”, do I no longer need cudaMemset()?
If so, does “Zero-copy” work on devices with compute capability 1.0?
I’m currently using CUDA 2.0.

If you are copying data from host to device, why would you need to memset the memory on the device beforehand?

?? You still have to memset memory. I’m not aware of any modern OS that will clear memory for you when allocated.

Nope. Read the post above for a list of the hardware that supports it.

Update: apparently, atomics do work with zero-copy memory. I was wrong!

I would assume this was true ONLY for GPU writes alone, or CPU writes alone.

I’d bet cash that mixing CPU and GPU atomics would be a huge barrel of intertwined synchronization pain that should not even be considered.

Which is a shame, since zero-copy really lends itself to having the CPU “chip in” as a worker along with the GPU. Using atomics to synchronize the work partitioning would be useful. But I’m not complaining, zero-copy is just the bee’s knees and I have a big smile on my face.

Thanks for your answer; I’m sorry I didn’t read carefully.