Cuda 2.2 / Zero-copy access

deaton · March 17, 2009, 2:56pm

The Cuda 2.2 release notes describe the new zero-copy feature:

Zero-copy access to pinned system memory
- Allows MCP7x and GT200 and later GPUs to use system memory without
  copying to dedicated (video) memory for significant perf improvement.

In what fashion does that memory have to be accessed to achieve good performance?

Current scenario: Now, I have a 3D volume, from which I need to read/write slices from (sent to/from host). I first do a H->D transfer (2D buffer, from pinned memory), and then use a kernel to write that slice properly into the 3D volume.

With zero-copy: I just need to use the one kernel, which reads from pinned host memory directly?

This precludes overlapping… right? If we can hide the H->D latency, then the non-zero-copy approach will still be faster, I believe? (since we’re using full device bandwidth in the kernel, rather than PCIe bandwidth).

Please correct me if I’m wrong.

Thanks :)

Danny

_Big_Mac · March 17, 2009, 5:29pm

Did I miss something? When was 2.2 released?

AndreiB · March 17, 2009, 6:05pm

No, it wasn’t publicly released yet, but it is available for registered developers.

seibert · March 17, 2009, 6:40pm

Are we allowed to talk about the 2.2 release in public forums?

AndreiB · March 17, 2009, 8:22pm

Um, I don’t know. AFAIR 1.1 was discussed here until public release… but maybe it’s a good idea to create restricted access forum for such topics.

Anyway, it’s better to wait until Tim or Simon or someone else from NVIDIA will make things clear…

tmurray · March 17, 2009, 9:42pm

Official answer: yes, you can talk about the 2.2 beta. If you have bug reports, please make sure to file a bug in the registered developer site (in addition to any prodding you want to do on the forums).

Anyway…

Zero-copy is somewhat confusing when you first look at it, but it might be the most powerful thing we’ve exposed in CUDA. Zero-copy plus pinned memory shared across contexts (another magical 2.2 feature) is a giant cannon that somebody is going to use for some ridiculous application.

First, the caveat. CUDA is currently limited to a 32-bit address space, and zero-copy is done per-process, not per-allocation, so any pinned memory allocation will also be a zero-copy allocation (which uses address space) when the appropriate context flag is set. We’re looking at removing this per-context limitation in the future.

Let’s split zero-copy discussion into two separate buckets: MCP79, the easy case, and GT200, the more complicated case.

MCP79: Zero-copy here implies two things–zero-copy and copy elimination. MCP79 will use any memory on the host directly, so this is really good for low-latency applications. There’s no PCIe traffic or anything like that now, sysmem is used directly because MCP79 is the chipset. It’s absolutely ridiculous. The only reason to not use zero-copy on an MCP79 is because of the 32-bit address space limitation, so in reality, you will pretty much always use zero-copy on MCP79. If you are an audio guy, please write something using zero-copy on MCP79–I’ve really wanted to do this, but I haven’t had time. I expect its perf compared to other things in this segment to be mind-blowing.

GT200: The big complicated case.

When you use zero-copy on GT200, the SM will perform a memory fetch across PCIe directly. The accessed area will not touch global memory or anything like that–it goes straight from PCIe into the SM. If you remember TurboCache from the GeForce 6 timeframe, this is a lot like that. Bandwidth between DRAM and PCIe is additive–now you’ve got ~80GB/s of DRAM bandwidth + ~6GB/s of PCIe bandwidth to play with on a GT200.

However, there’s another side to zero-copy–latency. To answer the OP’s question, you can never totally hide PCIe latency. Even if you’ve got perfect overlap and all of your cudaMemcpyAsyncs are hidden by kernel executions, you still have the initial memcpys to the device before you can start executing (plus the last memcpy you have to do). Zero-copy may be faster for these things–depends on your access pattern and any number of variables. Our internal tests have shown that while kernel execution time certainly does increase versus accessing everything in DRAM, the fact that you are doing this in the SM, which is a device whose fundamental task is to hide memory latency while doing computation, can give you really effective latency hiding, so it can offer surprising performance advantages. I’ve been trying to get a week free to bang on it and figure out when exactly it’s useful (e.g., I imagine it’s quite useful in some BLAS calls when you’re limited by memory bandwidth to begin with), so I’m very interested in what people discover with it.

PS: you can do cudaMemcpyAsync and zero-copy at the same time. They will slow each other down since you’ve only got so much PCIe bandwidth to play with in the first place, but something to keep in mind…

PPS: also keep in mind that there are all sorts of read-after-write hazards associated with zero-copy. If you write to the region on the CPU and expect it to be immediately visible to the GPU, this is probably PCIe controller dependent. Same going in the other direction. The only thing we guarantee is that if you write to a PCIe location in one thread and read it later from that same thread, you’ll see the updated value.

MisterAnderson42 · March 17, 2009, 10:06pm

Oh yes. I’m already rethinking through every host->device interaction in my codes trying to decide whether zero-copy might be useful. Too bad I’m stuck writing my thesis for the next couple weeks and won’t be able to play for a while :(

The 2D texture fetching straight from device memory is also mind blowing, but that is a topic for another thread.

On topic, I think the only thing you left out in your detailed post, Tim, is what kind of access patterns are ideal? I.e. is it best for a warp to access zero-copy memory with an kind of locality? Or does it not really matter?

tmurray · March 17, 2009, 10:08pm

Locality is very important, as there’s some minimum PCIe burst length (but I don’t know what it is). I think it will end up looking a lot like GT200’s coalescing, since the PCIe controller should combine a lot of different transactions into a single burst.

seibert · March 17, 2009, 10:46pm

Wow. Once CUDA 2.2 for OS X appears, it might be time to upgrade my MacBook…

tmurray · March 17, 2009, 11:53pm

CUDA 2.2 will come out for OSX. I think it might even appear simultaneously with 2.2 final for other OSes…

pvouzis · March 18, 2009, 2:19am

When is CUDA 2.2 going to be released? In a week, in three weeks, in a month? I read somewhere that CUDA 2.2 was/is scheduled for Q1 2009, and I desperately need it because a bug related to cuda-gdb will supposedly be fixed.

BTW, how can I become a registered developer?

tmurray · March 18, 2009, 3:02am

[url=“NVIDIA Developer Program | NVIDIA Developer”]http://developer.nvidia.com/page/registere...er_program.html[/url]

wumpus · March 18, 2009, 7:40pm

It seems that now the basic stuff is ironed out, the more advanced and powerful features are being introduced into CUDA, this is great. I just have to get a GT200 system now :)

By the way, are there plans to support other chipsets? I have a MCP55, it seems that MCP73 is Intel only…

E.D_Riedijk · March 18, 2009, 11:07pm

Just read (tweakers.net) that NVIDIA is coming later this month with a chipset for phenomII that also can have an IGP.

Quoc_Vinh · March 23, 2009, 9:41am

As far as I understand, when allocate device memory, CUDA not warranty that memory will be set zero.
In some case, before use that memory, I must use cudaMemset() for setting zero.
In cuda 2.2 beta, with “Zero-copy”, I don’t need to use cudaMemset()?
If yes, can “Zero-copy” works on device with compute capability 1.0?
Currently I using Cuda 2.0.
:)

E.D_Riedijk · March 23, 2009, 10:15am

If you are copying data from host to device, why would you need to memset the memory on the device before???

MisterAnderson42 · March 23, 2009, 12:18pm

?? You still have to memset memory. I’m not aware of any modern OS that will clear memory for you when allocated.

Nope. Read the post above for a list of the hardware that supports it.

tmurray · March 23, 2009, 11:45pm

Update: apparently, atomics do work with zero-copy memory. I was wrong!

SPWorley · March 24, 2009, 12:06am

I would assume this was true ONLY for GPU writes alone, or CPU writes alone.

I’d bet cash that mixing CPU and GPU atomics would be a huge barrel of intertwined synchronization pain that should not even be considered.

Which is a shame, since zero-copy really lends itself to having the CPU “chip in” as a worker along with the GPU. Using atomics to synchronize the work partitioning would be useful. But I’m not complaining, zero-copy is just the bee’s knees and I have a big smile on my face.

Quoc_Vinh · March 24, 2009, 12:51am

Thank for your answer, I am sorry I did’t read clearly.

:)