Zero copy & poor performance

agenda · September 14, 2010, 8:28pm

Hi all,

Can someone explain clearly the situation in which a discrete GPU would benefit from zero copy memory allocation? I have yet to find an example where zero copy results in a speedup. I even ran simpleZeroCopy.cu (found in the 3.1 SDK), timed it, and then rewrote it using regular device and host allocations with HtD and DtH memory copies, and the rewritten version was over twice as fast as the zero copy implementation.

If the SDK example is not the correct application of zero copy memory, then what is (again, for discrete GPUs)? I was running on a compute capability 1.3 device.

Thanks!

tmurray · September 14, 2010, 8:38pm

Once, long ago, I wrote a very long post explaining the benefits of zero-copy for discrete GPUs. Let’s see…

http://forums.nvidia.com/index.php?s=&showtopic=92290&view=findpost&p=519529 (that was easy, site:forums.nvidia.com on Google is by far the best way to search the forums)

Basically, if you are reading a buffer, operating on every element once, and writing the buffer out, zero-copy is faster (by definition) than manually performing copies because you’re moving data once instead of twice: data streams directly to the SM instead of making unnecessary stops in GPU DRAM.
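
A minimal sketch of the read-once, write-once pattern described above, using mapped (zero-copy) host memory. The kernel `scaleKernel` and the helper `runZeroCopyScale` are hypothetical names chosen for illustration, not code from the SDK sample; `cudaDeviceSynchronize` is the current name for what the 3.x toolkits called `cudaThreadSynchronize`.

```cuda
#include <cuda_runtime.h>

#define TRY(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernel: reads each element once and writes each result once.
__global__ void scaleKernel(const float* in, float* out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Runs scaleKernel directly on mapped host buffers (no cudaMemcpy).
// Returns 0 if every result is correct, -1 on a CUDA error or mismatch.
// Error paths return early without cleanup, for brevity.
int runZeroCopyScale(int n) {
    cudaDeviceProp prop;
    TRY(cudaGetDeviceProperties(&prop, 0));
    if (!prop.canMapHostMemory) return -1;

    // Should be set before the CUDA context is created on the device.
    TRY(cudaSetDeviceFlags(cudaDeviceMapHost));

    const size_t bytes = (size_t)n * sizeof(float);
    float *hIn, *hOut;   // host views of the pinned, mapped buffers
    float *dIn, *dOut;   // device views of the same memory
    TRY(cudaHostAlloc((void**)&hIn, bytes, cudaHostAllocMapped));
    TRY(cudaHostAlloc((void**)&hOut, bytes, cudaHostAllocMapped));
    TRY(cudaHostGetDevicePointer((void**)&dIn, hIn, 0));
    TRY(cudaHostGetDevicePointer((void**)&dOut, hOut, 0));

    for (int i = 0; i < n; ++i) hIn[i] = (float)i;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(dIn, dOut, n, 2.0f);
    TRY(cudaGetLastError());
    // The host must not read hOut until the kernel has finished.
    TRY(cudaDeviceSynchronize());

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        if (hOut[i] != 2.0f * (float)i) { bad = 1; break; }
    }
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return bad ? -1 : 0;
}
```

Calling `runZeroCopyScale(1 << 20)` from `main` exercises the whole path. Every global load and store in the kernel crosses PCIe here, so coalesced, touch-once access is what this sketch assumes; a kernel that re-reads its input would pay the bus cost on every read.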

agenda · September 14, 2010, 9:13pm

Hmm, shouldn’t the SDK simpleZeroCopy have been an example of this? Or is there a threshold in buffer size where it becomes quicker to use zero copy rather than copying to and from the device?

tmurray · September 14, 2010, 9:30pm

What are you timing? Just the kernel, or memcpy + kernel + memcpy?

agenda · September 14, 2010, 9:43pm

For zero copy I timed the kernel with a syncthreads afterwards. For regular d->h and h->d copies I timed both copies and the kernel (synchronous copies, with a syncthreads afterwards), and the version with explicit copies was still faster.
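
A sketch of how the two measurements described above could be taken with CUDA events, assuming "syncthreads" refers to a host-side synchronization rather than the device-side `__syncthreads()` barrier. `scaleKernel` and `timeBothPaths` are hypothetical names, and the kernel is a stand-in, not the one from the SDK sample. The explicit-copy interval covers H2D copy + kernel + D2H copy; the zero-copy interval covers only the kernel, because the PCIe traffic happens inside it.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define TRY(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical stand-in kernel: one read and one write per element.
__global__ void scaleKernel(const float* in, float* out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Fills *msCopy and *msZeroCopy with elapsed milliseconds for each path.
// Returns 0 on success, -1 on error (early returns skip cleanup, for brevity).
int timeBothPaths(int n, float* msCopy, float* msZeroCopy) {
    const size_t bytes = (size_t)n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;

    TRY(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Path A: pageable host memory plus device memory, explicit copies.
    float* hIn = (float*)malloc(bytes);
    float* hOut = (float*)malloc(bytes);
    if (!hIn || !hOut) return -1;
    float *dIn, *dOut;
    TRY(cudaMalloc((void**)&dIn, bytes));
    TRY(cudaMalloc((void**)&dOut, bytes));

    // Path B: pinned host memory mapped into the device address space.
    float *zIn, *zOut, *zdIn, *zdOut;
    TRY(cudaHostAlloc((void**)&zIn, bytes, cudaHostAllocMapped));
    TRY(cudaHostAlloc((void**)&zOut, bytes, cudaHostAllocMapped));
    TRY(cudaHostGetDevicePointer((void**)&zdIn, zIn, 0));
    TRY(cudaHostGetDevicePointer((void**)&zdOut, zOut, 0));

    for (int i = 0; i < n; ++i) { hIn[i] = 1.0f; zIn[i] = 1.0f; }

    // Warm-up launch so one-time startup cost is not charged to either path.
    scaleKernel<<<blocks, threads>>>(zdIn, zdOut, n, 2.0f);
    TRY(cudaDeviceSynchronize());

    cudaEvent_t start, stop;
    TRY(cudaEventCreate(&start));
    TRY(cudaEventCreate(&stop));

    // Path A: memcpy + kernel + memcpy.
    TRY(cudaEventRecord(start, 0));
    TRY(cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice));
    scaleKernel<<<blocks, threads>>>(dIn, dOut, n, 2.0f);
    TRY(cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost));
    TRY(cudaEventRecord(stop, 0));
    TRY(cudaEventSynchronize(stop));
    TRY(cudaEventElapsedTime(msCopy, start, stop));

    // Path B: kernel only; transfers happen during execution.
    TRY(cudaEventRecord(start, 0));
    scaleKernel<<<blocks, threads>>>(zdIn, zdOut, n, 2.0f);
    TRY(cudaEventRecord(stop, 0));
    TRY(cudaEventSynchronize(stop));
    TRY(cudaEventElapsedTime(msZeroCopy, start, stop));

    int bad = (hOut[n - 1] != 2.0f) || (zOut[n - 1] != 2.0f);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(dIn); cudaFree(dOut);
    cudaFreeHost(zIn); cudaFreeHost(zOut);
    free(hIn); free(hOut);
    return bad ? -1 : 0;
}
```

Which path wins depends on the card, the driver, the buffer size, and the access pattern, so the sketch makes no claim about the result; running it over a range of `n` is one way to look for the size threshold asked about earlier in the thread.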

agenda · September 15, 2010, 8:47am

If it makes any difference, I’m working with a C1060. What I’ve noticed is that as soon as I set the device flag to allow the device to map host memory, I incur a 50%+ performance drop, regardless of whether or not I actually allocate and use zero copy pointers. And then when I do use zero copy (and I believe I am using it for the correct case as you described above), it does not yield any speedups, I’m still stuck at the 50%+ performance drop.

It’s almost as if setting that device flag causes a large amount of overhead for all normal mallocs and cudaMallocs, because I tried to mix and match in my program (have some kernels use regular d<->h copies, and have some kernels use zero copy), but to no avail. So it seems that if you want to set the device flag for mapping to host memory, your program has to 100% utilize zero copy in every instance. But even then, it is still not faster than not using zero copy, at least with this card.

avidday · September 15, 2010, 9:17am

That is really strange. My preferred parallel reduction code uses zero-copy to write back partially reduced values directly to host memory for final reduction, and not only is it a bit faster than using an explicit memory copy afterwards, it doesn’t have any effect on the performance of anything else in any of the applications I use it with (on both compute 1.3 and 2.0 cards). What operating system is this under?
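
A rough sketch of the kind of reduction described above, in which each block writes its partial sum straight into mapped host memory and the host finishes the sum. This is not avidday's actual code: `blockSum`, `reduceSum`, and the fixed block size of 256 are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

#define TRY(call) do { if ((call) != cudaSuccess) return -1; } while (0)

const int kThreads = 256;  // assumed block size; must be a power of two

// Each block reduces kThreads inputs in shared memory, then thread 0 writes
// the block's partial sum to `partial`, which points at mapped host memory.
__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float cache[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = cache[0];
}

// Sums hostData[0..n) and stores the total in *result.
// Returns 0 on success, -1 on error (early returns skip cleanup, for brevity).
int reduceSum(const float* hostData, int n, double* result) {
    TRY(cudaSetDeviceFlags(cudaDeviceMapHost));

    const int blocks = (n + kThreads - 1) / kThreads;
    float* dIn;
    TRY(cudaMalloc((void**)&dIn, (size_t)n * sizeof(float)));
    TRY(cudaMemcpy(dIn, hostData, (size_t)n * sizeof(float),
                   cudaMemcpyHostToDevice));

    // One float per block, visible to both host and device.
    float *hPartial, *dPartial;
    TRY(cudaHostAlloc((void**)&hPartial, (size_t)blocks * sizeof(float),
                      cudaHostAllocMapped));
    TRY(cudaHostGetDevicePointer((void**)&dPartial, hPartial, 0));

    blockSum<<<blocks, kThreads>>>(dIn, dPartial, n);
    TRY(cudaGetLastError());
    // Partial sums are only safe to read on the host after the kernel ends.
    TRY(cudaDeviceSynchronize());

    // Final reduction on the host; no explicit device-to-host copy needed.
    double total = 0.0;
    for (int b = 0; b < blocks; ++b) total += hPartial[b];
    *result = total;

    cudaFree(dIn);
    cudaFreeHost(hPartial);
    return 0;
}
```

Only one small write per block crosses the bus in this arrangement, which is a much lighter use of mapped memory than streaming an entire buffer through it.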

agenda · September 16, 2010, 5:04pm

I ran the zero copy example from the book ‘CUDA by Example’, both their non-zero copy and their zero copy program. While they claim a 20%+ speedup with the GTX card, I have seen a 50% drop with the C1060.
