Zero copy & poor performance

Hi all,

Can someone clearly explain when a discrete GPU would benefit from zero-copy memory allocation? I have yet to find an example where zero copy results in a speedup. I even ran simpleZeroCopy.cu (from the 3.1 SDK), timed it, and then rewrote it using regular device and host allocations with HtD and DtH memory copies, and that version was over twice as fast as the zero-copy implementation.

If the SDK example is not the right use of zero-copy memory, then what is (again, for discrete GPUs)? I was running on a compute capability 1.3 device.

Thanks!

Once, long ago, I wrote a very long post explaining the benefits of zero-copy for discrete GPUs. Let’s see…

http://forums.nvidia.com/index.php?s=&…st&p=519529 (that was easy, site:forums.nvidia.com on Google is by far the best way to search the forums)

Basically, if you are reading a buffer, operating on every element once, and writing the buffer out, zero-copy is faster (by definition) than manually performing copies because you’re moving data once instead of twice: data streams directly to the SM instead of making unnecessary stops in GPU DRAM.
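
For illustration, here is a minimal sketch of that read-once, write-once pattern on mapped pinned memory. It is not taken from the linked post or from the SDK sample: the kernel `scaleInPlace` and the helper `scaleZeroCopy` are hypothetical names, and the code uses `cudaDeviceSynchronize`, which on the CUDA 3.x runtime discussed in this thread would be `cudaThreadSynchronize`.

```cuda
#include <cuda_runtime.h>

// Touch every element exactly once: one read and one write per element.
__global__ void scaleInPlace(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Hypothetical helper: runs the kernel directly on mapped host memory,
// with no cudaMalloc and no cudaMemcpy.
// Returns 0 on success, nonzero on a CUDA error or a wrong result.
int scaleZeroCopy(int n, float factor)
{
    // On older devices this has to happen before the context is created.
    // The result is ignored so that a late call does not abort the sketch;
    // if mapping is really unavailable, cudaHostGetDevicePointer fails below.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    float *hostBuf = NULL;
    float *devAlias = NULL;
    size_t bytes = (size_t)n * sizeof(float);

    if (cudaHostAlloc((void **)&hostBuf, bytes, cudaHostAllocMapped) != cudaSuccess)
        return 1;
    for (int i = 0; i < n; ++i)
        hostBuf[i] = 1.0f;

    // Device-side alias of the same pinned host pages.
    if (cudaHostGetDevicePointer((void **)&devAlias, hostBuf, 0) != cudaSuccess) {
        cudaFreeHost(hostBuf);
        return 2;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleInPlace<<<blocks, threads>>>(devAlias, n, factor);

    // The kernel's writes are only guaranteed visible to the host after a sync.
    int status = (cudaDeviceSynchronize() == cudaSuccess) ? 0 : 3;
    for (int i = 0; status == 0 && i < n; ++i)
        if (hostBuf[i] != factor)
            status = 4;

    cudaFreeHost(hostBuf);
    return status;
}
```

Calling `scaleZeroCopy(1 << 16, 3.0f)` from `main` should return 0 on a device that can map host memory. Each thread reads and writes its element exactly once with coalesced accesses, which is the case described above; a kernel that revisits elements would send every repeat access over the bus.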

Hmm, shouldn’t the SDK’s simpleZeroCopy have been an example of this? Or is there a buffer-size threshold above which zero copy becomes quicker than copying to and from the device?

What are you timing? Just the kernel, or memcpy + kernel + memcpy?

For zero copy I timed the kernel with a cudaThreadSynchronize() afterwards. For regular h->d and d->h copies I timed both copies and the kernel, using synchronous copies with a cudaThreadSynchronize() afterwards, and that version was still faster.
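
A rough sketch of that comparison using CUDA events, in case it helps others reproduce it. This is not the code that was actually timed: `addOne`, `enableHostMapping`, `timeExplicitMs` and `timeZeroCopyMs` are made-up names, the explicit path uses pinned host memory (an assumption; the runs above may have used pageable memory), and a zero-work warm-up launch keeps one-time startup costs out of both intervals.

```cuda
#include <cuda_runtime.h>

// One read and one write per element, so data movement dominates.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Call once before any other CUDA work so mapping is enabled on older devices.
void enableHostMapping()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();  // clear the error a late call may leave behind
}

// Shared timing core: if devBuf is NULL the kernel runs on the mapped alias
// of hostBuf; otherwise HtD copy + kernel + DtH copy are timed together.
static float timeRun(float *hostBuf, float *devBuf, float *alias, int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float ms = -1.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < n; ++i) hostBuf[i] = 0.0f;
    addOne<<<1, 1>>>(alias, 0);   // warm-up launch, touches no data
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    if (devBuf) cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
    addOne<<<(n + 255) / 256, 256>>>(devBuf ? devBuf : alias, n);
    if (devBuf) cudaMemcpy(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);

    // Only report a time if the run finished and produced the right value.
    if (cudaEventSynchronize(stop) == cudaSuccess && hostBuf[n - 1] == 1.0f)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Explicit path. Returns elapsed milliseconds, or -1 on failure.
float timeExplicitMs(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *h = NULL, *d = NULL;
    if (cudaMallocHost((void **)&h, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) { cudaFreeHost(h); return -1.0f; }
    float ms = timeRun(h, d, d, n);
    cudaFree(d);
    cudaFreeHost(h);
    return ms;
}

// Zero-copy path: only the kernel is timed, since it does all the transfers.
float timeZeroCopyMs(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *h = NULL, *alias = NULL;
    if (cudaHostAlloc((void **)&h, bytes, cudaHostAllocMapped) != cudaSuccess) return -1.0f;
    if (cudaHostGetDevicePointer((void **)&alias, h, 0) != cudaSuccess) { cudaFreeHost(h); return -1.0f; }
    float ms = timeRun(h, NULL, alias, n);
    cudaFreeHost(h);
    return ms;
}
```

Call `enableHostMapping()` first, then print both timings for a range of `n`. Because the events bracket the whole interval, the explicit number includes both copies, so the two figures measure the same end-to-end work.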

If it makes any difference, I’m working with a C1060. What I’ve noticed is that as soon as I set the device flag that allows the device to map host memory, I incur a 50%+ performance drop, regardless of whether I actually allocate and use zero-copy pointers. When I do use zero copy (and I believe I am using it for the correct case as you described above), it does not yield any speedup; I’m still stuck at the 50%+ performance drop.

It’s almost as if setting that device flag adds a large amount of overhead to all normal mallocs and cudaMallocs. I tried to mix and match in my program (some kernels using regular d<->h copies, others using zero copy), but to no avail. So it seems that if you set the device flag for mapping host memory, your program has to use zero copy everywhere. But even then, it is still not faster than not using zero copy, at least with this card.
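
One way to isolate the effect of the flag alone is to time the same explicit-copy workload with and without it. The sketch below is an assumption-laden illustration, not a diagnosis: `timeCopyPathMs` is a hypothetical helper, and it relies on `cudaDeviceReset` (named `cudaThreadExit` on the 3.x runtime) to tear down the context so new flags can be applied. On recent runtimes I believe host mapping is enabled implicitly, in which case the two numbers should simply match.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Times pageable-host HtD copy + kernel + DtH copy under the given device
// flags. No mapped memory is allocated, so any difference between runs
// comes from the flag itself. Returns milliseconds, or -1 on failure.
// WARNING: cudaDeviceReset invalidates every existing CUDA allocation
// in the process, so only use this in a standalone test program.
float timeCopyPathMs(int n, unsigned int deviceFlags)
{
    size_t bytes = (size_t)n * sizeof(float);
    float ms = -1.0f;

    cudaDeviceReset();  // destroy the current context, if any
    if (cudaSetDeviceFlags(deviceFlags) != cudaSuccess) return -1.0f;

    float *h = (float *)malloc(bytes);  // ordinary pageable host memory
    float *d = NULL;
    if (h == NULL) return -1.0f;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) { free(h); return -1.0f; }
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    addOne<<<1, 1>>>(d, 0);  // warm-up launch, touches no data
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    addOne<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);

    if (cudaEventSynchronize(stop) == cudaSuccess && h[n - 1] == 1.0f)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    free(h);
    return ms;
}
```

Comparing `timeCopyPathMs(n, cudaDeviceScheduleAuto)` against `timeCopyPathMs(n, cudaDeviceMapHost)` over several repetitions would show whether the flag by itself slows down the ordinary copy path on a given card and driver.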

That is really strange. My preferred parallel reduction code uses zero-copy to write back partially reduced values directly to host memory for final reduction, and not only is it a bit faster than using an explicit memory copy afterwards, it doesn’t have any effect on the performance of anything else in any of the applications I use it with (on both compute 1.3 and 2.0 cards). What operating system is this under?
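
To make the reduction idea concrete, here is a sketch of that general pattern, not the actual code referred to above. The input lives in ordinary device memory, each block writes its partial sum straight into a mapped host array, and the host finishes the reduction. The names `blockSums` and `sumWithMappedPartials` and the fixed 64 x 256 launch shape are assumptions made for the example.

```cuda
#include <cuda_runtime.h>

#define THREADS 256
#define BLOCKS  64

// Each block reduces a strided slice of the input in shared memory, then
// thread 0 writes one float directly into mapped host memory.
__global__ void blockSums(const float *in, int n, float *partials)
{
    __shared__ float cache[THREADS];
    int tid = threadIdx.x;

    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        acc += in[i];
    cache[tid] = acc;
    __syncthreads();

    // Tree reduction; blockDim.x must be a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0) partials[blockIdx.x] = cache[0];
}

// Returns the sum of hostIn[0..n), or -1 on a CUDA error.
float sumWithMappedPartials(const float *hostIn, int n)
{
    // Needed before context creation on older devices; result ignored here.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    float *devIn = NULL, *partials = NULL, *partialsAlias = NULL;
    size_t bytes = (size_t)n * sizeof(float);

    if (cudaHostAlloc((void **)&partials, BLOCKS * sizeof(float), cudaHostAllocMapped) != cudaSuccess)
        return -1.0f;
    if (cudaHostGetDevicePointer((void **)&partialsAlias, partials, 0) != cudaSuccess ||
        cudaMalloc((void **)&devIn, bytes) != cudaSuccess) {
        cudaFreeHost(partials);
        return -1.0f;
    }
    cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);

    blockSums<<<BLOCKS, THREADS>>>(devIn, n, partialsAlias);

    // No cudaMemcpy back: after the sync the partial sums are already on the host.
    float total = -1.0f;
    if (cudaDeviceSynchronize() == cudaSuccess) {
        total = 0.0f;
        for (int b = 0; b < BLOCKS; ++b) total += partials[b];
    }

    cudaFree(devIn);
    cudaFreeHost(partials);
    return total;
}
```

Only 64 floats cross the bus through the mapped buffer, each written once, so the zero-copy write replaces a small device-to-host copy and its launch overhead rather than streaming a large buffer.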

I ran both the zero-copy and the non-zero-copy versions of the example from the book “CUDA by Example”. While the authors claim a 20%+ speedup with the GTX card, I have seen a 50% drop with the C1060.
