Kernel Copy vs. cudaMemcpy

I have begun diving into the world of pinned memory and have discovered some surprising behavior that I don’t quite understand.

On my laptop (PCIe 3.0 x16), my GPU achieves the following speeds in three test scenarios:

Non-Pinned cudaMemcpy: 4600MB/s

Pinned cudaMemcpy: 9700MB/s

Pinned Kernel Copy: 13300MB/s

The kernel copy refers to a kernel that copies elements from a source buffer to a destination buffer. I can post all the code, but I’m not sure it’s necessary; the kernel just copies memory in a coalesced manner.
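In essence it is just a grid-stride loop along these lines (a simplified sketch, not my exact code; for the pinned case, src is the device mapping of the pinned host buffer, e.g. from cudaHostGetDevicePointer()):

```
// Simplified grid-stride copy: adjacent threads access adjacent elements,
// so both the reads and the writes are fully coalesced.
__global__ void copyKernel(const float* __restrict__ src,
                           float* __restrict__ dst,
                           size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}
```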

My confusion comes from the fact that a simple copy kernel outperforms a standard CUDA call. I would expect cudaMemcpy() to outperform my kernel copy, or at least match it, not be about 27% slower (9700 vs. 13300 MB/s).

Is cudaMemcpy() not optimized for copy speed? Is there some advantage to using cudaMemcpy() instead of a kernel copy? Or is it more likely that my code is doing something faulty or my timing is off?

I do know that many GPUs can perform a cudaMemcpy while simultaneously running kernels; is cudaMemcpy slower because it is designed to support this?
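To illustrate what I mean by overlap, here is a rough sketch (the buffer and kernel names are just placeholders):

```
// With pinned host memory, the copy engine can stream data over PCIe
// while the SMs execute an unrelated kernel in another stream.
cudaStream_t copyStream, computeStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&computeStream);

cudaMemcpyAsync(d_buf, h_pinned, nbytes, cudaMemcpyHostToDevice, copyStream);
someKernel<<<grid, block, 0, computeStream>>>(d_other);  // may run concurrently

cudaDeviceSynchronize();
```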

For large copies at least, cudaMemcpy() uses direct memory access (DMA) to move data to/from the GPU. Dedicated DMA hardware allows both the CPU and GPU to continue with computation while the transfer takes place. DMA is especially efficient with pinned memory because it avoids an extra copy into a temporary staging buffer.
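As a rough sketch of the difference (buffer names are placeholders): with a pageable buffer the driver first copies your data into an internal pinned staging buffer and DMAs from there, while a page-locked buffer can be read by the DMA engine directly, which is also what makes cudaMemcpyAsync() usable:

```
size_t nbytes = (size_t)64 << 20;             // e.g. a 64 MB transfer

float* h_pageable = (float*)malloc(nbytes);   // pageable: driver must stage it
float* h_pinned;
cudaMallocHost(&h_pinned, nbytes);            // page-locked: DMA reads it directly

float* d_buf;
cudaMalloc(&d_buf, nbytes);

cudaMemcpy(d_buf, h_pageable, nbytes, cudaMemcpyHostToDevice);        // extra staging copy
cudaMemcpyAsync(d_buf, h_pinned, nbytes, cudaMemcpyHostToDevice, 0);  // direct DMA, async
cudaDeviceSynchronize();
```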

I don’t have any experience with CUDA on PCIe 3.0, but on PCIe 2.0, large DMA transfers get pretty close to the peak PCIe 2.0 rate. That doesn’t appear to be the case for you (the theoretical peak for PCIe 3.0 x16 is just under 16 GB/s). I would have expected cudaMemcpy() to at least match your kernel’s performance.

How big are your data transfers? Small DMA copies suffer from proportionally high overheads. DMA usually beats an in-kernel approach for copies larger than about 1–4 MB (this is from my experience with PCIe 2.0). Have you tried benchmarking your system with the “bandwidthTest” CUDA sample?
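If you are timing the copies yourself, here is a minimal event-based timing sketch (the size is a placeholder; try sweeping a range of transfer sizes). Events are recorded on the GPU timeline, which avoids the usual pitfall of timing an asynchronous call from the host:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t nbytes = (size_t)64 << 20;   // 64 MB; vary this to see the overhead curve
    float *h_pinned, *d_buf;
    cudaMallocHost(&h_pinned, nbytes);  // pinned host buffer
    cudaMalloc(&d_buf, nbytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_pinned, nbytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);         // make sure the transfer has really finished

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.0f MB/s\n", (nbytes / 1.0e6) / (ms / 1.0e3));
    return 0;
}
```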