Kernel Copy vs. cudaMemcpy

I have begun diving into the world of pinned memory and have discovered some surprising behavior that I don’t quite understand.

On my laptop (PCIe 3.0 x16), my GPU achieves the following speeds in three test scenarios:

Non-Pinned cudaMemcpy: 4600MB/s

Pinned cudaMemcpy: 9700MB/s

Pinned Kernel Copy: 13300MB/s

The kernel copy refers to a kernel that copies elements from a source buffer to a destination buffer. I can post all the code, but I’m not sure it’s necessary; the kernel just copies memory in a coalesced manner.
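In essence it is just a grid-stride loop along these lines (a simplified sketch, not my exact code; for the pinned case, src is the device mapping of the pinned host buffer, e.g. from cudaHostGetDevicePointer()):

```
// Simplified grid-stride copy: adjacent threads access adjacent elements,
// so both the reads and the writes are fully coalesced.
__global__ void copyKernel(const float* __restrict__ src,
                           float* __restrict__ dst,
                           size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}
```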

My confusion comes from the fact that a simple copy kernel outperforms a standard CUDA call. I would expect cudaMemcpy() to outperform my kernel copy, or at least match it, not be about 27% slower (9700 vs. 13300 MB/s).

Is cudaMemcpy() not optimized for copy speed? Is there some advantage to using cudaMemcpy() instead of a kernel copy? Or is it more likely that my code is doing something faulty or my timing is off?

I do know that many GPUs can perform a cudaMemcpy while simultaneously running kernels; is cudaMemcpy slower because it is designed to support this?
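To illustrate what I mean by overlap, here is a rough sketch (the buffer and kernel names are just placeholders):

```
// With pinned host memory, the copy engine can stream data over PCIe
// while the SMs execute an unrelated kernel in another stream.
cudaStream_t copyStream, computeStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&computeStream);

cudaMemcpyAsync(d_buf, h_pinned, nbytes, cudaMemcpyHostToDevice, copyStream);
someKernel<<<grid, block, 0, computeStream>>>(d_other);  // may run concurrently

cudaDeviceSynchronize();
```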

For large copies at least, cudaMemcpy() uses direct memory access (DMA) to move data to/from the GPU. Dedicated DMA hardware allows both the CPU and GPU to continue with computation while the transfer takes place. DMA is especially efficient with pinned memory because it avoids an extra copy into a temporary staging buffer.
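As a rough sketch of the difference (buffer names are placeholders): with a pageable buffer the driver first copies your data into an internal pinned staging buffer and DMAs from there, while a page-locked buffer can be read by the DMA engine directly, which is also what makes cudaMemcpyAsync() usable:

```
size_t nbytes = (size_t)64 << 20;             // e.g. a 64 MB transfer

float* h_pageable = (float*)malloc(nbytes);   // pageable: driver must stage it
float* h_pinned;
cudaMallocHost(&h_pinned, nbytes);            // page-locked: DMA reads it directly

float* d_buf;
cudaMalloc(&d_buf, nbytes);

cudaMemcpy(d_buf, h_pageable, nbytes, cudaMemcpyHostToDevice);        // extra staging copy
cudaMemcpyAsync(d_buf, h_pinned, nbytes, cudaMemcpyHostToDevice, 0);  // direct DMA, async
cudaDeviceSynchronize();
```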

I don’t have any experience with CUDA on PCIe 3.0, but on PCIe 2.0, large DMA transfers get pretty close to the peak PCIe 2.0 rate. That doesn’t appear to be the case for you (the theoretical peak for PCIe 3.0 x16 is just under 16 GB/s). I would have expected cudaMemcpy() to at least match your kernel’s performance.

How big are your data transfers? Small DMA copies suffer from proportionally high overheads. DMA usually beats an in-kernel approach for copies larger than about 1–4 MB (this is from my experience with PCIe 2.0). Have you tried benchmarking your system with the “bandwidthTest” CUDA sample?
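If you are timing the copies yourself, here is a minimal event-based timing sketch (the size is a placeholder; try sweeping a range of transfer sizes). Events are recorded on the GPU timeline, which avoids the usual pitfall of timing an asynchronous call from the host:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t nbytes = (size_t)64 << 20;   // 64 MB; vary this to see the overhead curve
    float *h_pinned, *d_buf;
    cudaMallocHost(&h_pinned, nbytes);  // pinned host buffer
    cudaMalloc(&d_buf, nbytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_pinned, nbytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);         // make sure the transfer has really finished

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.0f MB/s\n", (nbytes / 1.0e6) / (ms / 1.0e3));
    return 0;
}
```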