I have started exploring pinned memory and have found some surprising results that I don’t quite understand.
On my laptop (PCIe 3.0 x16), I measured the following transfer speeds for three test scenarios:
Non-pinned cudaMemcpy: 4600 MB/s
Pinned cudaMemcpy: 9700 MB/s
Pinned kernel copy: 13300 MB/s
The kernel copy refers to a kernel that copies elements from a source buffer to a destination buffer. I can post all the code, but I’m not sure it is necessary; the kernel just copies memory in a coalesced manner.
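A minimal sketch of what such a copy kernel could look like is below. It is illustrative only, not the code that produced the numbers above. It assumes a host-to-device copy of `float` elements, a source buffer allocated as mapped pinned memory with `cudaHostAlloc`, and a system with unified addressing so the mapping works without extra device flags. The names `copyKernel` and `pinnedKernelCopy` are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Grid-stride copy: consecutive threads touch consecutive elements,
// so the accesses of each warp are coalesced.
__global__ void copyKernel(const float* __restrict__ src,
                           float* __restrict__ dst, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

// Hypothetical helper: copies n floats from mapped pinned host memory to
// device memory using the kernel, then verifies the result.
// Returns true only if every step succeeded and the data matches.
bool pinnedKernelCopy(size_t n)
{
    const size_t bytes = n * sizeof(float);
    float *hSrc = nullptr, *dSrcAlias = nullptr, *dDst = nullptr;
    std::vector<float> check(n);

    // Pinned host allocation that the GPU can read directly (zero-copy).
    bool ok = cudaHostAlloc((void**)&hSrc, bytes, cudaHostAllocMapped) == cudaSuccess
           && cudaHostGetDevicePointer((void**)&dSrcAlias, hSrc, 0) == cudaSuccess
           && cudaMalloc((void**)&dDst, bytes) == cudaSuccess;

    if (ok) {
        for (size_t i = 0; i < n; ++i) hSrc[i] = (float)(i % 1024);

        // The kernel reads host memory over the bus and writes device memory.
        copyKernel<<<256, 256>>>(dSrcAlias, dDst, n);

        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(check.data(), dDst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (size_t i = 0; ok && i < n; ++i)
            ok = (check[i] == hSrc[i]);
    }

    cudaFree(dDst);                 // cudaFree accepts a null pointer
    if (hSrc) cudaFreeHost(hSrc);
    return ok;
}
```

Calling `pinnedKernelCopy(1 << 20)` copies about one million floats this way; the launch configuration of 256 blocks of 256 threads is an arbitrary choice for the sketch, not a tuned value.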
My confusion comes from the fact that a simple kernel copy outperforms a standard CUDA call. I would expect cudaMemcpy to outperform my kernel copy, or at least match it, not be roughly 27% slower.
Is cudaMemcpy not optimized for copy speed? Is there some advantage to using cudaMemcpy instead of a kernel copy? Or is it more likely that my code is doing something wrong, or that my timing is off?
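One way to rule out a timing problem is to time the copies with CUDA events, after a warm-up transfer, and average over several repetitions. The sketch below shows that approach for the pinned cudaMemcpy case. It assumes a host-to-device copy and 1 MB = 10^6 bytes; the helper name `pinnedMemcpyBandwidthMBps` is hypothetical.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: measures host-to-device cudaMemcpy bandwidth from
// pinned memory. Returns MB/s (1 MB = 1e6 bytes), or -1 on any failure.
float pinnedMemcpyBandwidthMBps(size_t bytes, int reps)
{
    void *hSrc = nullptr, *dDst = nullptr;
    cudaEvent_t start = nullptr, stop = nullptr;
    float mbps = -1.0f;

    bool ok = cudaMallocHost(&hSrc, bytes) == cudaSuccess   // pinned host buffer
           && cudaMalloc(&dDst, bytes) == cudaSuccess
           && cudaEventCreate(&start) == cudaSuccess
           && cudaEventCreate(&stop) == cudaSuccess;

    if (ok) {
        std::memset(hSrc, 0, bytes);

        // Warm-up copy so one-time setup costs are not included in the timing.
        cudaMemcpy(dDst, hSrc, bytes, cudaMemcpyHostToDevice);

        // Events are recorded on the GPU timeline in the default stream.
        cudaEventRecord(start);
        for (int r = 0; r < reps; ++r)
            cudaMemcpy(dDst, hSrc, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);   // wait until the stop event has completed

        float ms = 0.0f;
        if (cudaEventElapsedTime(&ms, start, stop) == cudaSuccess && ms > 0.0f)
            mbps = (float)((double)bytes * reps / 1.0e6 / (ms / 1.0e3));
    }

    if (stop)  cudaEventDestroy(stop);
    if (start) cudaEventDestroy(start);
    cudaFree(dDst);                   // cudaFree accepts a null pointer
    if (hSrc) cudaFreeHost(hSrc);
    return mbps;
}
```

For example, `pinnedMemcpyBandwidthMBps(64u << 20, 10)` times ten 64 MiB transfers. Timing the kernel copy with the same pair of events, around the kernel launch instead of the cudaMemcpy loop, keeps the two measurements comparable.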
I do know that many GPUs can perform a cudaMemcpy while running kernels at the same time; is cudaMemcpy slower because it is designed to support this?
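For context, the overlap mentioned here is normally requested with `cudaMemcpyAsync` on pinned memory and separate streams. The sketch below issues a copy in one stream and an unrelated kernel in another. It only sets up the conditions for overlap and checks the results; it does not prove that the two operations actually ran concurrently. The names `fillKernel` and `overlapCopyAndKernel` are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Independent work for the compute stream: fills a buffer with a constant.
__global__ void fillKernel(float* data, size_t n, float value)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = value;
}

// Hypothetical helper: issues an asynchronous host-to-device copy and an
// unrelated kernel in two streams, then verifies both results.
bool overlapCopyAndKernel(size_t n)
{
    const size_t bytes = n * sizeof(float);
    float *hSrc = nullptr, *dCopy = nullptr, *dWork = nullptr;
    cudaStream_t copyStream = nullptr, computeStream = nullptr;
    std::vector<float> copied(n), filled(n);

    bool ok = cudaMallocHost((void**)&hSrc, bytes) == cudaSuccess   // pinned source
           && cudaMalloc((void**)&dCopy, bytes) == cudaSuccess
           && cudaMalloc((void**)&dWork, bytes) == cudaSuccess
           && cudaStreamCreate(&copyStream) == cudaSuccess
           && cudaStreamCreate(&computeStream) == cudaSuccess;

    if (ok) {
        for (size_t i = 0; i < n; ++i) hSrc[i] = 1.0f;

        // Both calls return to the host without waiting for completion, so
        // the device is free to run the copy and the kernel concurrently.
        cudaMemcpyAsync(dCopy, hSrc, bytes, cudaMemcpyHostToDevice, copyStream);
        fillKernel<<<256, 256, 0, computeStream>>>(dWork, n, 2.0f);

        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(copied.data(), dCopy, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(filled.data(), dWork, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (size_t i = 0; ok && i < n; ++i)
            ok = (copied[i] == 1.0f) && (filled[i] == 2.0f);
    }

    if (computeStream) cudaStreamDestroy(computeStream);
    if (copyStream)    cudaStreamDestroy(copyStream);
    cudaFree(dWork);                  // cudaFree accepts a null pointer
    cudaFree(dCopy);
    if (hSrc) cudaFreeHost(hSrc);
    return ok;
}
```

Whether the copy and the kernel really overlap depends on the device and is best confirmed in a profiler timeline rather than inferred from this code.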