What is the CUDA memcpy hardware bottleneck?

I am a beginner in CUDA programming. I noticed that CUDA memcpy operations (host to device and vice versa) take by far the most processing time (up to 4 ms of a 6 ms image-processing run). I was wondering how different GeForce generations affect this copy time, and whether, e.g., a new RTX 20xx card could speed up the transfer or whether it is bound by PCIe 3.0 x16 or RAM bandwidth. What would be the best memcpy setup I can build with currently available hardware?

For host->device or device->host data copying, the bottleneck is PCI Express on all CUDA GPUs in x86-64 platforms, including the new GeForce RTX GPUs. PCIe Gen3 is the fastest interface supported today, including on the GeForce RTX GPUs.

One method to minimize the impact of h->d or d->h transfer times on an algorithm is to use a pipelined algorithm:

https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/

so that the data copy operations can be overlapped with the compute operations. Not all algorithms will benefit equally from this approach.
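
A minimal sketch of that pipelining idea, in case it helps: the buffer is split into chunks, and each chunk's copy-in, kernel, and copy-out are issued into its own stream so that transfers for one chunk can overlap with compute on another. The kernel `invertPixels`, the helper `chunkSize`, the 48 MiB buffer, and the choice of 4 streams are all made up for illustration and are not from the linked article. The host buffer is pinned with `cudaMallocHost`, since `cudaMemcpyAsync` only overlaps with other work when the host memory is page-locked.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real image-processing kernel.
__global__ void invertPixels(unsigned char *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = 255 - data[i];
}

// Bytes in chunk c when n bytes are split into pieces of chunkBytes
// (the last chunk may be shorter; chunks past the end are empty).
size_t chunkSize(size_t n, size_t chunkBytes, size_t c)
{
    size_t off = c * chunkBytes;
    if (off >= n) return 0;
    return (n - off < chunkBytes) ? (n - off) : chunkBytes;
}

int main()
{
    const size_t n = (size_t)48 << 20;  // 48 MiB, an assumed image size
    const int nStreams = 4;             // assumed; tune for your workload
    const size_t chunk = (n + nStreams - 1) / nStreams;

    unsigned char *h = NULL, *d = NULL;
    CHECK(cudaMallocHost((void **)&h, n));  // pinned host memory
    CHECK(cudaMalloc((void **)&d, n));
    for (size_t i = 0; i < n; ++i) h[i] = (unsigned char)(i & 0xFF);

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Issue copy-in, kernel, copy-out per chunk, each in its own stream.
    for (int c = 0; c < nStreams; ++c) {
        size_t off = (size_t)c * chunk;
        size_t bytes = chunkSize(n, chunk, (size_t)c);
        if (bytes == 0) continue;
        CHECK(cudaMemcpyAsync(d + off, h + off, bytes,
                              cudaMemcpyHostToDevice, streams[c]));
        unsigned blocks = (unsigned)((bytes + 255) / 256);
        invertPixels<<<blocks, 256, 0, streams[c]>>>(d + off, bytes);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h + off, d + off, bytes,
                              cudaMemcpyDeviceToHost, streams[c]));
    }
    CHECK(cudaDeviceSynchronize());

    // Spot-check: every byte should now be 255 - (i & 0xFF).
    int ok = 1;
    for (size_t i = 0; i < n; i += 4097)
        if (h[i] != (unsigned char)(255 - (i & 0xFF))) { ok = 0; break; }
    printf("result %s\n", ok ? "ok" : "MISMATCH");

    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return ok ? 0 : 1;
}
```

How much this gains depends on the ratio of copy time to kernel time; with 4 ms of copying in a 6 ms total, overlap can hide at most the shorter of the two phases.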

In practical terms, make sure your GPUs are in the correct PCIe slots. You should see unidirectional throughput of 11+ GB/sec for a properly configured PCIe gen3 x16 link for the large blocks of ~50MB you appear to be copying.
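
For scale, 50 MB at 11 GB/sec works out to roughly 4.5 ms, which is in line with the ~4 ms reported above. A rough way to check what your own link delivers is to time a large host-to-device copy with CUDA events, as in the sketch below. The 50 MB size, the repetition count, and the helper `gigabytesPerSecond` are assumptions for illustration. The host buffer is pinned with `cudaMallocHost`; copies from ordinary pageable memory are typically slower, so a pageable buffer may not reach the figure quoted above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Convert a byte count and an elapsed time in milliseconds to GB/sec
// (decimal gigabytes, 1 GB = 1e9 bytes).
double gigabytesPerSecond(size_t bytes, float milliseconds)
{
    return (double)bytes / ((double)milliseconds * 1e-3) / 1e9;
}

int main()
{
    const size_t bytes = (size_t)50 * 1000 * 1000;  // 50 MB, assumed size
    const int reps = 10;                            // assumed repeat count

    unsigned char *h = NULL, *d = NULL;
    CHECK(cudaMallocHost((void **)&h, bytes));  // pinned host memory
    CHECK(cudaMalloc((void **)&d, bytes));
    memset(h, 0, bytes);

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up copy so one-time initialization is not included in the timing.
    CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));

    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("host->device: %.2f ms per copy, %.2f GB/sec\n",
           ms / reps, gigabytesPerSecond(bytes * (size_t)reps, ms));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return 0;
}
```

If the reported number is far below 11 GB/sec, the card may be sitting in a slot that is wired for fewer than 16 lanes.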

There is a new PCIe 4.0 standard that offers higher throughput, but as far as I know the only systems that will support it in the near future are some IBM Power9 based servers shipping later this year.

Whether and when PCIe 4.0 will make an appearance in consumer-grade hardware is anyone’s guess. The additional costs are not negligible, from what I understand (like $100+ per motherboard).