I'm looking into ways to cut memcpy time in my CUDA programs.
Memory transfers between the host (DRAM) and the device (RTX 3090) go through the PCIe slot. PCIe 3.0 transfers data at 8 GT/s per lane, and DDR4-2666 memory transfers data at ~2.7 GT/s. Can I reduce memcpy time just by upgrading my DDR4-2666 to DDR5-4800 (~4.8 GT/s)?
Are there any additional constraints?
You need to consider the bandwidth of each interface in bytes/s, not the transfer rate per pin or per lane. The bandwidth of the DRAM on your system is already far larger than the bandwidth of the PCIe 3.0 x16 interface (about 12 GB/s achievable in practice). Increasing the speed of your system DRAM is not likely to have any impact.
The only exception might be if your code is doing transfers at the same time that other CPU threads are making intensive use of the system DRAM.
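To make the bytes/s comparison concrete, here is a minimal sketch of the peak-bandwidth arithmetic. It assumes a 64-bit (8-byte) data path per memory channel and an x16 PCIe 3.0 link with 128b/130b encoding; the function names are made up for illustration. These are theoretical peaks per direction, so the achievable numbers (such as the ~12 GB/s quoted above) will be lower.

```cuda
#include <cstdio>

// Theoretical peak of one 64-bit DDR channel in GB/s:
// mega-transfers per second times 8 bytes per transfer.
// (Assumption: 64-bit data path; dual-channel systems double this.)
constexpr double ddrChannelGBps(double megaTransfersPerSec) {
    return megaTransfersPerSec * 1e6 * 8.0 / 1e9;
}

// Theoretical peak of a PCIe 3.0 link in GB/s, per direction:
// 8 GT/s per lane, 128b/130b line encoding, 8 bits per byte.
constexpr double pcie3GBps(int lanes) {
    return 8.0e9 * lanes * (128.0 / 130.0) / 8.0 / 1e9;
}

int main() {
    printf("DDR4-2666, one channel: %.1f GB/s\n", ddrChannelGBps(2666));
    printf("DDR5-4800, 64 bits    : %.1f GB/s\n", ddrChannelGBps(4800));
    printf("PCIe 3.0 x16, one way : %.1f GB/s\n", pcie3GBps(16));
    return 0;
}
```

Even a single channel of DDR4-2666 (about 21.3 GB/s peak) exceeds the PCIe 3.0 x16 peak (about 15.8 GB/s), which is why faster DRAM is not expected to shorten the copies.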
I would expect a minor incremental performance benefit for transfers from and to pageable system memory, due to the additional system-memory-to-system-memory copy step that such transfers involve. But since you seem to be focused on performance, your application is likely already using pinned rather than pageable memory, making throughput solely limited by the transfer across PCIe.
I concur with Robert_Crovella that you might observe a performance benefit even with pinned memory in exceptional circumstances, e.g. when system memory is being hammered with other activity (either CPU-initiated or transfers to/from other GPUs or high-speed I/O devices). I do not recall ever observing such a situation with recent x86-64 systems that use high-speed DDR4, but it seems possible in principle.
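If you want to check whether your transfers already use pinned memory effectively, a rough comparison like the following may help. It is only a sketch: the buffer size (256 MiB), the helper name copyGBps, and the CHECK macro are illustrative choices, and the measured numbers depend on the platform.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Illustrative error-check helper (not part of the CUDA API).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(e_));                          \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Time one host-to-device copy of `bytes` bytes and return GB/s.
static double copyGBps(void* dst, const void* src, size_t bytes) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice)); // warm-up
    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return (bytes / 1e9) / (ms / 1e3);
}

int main() {
    const size_t bytes = 256u << 20;  // 256 MiB, arbitrary test size

    void* dBuf = nullptr;
    CHECK(cudaMalloc(&dBuf, bytes));

    // Pageable host memory: the driver stages the copy internally.
    void* hPageable = malloc(bytes);
    if (!hPageable) { fprintf(stderr, "malloc failed\n"); return 1; }
    memset(hPageable, 1, bytes);

    // Pinned (page-locked) host memory: the GPU can DMA from it directly.
    void* hPinned = nullptr;
    CHECK(cudaMallocHost(&hPinned, bytes));
    memset(hPinned, 1, bytes);

    printf("pageable: %.2f GB/s\n", copyGBps(dBuf, hPageable, bytes));
    printf("pinned  : %.2f GB/s\n", copyGBps(dBuf, hPinned, bytes));

    CHECK(cudaFreeHost(hPinned));
    free(hPageable);
    CHECK(cudaFree(dBuf));
    return 0;
}
```

If the pinned figure is already close to what the PCIe link can deliver, faster system memory is unlikely to change it; any gain would show up mainly in the pageable figure.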
Note that PCIe transfer is packetized at various protocol levels, and therefore total effective throughput is lower when transferring in small chunks instead of large chunks. Close to maximum possible PCIe throughput is usually achieved only when individual transfers reach on the order of a million bytes. That might influence how you structure transfers across PCIe in your code.
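One way to see the chunk-size effect on a particular system is to move the same total amount of data in pieces of different sizes and compare throughput. The sketch below assumes pinned host memory and a 64 MiB total; chunkedGBps and CHECK are hypothetical helper names, and the chunk sizes are arbitrary sample points.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Illustrative error-check helper (not part of the CUDA API).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(e_));                          \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Copy `total` bytes host-to-device in pieces of at most `chunk` bytes
// on the default stream and return the overall throughput in GB/s.
static double chunkedGBps(char* dDst, const char* hSrc,
                          size_t total, size_t chunk) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    for (size_t off = 0; off < total; off += chunk) {
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        CHECK(cudaMemcpyAsync(dDst + off, hSrc + off, n,
                              cudaMemcpyHostToDevice, 0));
    }
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return (total / 1e9) / (ms / 1e3);
}

int main() {
    const size_t total = 64u << 20;  // 64 MiB moved per measurement

    char* hSrc = nullptr;
    char* dDst = nullptr;
    CHECK(cudaMallocHost((void**)&hSrc, total));  // pinned host buffer
    CHECK(cudaMalloc((void**)&dDst, total));
    memset(hSrc, 1, total);

    chunkedGBps(dDst, hSrc, total, total);  // warm-up, result discarded

    // Sweep chunk sizes from 4 KiB up to the full buffer.
    for (size_t chunk = 4u << 10; chunk <= total; chunk *= 4) {
        printf("chunk %8zu KiB: %6.2f GB/s\n", chunk >> 10,
               chunkedGBps(dDst, hSrc, total, chunk));
    }

    CHECK(cudaFree(dDst));
    CHECK(cudaFreeHost(hSrc));
    return 0;
}
```

The measured figure for small chunks includes per-call launch overhead in addition to PCIe protocol overhead, so the sweep shows the combined effect rather than the packetization cost alone.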
It may also be worth considering simple forms of data compression when moving data across PCIe, then decompressing on the GPU if necessary. This approach may be as simple as choosing the narrowest suitable data type, e.g. choosing float instead of double. A lot of data provided by real-life data sources such as sensors tends to require a relatively small number of bits, as A/D converters typically spit out only 12-16 bits.
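As a rough illustration of the narrow-type idea, the sketch below uploads samples as 16-bit integers and widens them to float on the GPU. Everything specific here is an assumption made for the example: 12-bit samples stored in uint16_t, normalization by 4095, and the names widenToFloat, uploadAndWiden, and CHECK.

```cuda
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Illustrative error-check helper (not part of the CUDA API).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(e_));                          \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// "Decompress" on the GPU: widen raw 16-bit samples to scaled floats.
__global__ void widenToFloat(const uint16_t* raw, float* out,
                             int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = raw[i] * scale;
}

// Upload n synthetic 12-bit samples as uint16_t, widen them on the GPU,
// and return how many results differ from a host-side reference.
static int uploadAndWiden(int n) {
    const float scale = 1.0f / 4095.0f;  // assumed 12-bit full scale
    std::vector<uint16_t> hRaw(n);
    for (int i = 0; i < n; ++i) hRaw[i] = (uint16_t)(i & 0x0FFF);

    uint16_t* dRaw = nullptr;
    float* dVal = nullptr;
    CHECK(cudaMalloc((void**)&dRaw, n * sizeof(uint16_t)));
    CHECK(cudaMalloc((void**)&dVal, n * sizeof(float)));

    // Only 2 bytes per sample cross PCIe (8 if they were sent as double).
    CHECK(cudaMemcpy(dRaw, hRaw.data(), n * sizeof(uint16_t),
                     cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    widenToFloat<<<blocks, threads>>>(dRaw, dVal, n, scale);
    CHECK(cudaGetLastError());

    // Copy the widened data back only to verify the sketch.
    std::vector<float> hVal(n);
    CHECK(cudaMemcpy(hVal.data(), dVal, n * sizeof(float),
                     cudaMemcpyDeviceToHost));

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (fabsf(hVal[i] - hRaw[i] * scale) > 1e-6f) ++mismatches;
    }
    CHECK(cudaFree(dVal));
    CHECK(cudaFree(dRaw));
    return mismatches;
}

int main() {
    const int n = 1 << 20;
    printf("mismatches: %d\n", uploadAndWiden(n));
    printf("bytes sent as uint16_t: %zu, as double: %zu\n",
           n * sizeof(uint16_t), n * sizeof(double));
    return 0;
}
```

In a real application the widened data would stay on the GPU for further processing; whether the extra kernel pays off depends on how transfer-bound the application is.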