CPU memory configuration affects GPU to CPU transfer rates

Has anyone ever seen the system CPU memory configuration make a huge difference to the transfer rate from GPU to CPU? Before you answer yes (duh!), let me explain that the basic specs of the RAM are the same: 1600 DDR3, ECC, etc. The rank is different, the size is different, and the CAS latency may be marginally different. When we run our program, the memory transfer normally takes less than 5 ms to complete (it’s a big chunk of data). However, with a little variation in the memory (i.e. rank), we sometimes see the memory transfer take more than 20 ms. That’s a huge difference for such a little change. Does anyone have any ideas?

In a transfer between the host and the device (in either direction), the system memory serves as either source or destination, so it is entirely possible that the performance of the system memory can influence the throughput of host/device transfers.

This applies even more when the transfer involves pageable (unpinned) host memory, as such transfers require an extra copy inside the system memory, from the user memory to a pinned DMA buffer maintained by the driver (or vice versa). When running with PCIe gen3 transfers, it is possible for the bandwidth requirements of host/device transfers to exceed the available system memory bandwidth.
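To put rough numbers on the point above, here is a back-of-the-envelope sketch in Python. It is a model, not a measurement. The assumptions are mine: a x16 PCIe gen3 link, theoretical peak rates with no protocol overhead, no help from CPU caches, and a pageable transfer that touches system memory three times per byte (CPU read of the user buffer, CPU write of the staging buffer, DMA access to the staging buffer) versus once for a pinned transfer.

```python
# Rough model only: theoretical peaks, caches and protocol overhead ignored.

# PCIe gen3: 8 GT/s per lane, 128b/130b encoding, 16 lanes, 8 bits per byte.
PCIE_GEN3_X16_GBPS = 8.0 * 16 * (128 / 130) / 8   # about 15.75 GB/s per direction

# DDR3-1600: 1600 MT/s on a 64-bit (8-byte) channel.
DDR3_1600_CHANNEL_GBPS = 1600e6 * 8 / 1e9         # 12.8 GB/s per channel


def system_memory_traffic(pcie_rate_gbps, pinned):
    """System memory traffic (GB/s) needed to sustain a copy at pcie_rate_gbps.

    Pinned: the DMA engine touches the host buffer once.
    Pageable (assumed): one extra CPU read plus one extra CPU write for staging.
    """
    touches_per_byte = 1 if pinned else 3
    return touches_per_byte * pcie_rate_gbps


if __name__ == "__main__":
    for pinned in (True, False):
        need = system_memory_traffic(PCIE_GEN3_X16_GBPS, pinned)
        kind = "pinned" if pinned else "pageable"
        print(f"{kind:9s}: needs about {need:.1f} GB/s of system memory traffic")
    for channels in (1, 2, 4):
        print(f"{channels} channel(s) of DDR3-1600: "
              f"{channels * DDR3_1600_CHANNEL_GBPS:.1f} GB/s theoretical peak")
```

Under these assumptions a pageable transfer at full link rate would need roughly 47 GB/s of memory traffic, which is more than two channels of DDR3-1600 can deliver even in theory (25.6 GB/s), so anything that lowers achievable memory bandwidth can show up directly in the copy time.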

Additional issues exist in multi-socket systems, where transfers from/to GPU may involve either the “near” or the “far” CPU. Since each CPU has its own memory controller, there is likewise “near” and “far” system memory and transfers to/from the “far” memory would tend to be slower. You would want to carefully control CPU and memory affinity, for example with numactl under Linux.
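As a sketch of how the affinity could be checked on Linux, the Python below reads the NUMA node the kernel reports for a PCI device from sysfs and pins the current process to the CPUs of that node. The function names are mine, the PCI address in the usage note is a placeholder, and this only sets CPU affinity. Memory placement then relies on Linux's default local-allocation policy, which is an assumption about your setup.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-7,16-23' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def pin_to_device_node(pci_address):
    """Pin this process to the CPUs of the NUMA node nearest a PCI device.

    pci_address is a full address such as '0000:03:00.0' (placeholder).
    Returns the node number, or None if the kernel reports no NUMA node (-1).
    """
    with open(f"/sys/bus/pci/devices/{pci_address}/numa_node") as f:
        node = int(f.read())
    if node < 0:
        return None
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # 0 means the calling process (Linux only)
    return node
```

You would call `pin_to_device_node(...)` with your GPU's PCI address before allocating host buffers. The numactl equivalent is along the lines of `numactl --cpunodebind=0 --membind=0 ./app`, where the node number and `./app` are placeholders.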

Note that host/device transfer speeds depend on the size of the transferred data, due to fixed per-transfer overheads. Reaching peak transfer rates typically requires copying chunks of several MB.
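The size dependence can be illustrated with a simple model in which each transfer costs a fixed overhead plus the time to move the bytes at peak rate. The overhead (10 µs) and peak rate (12 GB/s) below are illustrative placeholders, not measured values for any particular system.

```python
def transfer_time_s(size_bytes, overhead_s, peak_bytes_per_s):
    """Modeled transfer time: fixed overhead plus size divided by peak rate."""
    return overhead_s + size_bytes / peak_bytes_per_s


def effective_bandwidth(size_bytes, overhead_s, peak_bytes_per_s):
    """Achieved bytes per second for one transfer under the model above."""
    return size_bytes / transfer_time_s(size_bytes, overhead_s, peak_bytes_per_s)


if __name__ == "__main__":
    OVERHEAD_S = 10e-6   # placeholder fixed cost per transfer
    PEAK = 12e9          # placeholder peak rate in bytes per second
    for size in (4 * 1024, 64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
        bw = effective_bandwidth(size, OVERHEAD_S, PEAK)
        print(f"{size / 1024:10.0f} KiB: {bw / 1e9:6.2f} GB/s "
              f"({100 * bw / PEAK:5.1f}% of peak)")
```

With these placeholder numbers the model gives about 35% of peak at 64 KiB and over 99% at 16 MiB, which is the general shape to expect even though the actual constants will differ.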

You may want to compare the system memory performance in isolation (for example with STREAM) to see whether there are any significant performance differences between the two machines. There could be SBIOS configuration issues on one of the systems that reduce system memory performance.
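If STREAM is not at hand, a crude stand-in for a first comparison is to time a large in-memory copy. The sketch below is not STREAM: it is single-threaded and will report lower numbers than a tuned multi-threaded run, so treat it only as a relative check between the two machines. The function name and the default sizes are my own choices.

```python
import time


def copy_bandwidth_gbps(size_bytes=64 * 1024 * 1024, repeats=5):
    """Best-of-N bandwidth of a bulk host memory copy, in GB/s.

    Counts both the bytes read and the bytes written, as STREAM's Copy does.
    """
    # Fill the source with non-zero bytes so its pages are really allocated.
    src = bytearray(b"\x01") * size_bytes
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment, a bulk copy in CPython
        best = min(best, time.perf_counter() - start)
    return 2 * size_bytes / best / 1e9


if __name__ == "__main__":
    print(f"host copy bandwidth: about {copy_bandwidth_gbps():.1f} GB/s")
```

Run it on both machines with the same buffer size, well above the last-level cache size. A large gap between the two results would point at the memory configuration or SBIOS settings rather than at the GPU or the driver.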