Memory Throughput Discrepancies on RTX A4500 with PCIe Gen4

Hello NVIDIA Developer Community,

I am running memory throughput tests on a system equipped with 2 NVIDIA RTX A4500 (Ampere) GPUs and an Intel Xeon CPU. The GPUs are connected via PCIe Gen4 x16, which offers a maximum theoretical throughput of 32 GiB/s. However, based on insights from our electrical engineer and accounting for PCIe switches and other system considerations, we anticipated achieving around 25 GiB/s in practical measurements.

I measured the following transfer speeds:

  • Device-to-Host (d2h): 24.59 GiB/s
  • Host-to-Device (h2d): 19.79 GiB/s
  • Device-to-Device (d2d/p2p): 19.49 GiB/s

For reference, the host memory is allocated with cudaHostAlloc (pinned memory) and the GPU memory with cudaMalloc. Additionally, before initiating GPU-to-GPU transfers, peer access is enabled using cudaDeviceEnablePeerAccess.
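
For illustration, the h2d part of the test looks roughly like the simplified sketch below (buffer size and repetition count are placeholders, and error checking is omitted); d2h is the same with source and destination swapped, and the p2p case does a peer copy after cudaDeviceEnablePeerAccess:

    // Simplified sketch of the h2d measurement; sizes and iteration counts are placeholders.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 256u << 20;   // 256 MiB per copy (placeholder)
        const int    reps  = 20;           // number of timed copies (placeholder)

        void *h_buf = nullptr, *d_buf = nullptr;
        cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);   // pinned host buffer
        cudaMalloc(&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        for (int i = 0; i < reps; ++i)
            cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        const double gib_per_s = (double)bytes * reps / (ms / 1000.0) / (1 << 30);
        printf("h2d: %.2f GiB/s\n", gib_per_s);

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }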

My question is: What could be causing the lower-than-expected throughput in certain tests (especially h2d and d2d), and are there any strategies or best practices to overcome these limitations to more closely approach the theoretical maximum speeds?

Any insights or suggestions on how to improve these transfer rates would be greatly appreciated.

Thanks in advance,

You don’t mention what you’re using to test with, but if you haven’t already, this utility, which replaces the old bandwidthTest CUDA sample, could be worth trying.

The d2h number is totally within the expected range. If your expectations differed, adjust your expectations.

The d2d (p2p) number is a tad on the low side. No information has been presented that would explain why it is not closer to the h2d value, which would be my expectation. It could have something to do with how your hardware is configured or with the measurement methodology.

For reference, what is the Xeon CPU used here? Is this a multi-socket system?

Be careful not to mix up GiB and GB here. While memory sizes are usually stated in GiB, throughput is commonly measured in GB/sec. PCIe Gen4 signals at 16 GT/s (gigatransfers per second) per lane; for an x16 link this equates to 32 GB/sec of raw bandwidth. After the overhead of the 128b/130b line encoding, about 31.5 GB/sec remain.

PCIe uses packetized transport, so for practically achievable throughput one needs to consider packet-header overhead versus packet payload size. I believe for GPUs the maximum payload size is 256 bytes, so the throughput achievable in theory is about 87% of 31.5 GB/sec, i.e. roughly 27.4 GB/sec. For data transfers to and from the GPU there is also per-transfer overhead, which is why effective throughput increases with transfer size; it should approach the practical maximum at transfer sizes of several MB.
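
Spelled out as a quick stand-alone calculation (the 0.87 packet efficiency is the rough approximation from the previous paragraph, and your GiB/s measurements are converted to GB/s with 1 GiB/s ≈ 1.074 GB/s so they can be compared directly):

    // Back-of-the-envelope PCIe Gen4 x16 numbers; the 0.87 packet efficiency
    // is an approximation, not an exact figure.
    #include <cstdio>

    int main() {
        const double gt_per_s  = 16.0;            // PCIe Gen4: 16 GT/s per lane
        const double lanes     = 16.0;            // x16 link
        const double encoding  = 128.0 / 130.0;   // 128b/130b line encoding
        const double link_GBps = gt_per_s * lanes * encoding / 8.0;  // ~31.5 GB/s
        const double pkt_eff   = 0.87;            // approx. payload vs. header/framing overhead
        const double ceiling   = link_GBps * pkt_eff;                // ~27.4 GB/s

        printf("encoded link rate : %.1f GB/s\n", link_GBps);
        printf("practical ceiling : %.1f GB/s\n", ceiling);

        // Measured values from the original post, converted from GiB/s to GB/s
        // (1 GiB/s = 1.073741824 GB/s):
        const double gib = 1.073741824;
        printf("d2h: %.1f GB/s  h2d: %.1f GB/s  d2d: %.1f GB/s\n",
               24.59 * gib, 19.79 * gib, 19.49 * gib);   // ~26.4, ~21.2, ~20.9
        return 0;
    }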

Generally I would expect measured throughput for a PCIe Gen4 x16 link directly between a CPU and a GPU to max out in the range of 22 GB/sec to 26 GB/sec.

There could also be PCIe switches between the two GPUs on your motherboard. In modern systems many PCIe slots are wired directly to the CPU, but not all.

Try to find out more about your system's PCIe topology; nvidia-smi topo -m and the motherboard's block diagram can show how the two GPUs connect to each other and to the CPU.