Memory Throughput Discrepancies on RTX A4500 with PCIe Gen4

Hello NVIDIA Developer Community,

I am running memory throughput tests on a system equipped with 2 NVIDIA RTX A4500 (Ampere) GPUs and an Intel Xeon CPU. The GPUs are connected via PCIe Gen4 x16, which offers a maximum theoretical throughput of 32 GiB/s. However, based on insights from our electrical engineer and accounting for PCIe switches and other system considerations, we anticipated achieving around 25 GiB/s in practical measurements.

I measured the following transfer speeds:

  • Device-to-Host (d2h): 24.59 GiB/s
  • Host-to-Device (h2d): 19.79 GiB/s
  • Device-to-Device (d2d/p2p): 19.49 GiB/s

For reference, the host memory is allocated with cudaHostAlloc (pinned memory) and the GPU memory with cudaMalloc. Additionally, before initiating GPU-to-GPU transfers, peer access is enabled using cudaDeviceEnablePeerAccess.
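
For illustration, the h2d part of the test looks roughly like the simplified sketch below (buffer size and repetition count are placeholders, and error checking is omitted); d2h is the same with source and destination swapped, and the p2p case does a peer copy after cudaDeviceEnablePeerAccess:

    // Simplified sketch of the h2d measurement; sizes and iteration counts are placeholders.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 256u << 20;   // 256 MiB per copy (placeholder)
        const int    reps  = 20;           // number of timed copies (placeholder)

        void *h_buf = nullptr, *d_buf = nullptr;
        cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);   // pinned host buffer
        cudaMalloc(&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        for (int i = 0; i < reps; ++i)
            cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        const double gib_per_s = (double)bytes * reps / (ms / 1000.0) / (1 << 30);
        printf("h2d: %.2f GiB/s\n", gib_per_s);

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }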

My question is: What could be causing the lower-than-expected throughput in certain tests (especially h2d and d2d), and are there any strategies or best practices to overcome these limitations to more closely approach the theoretical maximum speeds?

Any insights or suggestions on how to improve these transfer rates would be greatly appreciated.

Thanks in advance,

You don’t mention what you’re using to test with, but if you haven’t already, this utility, which replaces the old bandwidthTest CUDA sample, could be worth trying.

The d2h number is totally within the expected range. If your expectations differed, adjust your expectations.

The d2d (p2p) number is a tad on the low side. No information has been presented that would explain why it is not closer to the h2d value, which would be my expectation. It could have something to do with how your hardware is configured or with the measurement methodology.

For reference, what is the Xeon CPU used here? Is this a multi-socket system?

Be careful not to mix up GiB and GB here. While memory sizes are usually stated in GiB, throughput is commonly measured in GB/sec. PCIe Gen4 signals at 16 GT/s (gigatransfers per second) per lane; for an x16 link this equates to 32 GB/sec of raw bandwidth. After the overhead of the 128b/130b line encoding, about 31.5 GB/sec remain.

PCIe uses packetized transport, so for practically achievable throughput one needs to consider packet-header overhead versus packet payload size. I believe for GPUs the maximum payload size is 256 bytes, so the throughput achievable in theory is about 87% of 31.5 GB/sec, i.e. roughly 27.4 GB/sec. For data transfers to and from the GPU there is also per-transfer overhead, which is why effective throughput increases with transfer size; it should approach the practical maximum at transfer sizes of several MB.
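
Spelled out as a quick stand-alone calculation (the 0.87 packet efficiency is the rough approximation from the previous paragraph, and your GiB/s measurements are converted to GB/s with 1 GiB/s ≈ 1.074 GB/s so they can be compared directly):

    // Back-of-the-envelope PCIe Gen4 x16 numbers; the 0.87 packet efficiency
    // is an approximation, not an exact figure.
    #include <cstdio>

    int main() {
        const double gt_per_s  = 16.0;            // PCIe Gen4: 16 GT/s per lane
        const double lanes     = 16.0;            // x16 link
        const double encoding  = 128.0 / 130.0;   // 128b/130b line encoding
        const double link_GBps = gt_per_s * lanes * encoding / 8.0;  // ~31.5 GB/s
        const double pkt_eff   = 0.87;            // approx. payload vs. header/framing overhead
        const double ceiling   = link_GBps * pkt_eff;                // ~27.4 GB/s

        printf("encoded link rate : %.1f GB/s\n", link_GBps);
        printf("practical ceiling : %.1f GB/s\n", ceiling);

        // Measured values from the original post, converted from GiB/s to GB/s
        // (1 GiB/s = 1.073741824 GB/s):
        const double gib = 1.073741824;
        printf("d2h: %.1f GB/s  h2d: %.1f GB/s  d2d: %.1f GB/s\n",
               24.59 * gib, 19.79 * gib, 19.49 * gib);   // ~26.4, ~21.2, ~20.9
        return 0;
    }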

Generally I would expect measured throughput for a PCIe Gen4 x16 link directly between a CPU and a GPU to max out in the range of 22 GB/sec to 26 GB/sec.

There could also be PCIe switches between the two GPUs on your motherboard. In modern systems many PCIe slots are wired directly to the CPU, but not all.

Try to find out more about your system's PCIe topology; nvidia-smi topo -m and the motherboard's block diagram can show how the two GPUs connect to each other and to the CPU.