Large CUDA Bandwidth Discrepancy on Identical RTX A4000 GPUs (EPYC 9124 vs. 7343, Supermicro H13SSL-N vs. H12SSL-CT)

Hello,

I’m observing a significant bandwidth difference in CUDA memory transfers between two servers using the same NVIDIA RTX A4000 GPU model.

I ran the bandwidthTest CUDA sample (v12.2) with --memory=pageable on both machines. Here’s a side-by-side comparison:

CUDA Bandwidth Test Results

  Transfer Direction   Machine A (H12SSL-CT / EPYC 7343)   Machine B (H13SSL-N / EPYC 9124)
  Host → Device        7.4 GB/s                            21.9 GB/s
  Device → Host        6.6 GB/s                            19.3 GB/s
  Device → Device      254.8 GB/s                          376.4 GB/s

Result = PASS on both machines
Both use pageable memory, and results are consistent across multiple runs.
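
For reference, the invocation looked like this on both machines (the binary name and location assume the CUDA 12.2 samples build tree; adjust the path for your install). A default run reports all three directions in one pass:

  # Pageable host memory; reports Host→Device, Device→Host, and Device→Device bandwidth
  ./bandwidthTest --memory=pageable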


Hardware Comparison

Machine A:

  • CPU: AMD EPYC 7343 (16 cores, 32 threads, max 3.94 GHz)
  • Motherboard: Supermicro H12SSL-CT
  • SMBIOS: v3.3

Machine B:

  • CPU: AMD EPYC 9124 (16 cores, 32 threads, max 3.0 GHz)
  • Motherboard: Supermicro H13SSL-N
  • SMBIOS: v3.5

Both have:

  • 1× NVIDIA RTX A4000
  • 32 logical CPUs
  • Boost enabled

Questions

  1. What could explain the 3× Host ↔ Device bandwidth difference?
  2. Could PCIe link speed or platform/chipset differences (e.g., Gen3 vs. Gen4) be the root cause?
  3. What tools or commands do you recommend to verify the GPU’s PCIe lane width and speed reliably?
  4. Any BIOS or firmware settings I should check that might be capping bandwidth?

Thanks in advance

  2. Could PCIe link speed or platform/chipset differences (e.g., Gen3 vs. Gen4) be the root cause?

Yes. You can run nvidia-smi -q -d PCI to see the current and maximum link generation and width the GPU has negotiated.
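
For example (field names as in recent nvidia-smi releases; the exact output layout can vary by driver version):

  # Full PCI section of the device query, including GPU Link Info (current/max generation and width)
  nvidia-smi -q -d PCI

  # Or just the link fields in CSV form
  nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv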

The other issue could be that you are using pageable memory. Please run with pinned memory and see whether the rate increases significantly. If it does, that would indicate that for pageable copies the bottleneck is the CPU copy between the driver's pinned bounce buffer and the final pageable buffer. You can look up your EPYC processors and see if you can find the maximum per-core copy bandwidth; the copy performance may also scale with the achieved CPU core frequency.
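
For a quick comparison, something along these lines should work with the same bandwidthTest binary you already built (path may differ on your systems):

  # Pageable host memory (what you measured): the driver stages each copy through a pinned bounce buffer
  ./bandwidthTest --memory=pageable

  # Pinned (page-locked) host memory: the GPU can DMA directly to/from the user buffer
  ./bandwidthTest --memory=pinned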

  3. What tools or commands do you recommend to verify the GPU’s PCIe lane width and speed reliably?

nvidia-smi should report this correctly. Please note that on some NVIDIA GPUs the driver dynamically changes the PCIe generation and link width to save power. I don’t think this is done on the workstation GPUs, but it is best to run the query while you are executing the bandwidth test.
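
For example, in a second shell while bandwidthTest is running (the -l flag repeats the query every N seconds):

  # Re-check the negotiated link generation and width once per second during the transfer test
  nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv -l 1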
