Large CUDA Bandwidth Discrepancy on Identical RTX A4000 GPUs (EPYC 9124 vs. 7343, Supermicro H13SSL-N vs. H12SSL-CT)

Hello,

I’m observing a significant bandwidth difference in CUDA memory transfers between two servers using the same NVIDIA RTX A4000 GPU model.

I ran the bandwidthTest CUDA sample (v12.2) with --memory=pageable on both machines. Here’s a side-by-side comparison:

CUDA Bandwidth Test Results

  Transfer Direction   Machine A (H12SSL-CT / EPYC 7343)   Machine B (H13SSL-N / EPYC 9124)
  Host → Device        7.4 GB/s                            21.9 GB/s
  Device → Host        6.6 GB/s                            19.3 GB/s
  Device → Device      254.8 GB/s                          376.4 GB/s

Result = PASS on both machines
Both use pageable memory, and results are consistent across multiple runs.
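
For reference, the invocation looked like this on both machines (the binary name and location assume the CUDA 12.2 samples build tree; adjust the path for your install). A default run reports all three directions in one pass:

  # Pageable host memory; reports Host→Device, Device→Host, and Device→Device bandwidth
  ./bandwidthTest --memory=pageable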


Hardware Comparison

Machine A:

  • CPU: AMD EPYC 7343 (16 cores, 32 threads, max 3.94 GHz)
  • Motherboard: Supermicro H12SSL-CT
  • SMBIOS: v3.3

Machine B:

  • CPU: AMD EPYC 9124 (16 cores, 32 threads, max 3.0 GHz)
  • Motherboard: Supermicro H13SSL-N
  • SMBIOS: v3.5

Both have:

  • 1× NVIDIA RTX A4000
  • 32 logical CPUs
  • Boost enabled

Questions

  1. What could explain the 3× Host ↔ Device bandwidth difference?
  2. Could PCIe link speed or platform/chipset differences (e.g., Gen3 vs. Gen4) be the root cause?
  3. What tools or commands do you recommend to verify the GPU’s PCIe lane width and speed reliably?
  4. Any BIOS or firmware settings I should check that might be capping bandwidth?

Thanks in advance

  2. Could PCIe link speed or platform/chipset differences (e.g., Gen3 vs. Gen4) be the root cause?

Yes. You can run nvidia-smi -q -d PCI to see the current and maximum link generation and width the GPU has negotiated.
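
For example (field names as in recent nvidia-smi releases; the exact output layout can vary by driver version):

  # Full PCI section of the device query, including GPU Link Info (current/max generation and width)
  nvidia-smi -q -d PCI

  # Or just the link fields in CSV form
  nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv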

The other issue could be that you are using pageable memory. Please run with pinned memory and see whether the rate increases significantly. If it does, that would indicate that for pageable copies the bottleneck is the CPU copy between the driver's pinned bounce buffer and the final pageable buffer. You can look up your EPYC processors and see if you can find the maximum per-core copy bandwidth; the copy performance may also scale with the achieved CPU core frequency.
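
For a quick comparison, something along these lines should work with the same bandwidthTest binary you already built (path may differ on your systems):

  # Pageable host memory (what you measured): the driver stages each copy through a pinned bounce buffer
  ./bandwidthTest --memory=pageable

  # Pinned (page-locked) host memory: the GPU can DMA directly to/from the user buffer
  ./bandwidthTest --memory=pinned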

  3. What tools or commands do you recommend to verify the GPU’s PCIe lane width and speed reliably?

nvidia-smi should report this correctly. Please note that on some NVIDIA GPUs the driver dynamically changes the PCIe generation and link width to save power. I don’t think this is done on the workstation GPUs, but it is best to run the query while you are executing the bandwidth test.
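
For example, in a second shell while bandwidthTest is running (the -l flag repeats the query every N seconds):

  # Re-check the negotiated link generation and width once per second during the transfer test
  nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv -l 1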
