Hello,
I’m observing a significant bandwidth difference in CUDA memory transfers between two servers using the same NVIDIA RTX A4000 GPU model.
I ran the `bandwidthTest` CUDA sample (v12.2) with `--memory=pageable` on both machines; a side-by-side comparison is below.
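For reference, the invocation was essentially the following on both machines (the `./bandwidthTest` path is just an example and will differ per install):

```bash
# CUDA 12.2 bandwidthTest sample, using pageable host memory
./bandwidthTest --memory=pageable
```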
CUDA Bandwidth Test Results
| Transfer Direction | Machine A (H12SSL-CT / EPYC 7343) | Machine B (H13SSL-N / EPYC 9124) |
|---|---|---|
| Host → Device | 7.4 GB/s | 21.9 GB/s |
| Device → Host | 6.6 GB/s | 19.3 GB/s |
| Device → Device | 254.8 GB/s | 376.4 GB/s |
The test reported `Result = PASS` on both machines. Both were tested with pageable host memory, and the results are consistent across multiple runs.
Hardware Comparison
Machine A:
- CPU: AMD EPYC 7343 (16 cores, 32 threads, max 3.94 GHz)
- Motherboard: Supermicro H12SSL-CT
- SMBIOS: v3.3
Machine B:
- CPU: AMD EPYC 9124 (16 cores, 32 threads, max 3.0 GHz)
- Motherboard: Supermicro H13SSL-N
- SMBIOS: v3.5
Both have:
- 1× NVIDIA RTX A4000
- 32 logical CPUs
- Boost enabled
Questions
- What could explain the roughly 3× Host ↔ Device bandwidth difference?
- Could PCIe link speed or platform/chipset differences (e.g., Gen3 vs. Gen4) be the root cause?
- What tools or commands do you recommend for reliably verifying the GPU’s PCIe lane width and link speed? (My current approach is sketched after this list.)
- Any BIOS or firmware settings I should check that might be capping bandwidth?
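For the third question, this is roughly what I was planning to run to check the negotiated link (assuming these `nvidia-smi` query fields are supported by my driver version; the `01:00.0` bus ID is just a placeholder):

```bash
# Current vs. maximum PCIe generation and lane width as reported by the driver
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv

# Cross-check with the kernel's view: LnkCap = capability, LnkSta = negotiated state
# (replace 01:00.0 with the GPU's bus ID from `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`)
sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'
```

One caveat I’m aware of: the link can downtrain when the GPU is idle, so I’d check the current values while a transfer is actually in flight. Is that sufficient, or is there a more reliable method?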
Thanks in advance