PCIe bandwidth issue: H2D extremely slow (~0.4 GB/s), but D2H reaches Gen4 speed

Hello,

I am running deep learning training on a server and noticed that my DataLoader is very slow. After some checks, I found strange PCIe bandwidth behavior.

Environment:

  • Server: WS940

  • CPU: supports PCIe 4.0

  • Motherboard: supports PCIe 4.0 (Gen4)

  • GPU: NVIDIA GeForce RTX 3090 ×2 + RTX A6000 ×2

  • CUDA version: 12.4

  • Driver: [please fill in your driver version]

  • OS: Ubuntu 24.04

Before testing, I set SLOT1/3/5/7 to Gen4 in the BIOS and enabled link re-training.

Observation:

  • Before loading, LnkSta Speed = 2.5 GT/s (checked with the commands shown after this list)

  • During heavy load, it switches to 16 GT/s (Gen4), which seems normal

  • However, Host to Device (H2D) bandwidth is extremely slow, while Device to Host (D2H) is fast
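
For reference, these are the commands I use to read the link state; <bus_id> below is a placeholder for the 3090's PCI address from lspci:

sudo lspci -vv -s <bus_id> | grep -E 'LnkCap:|LnkSta:'
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv -i 0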

CUDA Bandwidth Test results:

./Samples/1_Utilities/bandwidthTest/bandwidthTest --memory=pinned --device=0

[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: NVIDIA GeForce RTX 3090
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
  Transfer Size (Bytes)        Bandwidth(GB/s)
  32000000                     0.4

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
  Transfer Size (Bytes)        Bandwidth(GB/s)
  32000000                     26.3

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
  Transfer Size (Bytes)        Bandwidth(GB/s)
  32000000                     699.9

Result = PASS
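
If it helps, here is a minimal PyTorch sketch I can use to cross-check the pinned-memory transfer rates outside the CUDA samples (it assumes PyTorch with CUDA is available; the 256 MB buffer size is arbitrary):

import torch

def copy_bw(src, dst, n_iter=20):
    # Time n_iter async copies on the default stream with CUDA events
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    dst.copy_(src, non_blocking=True)   # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(n_iter):
        dst.copy_(src, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000.0          # elapsed_time is in ms
    total_bytes = src.numel() * src.element_size() * n_iter
    return total_bytes / seconds / 1e9                   # GB/s

nbytes = 256 * 1024 * 1024                               # 256 MB buffer (arbitrary)
host = torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)
dev = torch.empty(nbytes, dtype=torch.uint8, device='cuda:0')

print(f"H2D: {copy_bw(host, dev):.1f} GB/s")
print(f"D2H: {copy_bw(dev, host):.1f} GB/s")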

nvidia-smi topo -m:

      GPU0   GPU1   GPU2   GPU3   CPU Affinity   NUMA Affinity   GPU NUMA ID
GPU0   X     SYS    SYS    SYS    0-23           0               N/A
GPU1  SYS     X     NV4    SYS    0-23           0               N/A
GPU2  SYS    NV4     X     SYS    0-23           0               N/A
GPU3  SYS    SYS    SYS     X     0-23           0               N/A
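
Since every GPU reports CPU affinity 0-23 and NUMA node 0, I could also rerun the bandwidth test bound to that node to rule out cross-node host memory (assuming numactl is installed):

numactl --cpunodebind=0 --membind=0 ./Samples/1_Utilities/bandwidthTest/bandwidthTest --memory=pinned --device=0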

Question:

Why is the H2D bandwidth stuck at ~0.4 GB/s (below even PCIe Gen1 x16) while D2H reaches Gen4-level throughput (~26 GB/s)?

  • Could this be a hardware issue (motherboard, CPU lanes, PCIe slot)?

  • Or is it related to driver, CUDA, or BIOS settings?

  • Any suggestions for further debugging?

Thanks in advance for your help!