PCIe bandwidth issue: H2D extremely slow (~0.4 GB/s), but D2H reaches Gen4 speeds

Hello,

I am running deep learning training on a server and noticed that my DataLoader is very slow. After some checks, I found strange PCIe bandwidth behavior.

Environment:

  • Server: WS940

  • CPU: supports 64 PCIe lanes

  • Motherboard: PCIe Gen4 supported

  • GPU: NVIDIA GeForce RTX 3090 ×2 + RTX A6000 ×2

  • CUDA version: 12.4

  • Driver: [please fill in your driver version]

  • OS: Ubuntu 24.04

Before testing, I set SLOT1, SLOT3, SLOT5, and SLOT7 to Gen4 in the BIOS and enabled link Re-Train.

Observation:

  • Before loading, lspci reports LnkSta Speed = 2.5 GT/s (see the snippet after this list)

  • During heavy load, the link switches to 16 GT/s (Gen4), which looks normal

  • However, Host to Device (H2D) bandwidth is extremely slow, while Device to Host (D2H) is fast
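
For reference, the link state can be polled directly from the OS while load is applied; a minimal sketch, assuming the GPU sits at PCI address 01:00.0 (a placeholder; substitute your GPU's actual bus address):

# 01:00.0 is a placeholder bus ID; find yours with: nvidia-smi --query-gpu=pci.bus_id --format=csv
# LnkCap shows what the slot can negotiate; LnkSta shows what is currently trained
watch -n 1 "sudo lspci -s 01:00.0 -vvv | grep -E 'LnkCap:|LnkSta:'"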

CUDA Bandwidth Test results:

./Samples/1_Utilities/bandwidthTest/bandwidthTest --memory=pinned --device=0

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA GeForce RTX 3090
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     0.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     26.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     699.9

Result = PASS
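
To narrow this down, the same sample can sweep transfer sizes and compare the pageable path against pinned; the flags below are standard bandwidthTest options from the CUDA samples, shown here as a sketch:

# Sweep H2D sizes: if every size is ~0.4 GB/s, the limit is the link, not the copy size
./Samples/1_Utilities/bandwidthTest/bandwidthTest --device=0 --memory=pinned --htod --mode=range --start=1000000 --end=128000000 --increment=16000000

# Compare pageable (staged) H2D: if this is equally slow, pinned-memory handling is not the cause
./Samples/1_Utilities/bandwidthTest/bandwidthTest --device=0 --memory=pageable --htod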

nvidia-smi topo -m:

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     0-23            0               N/A
GPU1    SYS      X      NV4     SYS     0-23            0               N/A
GPU2    SYS     NV4      X      SYS     0-23            0               N/A
GPU3    SYS     SYS     SYS      X      0-23            0               N/A
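
For completeness, the path above each GPU can be inspected the same way, in case a bridge rather than the GPU's own link is the limiter; <bridge_bdf> below is a placeholder for an address taken from the tree:

# Map each GPU index to its PCI bus address
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv

# Show the PCIe tree to find the root port / bridges above each GPU
lspci -tv

# Inspect the downstream port above the GPU the same way as the GPU itself
sudo lspci -s <bridge_bdf> -vvv | grep -E 'LnkCap:|LnkSta:'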

Question:

Why is H2D bandwidth stuck at ~0.4 GB/s (below even Gen1 x16 rates) while D2H reaches Gen4 speeds (~26 GB/s)?

  • Could this be a hardware issue (motherboard, CPU lanes, PCIe slot)?

  • Or is it related to driver, CUDA, or BIOS settings?

  • Any suggestions for further debugging?

Thanks in advance for your help!

PCIe Link State Reporting in nvidia-smi

nvidia-smi reports both the maximum supported PCIe generation and the currently negotiated link state. According to NVIDIA’s documentation, pcie.link.gen.current reflects the active link state and is reduced by PCIe power management when the GPU is idle.

Under sustained load, the link should train to its maximum negotiated generation. If it remains at Gen1 while large H2D transfers are occurring, that indicates a configuration or link negotiation constraint (e.g., BIOS PCIe settings, slot topology, or training behavior).


To observe the current PCIe link state:

nvidia-smi --query-gpu=pcie.link.gen.max,pcie.link.gen.current,pcie.link.width.max,pcie.link.width.current --format=csv,noheader -l 1

Run this while your existing bandwidth test is executing to see if the link transitions and holds at its maximum generation.
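
As a concrete sketch of that, the test can be launched in the background while the link state is logged each second (the sample path and the 30-second logging window are assumptions; adjust to your setup):

# Start the H2D-only test in the background, then log negotiated link state once per second
./Samples/1_Utilities/bandwidthTest/bandwidthTest --memory=pinned --device=0 --htod --mode=shmoo &
timeout 30 nvidia-smi --query-gpu=timestamp,pcie.link.gen.current,pcie.link.width.current,utilization.gpu --format=csv -l 1 | tee linkstate.log
wait

If the log shows the link holding at Gen4 x16 while H2D stays near 0.4 GB/s, the negotiated link itself is not the limiter, and attention shifts to the path above the GPU (root port, chipset lanes, or IOMMU configuration).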

For direct correlation between negotiated link state and measured pinned host↔device throughput, I created a PCIe transport validation tool:

https://github.com/parallelArchitect/gpu-pcie-path-validator

It reports negotiated link state and pinned transfer bandwidth in a single diagnostic run.