Hello,
I am running deep learning training on a server and noticed that my DataLoader is very slow. After some investigation, I found some strange PCIe bandwidth behavior.
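For context, the batches come out of a DataLoader with pin_memory=True and are moved to the GPU with a non-blocking copy. Below is a minimal sketch of how I time that path on its own (the dummy dataset, batch size, and worker count are placeholders, not my real training config):

import time
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda:0")                       # the RTX 3090
data = TensorDataset(torch.randn(2048, 3, 224, 224))  # dummy data, ~1.2 GB total
loader = DataLoader(data, batch_size=64, num_workers=4, pin_memory=True)

torch.cuda.synchronize(device)
total_bytes, t0 = 0, time.perf_counter()
for (batch,) in loader:
    batch = batch.to(device, non_blocking=True)       # the H2D copy under test
    total_bytes += batch.numel() * batch.element_size()
torch.cuda.synchronize(device)
elapsed = time.perf_counter() - t0
# Includes worker startup and collation, so this is a lower bound on raw H2D speed.
print(f"effective H2D throughput: {total_bytes / elapsed / 1e9:.1f} GB/s")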
Environment:
- Server: WS940
- CPU: supports 64 PCIe lanes
- Motherboard: PCIe 4.0 (Gen4) supported
- GPU: NVIDIA GeForce RTX 3090 *2 + A6000 *2
- CUDA version: 12.4
- Driver: [please fill in your driver version]
- OS: Ubuntu 24.04
Before testing, I set SLOT1/3/5/7 to Gen4 in the BIOS and enabled Re-Train.
Observation:
- Before any load, lspci reports LnkSta Speed = 2.5 GT/s
- Under heavy load it switches to 16 GT/s (Gen4), which looks normal (see the polling sketch after this list)
- However, Host to Device (H2D) bandwidth is extremely slow, while Device to Host (D2H) is fast
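For the link-state check I poll lspci while a copy is in flight. A rough sketch (the bus address 01:00.0 is only a placeholder for the 3090's real address; it needs root so the PCIe capability registers are readable):

import re
import subprocess
import time

BDF = "01:00.0"  # placeholder: substitute the GPU's bus ID from lspci / nvidia-smi

def link_status(bdf: str) -> str:
    # lspci -vv prints the PCIe capability, including the LnkSta line
    out = subprocess.run(["lspci", "-s", bdf, "-vv"],
                         capture_output=True, text=True, check=True).stdout
    m = re.search(r"LnkSta:\s*(.*)", out)
    return m.group(1) if m else "LnkSta not found"

# Poll once per second while bandwidthTest or the training job runs in parallel,
# to see whether the link re-trains to 16 GT/s x16 or stays degraded.
for _ in range(10):
    print(link_status(BDF))
    time.sleep(1)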
CUDA Bandwidth Test results:

./Samples/1_Utilities/bandwidthTest/bandwidthTest --memory=pinned --device=0

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA GeForce RTX 3090
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)   Bandwidth(GB/s)
   32000000                0.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)   Bandwidth(GB/s)
   32000000                26.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)   Bandwidth(GB/s)
   32000000                699.9

Result = PASS
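Since the run above only covers device 0, I am also planning to sweep all four GPUs with the same 32 MB transfer size to see whether the asymmetry is limited to the 3090's slot. A rough PyTorch sketch (wall-clock timing, so the numbers are approximate):

import time
import torch

NBYTES = 32_000_000  # same transfer size bandwidthTest uses

def measure(device: torch.device, src: torch.Tensor, dst: torch.Tensor) -> float:
    dst.copy_(src)                       # warm-up copy (also initializes the context)
    torch.cuda.synchronize(device)
    t0 = time.perf_counter()
    dst.copy_(src, non_blocking=True)
    torch.cuda.synchronize(device)
    return NBYTES / (time.perf_counter() - t0) / 1e9

for i in range(torch.cuda.device_count()):
    dev = torch.device(f"cuda:{i}")
    host = torch.empty(NBYTES, dtype=torch.uint8).pin_memory()  # pinned host buffer
    gpu = torch.empty(NBYTES, dtype=torch.uint8, device=dev)
    print(f"{torch.cuda.get_device_name(i)} (cuda:{i}): "
          f"H2D {measure(dev, host, gpu):.1f} GB/s, "
          f"D2H {measure(dev, gpu, host):.1f} GB/s")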
nvidia-smi topo -m:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS 0-23 0 N/A
GPU1 SYS X NV4 SYS 0-23 0 N/A
GPU2 SYS NV4 X SYS 0-23 0 N/A
GPU3 SYS SYS SYS X 0-23 0 N/A
Question:
Why is H2D bandwidth stuck at ~0.4 GB/s (below even Gen1 x16, which would be about 4 GB/s) while D2H reaches the expected Gen4 speed (~26 GB/s)?
- Could this be a hardware issue (motherboard, CPU lanes, PCIe slot)?
- Or is it related to driver, CUDA, or BIOS settings?
- Any suggestions for further debugging?
Thanks in advance for your help!