PCIe bandwidth issue: H2D extremely slow (~0.4 GB/s), but D2H reaches Gen4 speeds

Hello,

I am running deep learning training on a server and noticed that my DataLoader is very slow. After some checks, I found strange PCIe bandwidth behavior.

Environment:

  • Server: WS940

  • CPU: supports 64 PCIe lanes

  • Motherboard: PCIe Gen4 supported

  • GPU: NVIDIA GeForce RTX 3090 ×2 + RTX A6000 ×2

  • CUDA version: 12.4

  • Driver: [please fill in your driver version]

  • OS: Ubuntu 24.04

Before testing, I set SLOT1, SLOT3, SLOT5, and SLOT7 to Gen4 in the BIOS and enabled link Re-Train.

Observation:

  • Before loading, lspci reports LnkSta Speed = 2.5 GT/s (see the snippet after this list)

  • During heavy load, the link switches to 16 GT/s (Gen4), which looks normal

  • However, Host to Device (H2D) bandwidth is extremely slow, while Device to Host (D2H) is fast
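
For reference, the link state can be polled directly from the OS while load is applied; a minimal sketch, assuming the GPU sits at PCI address 01:00.0 (a placeholder; substitute your GPU's actual bus address):

# 01:00.0 is a placeholder bus ID; find yours with: nvidia-smi --query-gpu=pci.bus_id --format=csv
# LnkCap shows what the slot can negotiate; LnkSta shows what is currently trained
watch -n 1 "sudo lspci -s 01:00.0 -vvv | grep -E 'LnkCap:|LnkSta:'"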

CUDA Bandwidth Test results:

./Samples/1_Utilities/bandwidthTest/bandwidthTest --memory=pinned --device=0

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA GeForce RTX 3090
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     0.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     26.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     699.9

Result = PASS
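
To narrow this down, the same sample can sweep transfer sizes and compare the pageable path against pinned; the flags below are standard bandwidthTest options from the CUDA samples, shown here as a sketch:

# Sweep H2D sizes: if every size is ~0.4 GB/s, the limit is the link, not the copy size
./Samples/1_Utilities/bandwidthTest/bandwidthTest --device=0 --memory=pinned --htod --mode=range --start=1000000 --end=128000000 --increment=16000000

# Compare pageable (staged) H2D: if this is equally slow, pinned-memory handling is not the cause
./Samples/1_Utilities/bandwidthTest/bandwidthTest --device=0 --memory=pageable --htod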

nvidia-smi topo -m:

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     0-23            0               N/A
GPU1    SYS      X      NV4     SYS     0-23            0               N/A
GPU2    SYS     NV4      X      SYS     0-23            0               N/A
GPU3    SYS     SYS     SYS      X      0-23            0               N/A
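
For completeness, the path above each GPU can be inspected the same way, in case a bridge rather than the GPU's own link is the limiter; <bridge_bdf> below is a placeholder for an address taken from the tree:

# Map each GPU index to its PCI bus address
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv

# Show the PCIe tree to find the root port / bridges above each GPU
lspci -tv

# Inspect the downstream port above the GPU the same way as the GPU itself
sudo lspci -s <bridge_bdf> -vvv | grep -E 'LnkCap:|LnkSta:'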

Question:

Why is H2D bandwidth stuck at ~0.4 GB/s (below even Gen1 x16 rates) while D2H reaches Gen4 speeds (~26 GB/s)?

  • Could this be a hardware issue (motherboard, CPU lanes, PCIe slot)?

  • Or is it related to driver, CUDA, or BIOS settings?

  • Any suggestions for further debugging?

Thanks in advance for your help!

PCIe Link State Reporting in nvidia-smi

nvidia-smi reports both the maximum supported PCIe generation and the currently negotiated link state. According to NVIDIA’s documentation, pcie.link.gen.current reflects the active link state and is reduced by PCIe power management when the GPU is idle.

Under sustained load, the link should train to its maximum negotiated generation. If it remains at Gen1 while large H2D transfers are occurring, that indicates a configuration or link negotiation constraint (e.g., BIOS PCIe settings, slot topology, or training behavior).


To observe the current PCIe link state:

nvidia-smi --query-gpu=pcie.link.gen.max,pcie.link.gen.current,pcie.link.width.max,pcie.link.width.current --format=csv,noheader -l 1

Run this while your existing bandwidth test is executing to see if the link transitions and holds at its maximum generation.
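
As a concrete sketch of that, the test can be launched in the background while the link state is logged each second (the sample path and the 30-second logging window are assumptions; adjust to your setup):

# Start the H2D-only test in the background, then log negotiated link state once per second
./Samples/1_Utilities/bandwidthTest/bandwidthTest --memory=pinned --device=0 --htod --mode=shmoo &
timeout 30 nvidia-smi --query-gpu=timestamp,pcie.link.gen.current,pcie.link.width.current,utilization.gpu --format=csv -l 1 | tee linkstate.log
wait

If the log shows the link holding at Gen4 x16 while H2D stays near 0.4 GB/s, the negotiated link itself is not the limiter, and attention shifts to the path above the GPU (root port, chipset lanes, or IOMMU configuration).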

For direct correlation between negotiated link state and measured pinned host↔device throughput, I created a PCIe transport validation tool:

https://github.com/parallelArchitect/gpu-pcie-path-validator

It reports negotiated link state and pinned transfer bandwidth in a single diagnostic run.