Asymmetric PCIe bandwidth in bidirectional transfers: H2D drops 56% while D2H maintains performance

I’m observing asymmetric bandwidth behavior when running bidirectional Host-Device memory transfers using nvbandwidth:

  • Unidirectional: Both H2D and D2H achieve ~25 GB/s ✅
  • Bidirectional: D2H maintains ~24 GB/s, but H2D drops to ~11 GB/s (56% reduction) ❌

Since PCIe 4.0 is full-duplex, I expected both directions to maintain near-peak performance simultaneously. Is this expected behavior, or is there a bottleneck I should investigate?


Environment

Hardware:

  • GPU: 8× NVIDIA A100-SXM4-80GB
  • CPU: Dual-socket (NUMA topology below)
  • Memory: DDR4 8-channel per socket
  • Interconnect: PCIe 4.0 x16

Software:

  • nvbandwidth: v0.8
  • CUDA Runtime: 12.6.0
  • CUDA Driver: 12.6.0
  • Driver Version: 560.35.05
  • OS: Linux

GPU Topology (nvidia-smi topo -m):

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA
GPU0     X      PXB     NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0
GPU1    PXB      X      NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0
GPU2    NODE    NODE     X      PXB     SYS     SYS     SYS     SYS     0-31,64-95      0
GPU3    NODE    NODE    PXB      X      SYS     SYS     SYS     SYS     0-31,64-95      0
GPU4    SYS     SYS     SYS     SYS      X      PXB     NODE    NODE    32-63,96-127    1
GPU5    SYS     SYS     SYS     SYS     PXB      X      NODE    NODE    32-63,96-127    1
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      PXB     32-63,96-127    1
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    PXB      X      32-63,96-127    1

Test Results

1. Unidirectional H2D (Baseline) ✅

$ ./nvbandwidth -t host_to_device_memcpy_ce

Result: ~24.8 GB/s per GPU

memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
           0         1         2         3         4         5         6         7
 0     24.81     24.80     24.84     24.83     24.88     24.83     24.84     24.88

SUM: 198.72 GB/s

2. Unidirectional D2H (Baseline) ✅

$ ./nvbandwidth -t device_to_host_memcpy_ce

Result: ~25.9 GB/s per GPU

memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
           0         1         2         3         4         5         6         7
 0     25.89     25.88     25.89     25.89     25.89     25.89     25.89     25.90

SUM: 207.13 GB/s

3. Bidirectional D2H ✅

$ ./nvbandwidth -t device_to_host_bidirectional_memcpy_ce

Result: ~24.2 GB/s per GPU in the D2H direction with concurrent H2D traffic (performance maintained)

memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
           0         1         2         3         4         5         6         7
 0     24.25     24.26     24.24     24.26     24.27     24.26     24.27     24.27

SUM: 194.07 GB/s

4. Bidirectional H2D ❌ ISSUE

$ ./nvbandwidth -t host_to_device_bidirectional_memcpy_ce

Result: ~11 GB/s per GPU in the H2D direction with concurrent D2H traffic (≈56% drop from the ~24.8 GB/s unidirectional baseline)


Analysis

Observations:

  1. D2H bandwidth remains stable in bidirectional mode (~24 GB/s)
  2. H2D bandwidth drops significantly in bidirectional mode (~11 GB/s)
  3. Memory bandwidth is not saturated (verified with pcm-memory)
  4. PCIe 4.0 x16 theoretical: ~32 GB/s per direction (full-duplex)
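
For reference, the per-direction ceiling works out as follows (raw signaling rate before packet/header overhead, so the ~25-26 GB/s measured unidirectionally is already close to the practical peak):

16 GT/s per lane × (128/130 encoding) ÷ 8 bits/byte ≈ 1.97 GB/s per lane
1.97 GB/s per lane × 16 lanes ≈ 31.5 GB/s per direction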

Expected behavior:
Both H2D and D2H should maintain ~24-25 GB/s simultaneously in bidirectional mode.


Questions

  1. Is this asymmetry expected on A100 + PCIe 4.0 platforms?

  2. What could cause H2D throttling while D2H maintains performance?

    • PCIe root complex read/write arbitration?
    • CPU memory controller behavior?
    • CUDA driver scheduling policy?
    • IOMMU overhead?
  3. Diagnostic steps: What tools or tests would help identify the bottleneck? (The checks I can run locally are sketched just after this list.)

    • nvidia-smi metrics to monitor?
    • PCIe link utilization tools?
    • Kernel/BIOS parameters to check?
  4. Mitigation: Are there configuration changes that could improve H2D bidirectional performance?
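
For reference, these are the checks I can run on this system if they would help (treat the exact commands as a sketch, and <GPU bus ID> as a placeholder for the PCI address reported by nvidia-smi):

$ nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv   # confirm each GPU is negotiating Gen4 x16
$ nvidia-smi dmon -s t                              # per-GPU PCIe Rx/Tx throughput, sampled while the test runs
$ sudo lspci -vv -s <GPU bus ID> | grep -i lnksta   # link state from the PCIe side (expect "Speed 16GT/s, Width x16")
$ dmesg | grep -iE 'iommu|dmar'                     # whether the IOMMU is enabled or in passthrough mode
$ cat /proc/cmdline                                 # kernel parameters (iommu=, intel_iommu=, amd_iommu=, ...)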


Additional Context

  • Standard BIOS settings (no special PCIe tuning)
  • Default CPU affinity (no manual pinning)
  • Issue reproducible across all 8 GPUs
  • Consistent across multiple test runs

Any insights would be greatly appreciated. Thank you!

Related GitHub Issue: https://github.com/NVIDIA/nvbandwidth/issues/53

There might be some drop in the bidirectional case compared to the peak unidirectional measurement (like the ~5% you show), but I wouldn’t expect it to be ~50%. You don’t seem to actually show the tool output in that case.

The A100-SXM4 machines I am familiar with have their GPUs connected via NVLink, but your nvidia-smi topo output shows only PXB/NODE/SYS paths between GPUs and no NV# links. That is quite curious.
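
A quick way to check would be something like the following on the node itself (just a sanity check; I don't know your exact platform):

$ nvidia-smi nvlink --status   # on an SXM4 board this should list active NVLink lanes per GPU; empty or inactive output would match your topo matrix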

Beyond that, the only suggestion I have at the moment is to see if process pinning (e.g. via taskset or numactl) has any effect.
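
For example, something along these lines (NUMA node and CPU numbers taken from your topo output for GPUs 0-3; use node 1 / CPUs 32-63 for GPUs 4-7; untested on your machine, so treat it as a sketch):

$ numactl --cpunodebind=0 --membind=0 ./nvbandwidth -t host_to_device_bidirectional_memcpy_ce   # bind threads and host buffers to the GPU-local socket
$ taskset -c 0-31 ./nvbandwidth -t host_to_device_bidirectional_memcpy_ce                       # CPU-only pinning variant

If the H2D number recovers when the process is pinned to the GPU-local socket, that would point toward cross-socket traffic or memory-controller behavior rather than the PCIe link itself.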