Asymmetric PCIe bandwidth in bidirectional transfers: H2D drops 56% while D2H maintains performance

I’m observing asymmetric bandwidth behavior when running bidirectional Host-Device memory transfers using nvbandwidth:

  • Unidirectional: Both H2D and D2H achieve ~25 GB/s ✅
  • Bidirectional: D2H maintains ~24 GB/s, but H2D drops to ~11 GB/s (56% reduction) ❌

Since PCIe 4.0 is full-duplex, I expected both directions to maintain near-peak performance simultaneously. Is this expected behavior, or is there a bottleneck I should investigate?


Environment

Hardware:

  • GPU: 8× NVIDIA A100-SXM4-80GB
  • CPU: Dual-socket (NUMA topology below)
  • Memory: DDR4 8-channel per socket
  • Interconnect: PCIe 4.0 x16

Software:

  • nvbandwidth: v0.8
  • CUDA Runtime: 12.6.0
  • CUDA Driver: 12.6.0
  • Driver Version: 560.35.05
  • OS: Linux

GPU Topology (nvidia-smi topo -m):

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA
GPU0     X      PXB     NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0
GPU1    PXB      X      NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0
GPU2    NODE    NODE     X      PXB     SYS     SYS     SYS     SYS     0-31,64-95      0
GPU3    NODE    NODE    PXB      X      SYS     SYS     SYS     SYS     0-31,64-95      0
GPU4    SYS     SYS     SYS     SYS      X      PXB     NODE    NODE    32-63,96-127    1
GPU5    SYS     SYS     SYS     SYS     PXB      X      NODE    NODE    32-63,96-127    1
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      PXB     32-63,96-127    1
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    PXB      X      32-63,96-127    1

Test Results

1. Unidirectional H2D (Baseline) ✅

$ ./nvbandwidth -t host_to_device_memcpy_ce

Result: ~24.8 GB/s per GPU

memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
           0         1         2         3         4         5         6         7
 0     24.81     24.80     24.84     24.83     24.88     24.83     24.84     24.88

SUM: 198.72 GB/s

2. Unidirectional D2H (Baseline) ✅

$ ./nvbandwidth -t device_to_host_memcpy_ce

Result: ~25.9 GB/s per GPU

memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
           0         1         2         3         4         5         6         7
 0     25.89     25.88     25.89     25.89     25.89     25.89     25.89     25.90

SUM: 207.13 GB/s

3. Bidirectional D2H ✅

$ ./nvbandwidth -t device_to_host_bidirectional_memcpy_ce

Result: ~24.2 GB/s per GPU in the D2H direction with concurrent H2D traffic (performance maintained)

memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
           0         1         2         3         4         5         6         7
 0     24.25     24.26     24.24     24.26     24.27     24.26     24.27     24.27

SUM: 194.07 GB/s

4. Bidirectional H2D ❌ ISSUE

$ ./nvbandwidth -t host_to_device_bidirectional_memcpy_ce

Result: ~11 GB/s per GPU in the H2D direction with concurrent D2H traffic (≈56% drop from the ~24.8 GB/s unidirectional baseline)


Analysis

Observations:

  1. D2H bandwidth remains stable in bidirectional mode (~24 GB/s)
  2. H2D bandwidth drops significantly in bidirectional mode (~11 GB/s)
  3. Memory bandwidth is not saturated (verified with pcm-memory)
  4. PCIe 4.0 x16 theoretical: ~32 GB/s per direction (full-duplex)
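
For reference, the per-direction ceiling works out as follows (raw signaling rate before packet/header overhead, so the ~25-26 GB/s measured unidirectionally is already close to the practical peak):

16 GT/s per lane × (128/130 encoding) ÷ 8 bits/byte ≈ 1.97 GB/s per lane
1.97 GB/s per lane × 16 lanes ≈ 31.5 GB/s per direction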

Expected behavior:
Both H2D and D2H should maintain ~24-25 GB/s simultaneously in bidirectional mode.


Questions

  1. Is this asymmetry expected on A100 + PCIe 4.0 platforms?

  2. What could cause H2D throttling while D2H maintains performance?

    • PCIe root complex read/write arbitration?
    • CPU memory controller behavior?
    • CUDA driver scheduling policy?
    • IOMMU overhead?
  3. Diagnostic steps: What tools or tests would help identify the bottleneck? (The checks I can run locally are sketched just after this list.)

    • nvidia-smi metrics to monitor?
    • PCIe link utilization tools?
    • Kernel/BIOS parameters to check?
  4. Mitigation: Are there configuration changes that could improve H2D bidirectional performance?
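
For reference, these are the checks I can run on this system if they would help (treat the exact commands as a sketch, and <GPU bus ID> as a placeholder for the PCI address reported by nvidia-smi):

$ nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv   # confirm each GPU is negotiating Gen4 x16
$ nvidia-smi dmon -s t                              # per-GPU PCIe Rx/Tx throughput, sampled while the test runs
$ sudo lspci -vv -s <GPU bus ID> | grep -i lnksta   # link state from the PCIe side (expect "Speed 16GT/s, Width x16")
$ dmesg | grep -iE 'iommu|dmar'                     # whether the IOMMU is enabled or in passthrough mode
$ cat /proc/cmdline                                 # kernel parameters (iommu=, intel_iommu=, amd_iommu=, ...)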


Additional Context

  • Standard BIOS settings (no special PCIe tuning)
  • Default CPU affinity (no manual pinning)
  • Issue reproducible across all 8 GPUs
  • Consistent across multiple test runs

Any insights would be greatly appreciated. Thank you!

Related GitHub Issue: https://github.com/NVIDIA/nvbandwidth/issues/53

There might be some drop in the bidirectional case compared to the peak unidirectional measurement (like the ~5% you show), but I wouldn’t expect it to be ~50%. You don’t seem to actually show the tool output in that case.

The A100-SXM4 machines I am familiar with have their GPUs connected via NVLink, but your nvidia-smi topo output shows only PXB/NODE/SYS paths between GPUs and no NV# links. That is quite curious.
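
A quick way to check would be something like the following on the node itself (just a sanity check; I don't know your exact platform):

$ nvidia-smi nvlink --status   # on an SXM4 board this should list active NVLink lanes per GPU; empty or inactive output would match your topo matrix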

Beyond that, the only suggestion I have at the moment is to see if process pinning (e.g. via taskset or numactl) has any effect.
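
For example, something along these lines (NUMA node and CPU numbers taken from your topo output for GPUs 0-3; use node 1 / CPUs 32-63 for GPUs 4-7; untested on your machine, so treat it as a sketch):

$ numactl --cpunodebind=0 --membind=0 ./nvbandwidth -t host_to_device_bidirectional_memcpy_ce   # bind threads and host buffers to the GPU-local socket
$ taskset -c 0-31 ./nvbandwidth -t host_to_device_bidirectional_memcpy_ce                       # CPU-only pinning variant

If the H2D number recovers when the process is pinned to the GPU-local socket, that would point toward cross-socket traffic or memory-controller behavior rather than the PCIe link itself.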