I’m observing asymmetric bandwidth behavior when running bidirectional Host-Device memory transfers using nvbandwidth:
- Unidirectional: Both H2D and D2H achieve ~25 GB/s ✅
- Bidirectional: D2H maintains ~24 GB/s, but H2D drops to ~11 GB/s (56% reduction) ❌
Since PCIe 4.0 is full-duplex, I expected both directions to maintain near-peak performance simultaneously. Is this expected behavior, or is there a bottleneck I should investigate?
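For context, my understanding is that the bidirectional H2D test issues H2D copies on one stream while D2H copies run concurrently on a second stream (second copy engine). Below is a rough standalone sketch of that pattern — my own simplified approximation, not nvbandwidth's actual implementation; buffer size and iteration count are arbitrary:

```cpp
// Sketch: time H2D copies while D2H copies run concurrently on another stream.
// Simplified approximation of the bidirectional CE test, not nvbandwidth's code.
// Build: nvcc bidir_sketch.cu -o bidir_sketch
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    printf("CUDA error: %s (line %d)\n", cudaGetErrorString(e), __LINE__); return 1; } } while (0)

int main() {
    const size_t bytes = 512ull << 20;   // 512 MiB per buffer (arbitrary)
    const int iters = 20;                // arbitrary

    void *hSrc, *hDst, *dA, *dB;
    CHECK(cudaHostAlloc(&hSrc, bytes, cudaHostAllocDefault));  // pinned host memory
    CHECK(cudaHostAlloc(&hDst, bytes, cudaHostAllocDefault));
    CHECK(cudaMalloc(&dA, bytes));
    CHECK(cudaMalloc(&dB, bytes));

    cudaStream_t sH2D, sD2H;
    CHECK(cudaStreamCreate(&sH2D));
    CHECK(cudaStreamCreate(&sD2H));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Time the H2D direction while D2H traffic is enqueued on the other stream.
    CHECK(cudaEventRecord(start, sH2D));
    for (int i = 0; i < iters; ++i) {
        CHECK(cudaMemcpyAsync(dA, hSrc, bytes, cudaMemcpyHostToDevice, sH2D));
        CHECK(cudaMemcpyAsync(hDst, dB, bytes, cudaMemcpyDeviceToHost, sD2H));
    }
    CHECK(cudaEventRecord(stop, sH2D));
    CHECK(cudaDeviceSynchronize());

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("H2D with concurrent D2H: %.2f GB/s\n",
           (double)bytes * iters / (ms * 1e-3) / 1e9);
    return 0;
}
```

If this pattern reproduces the same drop, that would at least rule out anything specific to nvbandwidth itself.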
Environment
Hardware:
- GPU: 8× NVIDIA A100-SXM4-80GB
- CPU: Dual-socket (NUMA topology below)
- Memory: DDR4 8-channel per socket
- Interconnect: PCIe 4.0 x16
Software:
- nvbandwidth: v0.8
- CUDA Runtime: 12.6.0
- CUDA Driver: 12.6.0
- Driver Version: 560.35.05
- OS: Linux
GPU Topology (nvidia-smi topo -m):
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA
GPU0 X PXB NODE NODE SYS SYS SYS SYS 0-31,64-95 0
GPU1 PXB X NODE NODE SYS SYS SYS SYS 0-31,64-95 0
GPU2 NODE NODE X PXB SYS SYS SYS SYS 0-31,64-95 0
GPU3 NODE NODE PXB X SYS SYS SYS SYS 0-31,64-95 0
GPU4 SYS SYS SYS SYS X PXB NODE NODE 32-63,96-127 1
GPU5 SYS SYS SYS SYS PXB X NODE NODE 32-63,96-127 1
GPU6 SYS SYS SYS SYS NODE NODE X PXB 32-63,96-127 1
GPU7 SYS SYS SYS SYS NODE NODE PXB X 32-63,96-127 1
Test Results
1. Unidirectional H2D (Baseline) ✅
$ ./nvbandwidth -t host_to_device_memcpy_ce
Result: ~24.8 GB/s per GPU
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 24.81 24.80 24.84 24.83 24.88 24.83 24.84 24.88
SUM: 198.72 GB/s
2. Unidirectional D2H (Baseline) ✅
$ ./nvbandwidth -t device_to_host_memcpy_ce
Result: ~25.9 GB/s per GPU
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 25.89 25.88 25.89 25.89 25.89 25.89 25.89 25.90
SUM: 207.13 GB/s
3. Bidirectional D2H ✅
$ ./nvbandwidth -t device_to_host_bidirectional_memcpy_ce
Result: ~24.2 GB/s per GPU (maintains performance)
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 24.25 24.26 24.24 24.26 24.27 24.26 24.27 24.27
SUM: 194.07 GB/s
4. Bidirectional H2D ❌ ISSUE
$ ./nvbandwidth -t host_to_device_bidirectional_memcpy_ce
Result: ~11 GB/s per GPU (56% performance drop)
Analysis
Observations:
- D2H bandwidth remains stable in bidirectional mode (~24 GB/s)
- H2D bandwidth drops significantly in bidirectional mode (~11 GB/s)
- Memory bandwidth is not saturated (verified with pcm-memory)
- PCIe 4.0 x16 theoretical: ~32 GB/s per direction, full-duplex (16 GT/s × 16 lanes ≈ 32 GB/s raw, ~31.5 GB/s after 128b/130b encoding)
Expected behavior:
Both H2D and D2H should maintain ~24-25 GB/s simultaneously in bidirectional mode.
Questions
1. Is this asymmetry expected on A100 + PCIe 4.0 platforms?
2. What could cause H2D throttling while D2H maintains performance?
   - PCIe root complex read/write arbitration?
   - CPU memory controller behavior?
   - CUDA driver scheduling policy?
   - IOMMU overhead?
3. Diagnostic steps: What tools or tests would help identify the bottleneck? (A rough NVML polling sketch follows this list.)
   - nvidia-smi metrics to monitor?
   - PCIe link utilization tools?
   - Kernel/BIOS parameters to check?
4. Mitigation: Are there configuration changes that could improve H2D bidirectional performance?
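For question 3, one diagnostic I'm considering is polling NVML's PCIe throughput counters on the GPU under test while the bidirectional run is active, to see what the link itself reports per direction. A rough sketch, assuming GPU index 0 and that TX/RX are reported from the GPU's perspective (TX ≈ D2H, RX ≈ H2D); polling interval and sample count are arbitrary:

```cpp
// Poll NVML PCIe throughput counters on GPU 0 while the bandwidth test runs
// in another terminal. Sketch only; index, interval, and sample count are arbitrary.
// Build (paths may vary): g++ pcie_poll.cpp -o pcie_poll -I/usr/local/cuda/include -lnvidia-ml
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main() {
    if (nvmlInit_v2() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) {
        fprintf(stderr, "could not get handle for GPU 0\n");
        return 1;
    }
    for (int i = 0; i < 60; ++i) {                  // ~60 samples, roughly 1 s apart
        unsigned int txKBps = 0, rxKBps = 0;        // NVML reports KB/s
        nvmlDeviceGetPcieThroughput(dev, NVML_PCIE_UTIL_TX_BYTES, &txKBps);
        nvmlDeviceGetPcieThroughput(dev, NVML_PCIE_UTIL_RX_BYTES, &rxKBps);
        printf("GPU0 PCIe  TX: %6.2f GB/s   RX: %6.2f GB/s\n",
               txKBps / 1e6, rxKBps / 1e6);
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}
```

If RX collapses while TX stays near line rate during the bidirectional H2D test, that would suggest the link or root complex rather than the measurement itself.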
Additional Context
- Standard BIOS settings (no special PCIe tuning)
- Default CPU affinity (no manual pinning)
- Issue reproducible across all 8 GPUs
- Consistent across multiple test runs
Any insights would be greatly appreciated. Thank you!
Related GitHub Issue: https://github.com/NVIDIA/nvbandwidth/issues/53