Bandwidth contention of concurrent H2D & D2H memory copy

user103333 · April 17, 2023, 5:34am

I found that concurrent H2D and D2H memory copies contend for bandwidth.

For example, on my system (x16 PCIe 3.0 + RTX 3080), the H2D and D2H bandwidths are each around 12 GB/s when measured separately:

> ./test_bandwidth D2H
Device-to-Host Bandwidth: 11.9104 GB/s

> ./test_bandwidth H2D
Host-to-Device Bandwidth: 12.2586 GB/s

But if I run both memory copies concurrently in two CUDA streams:

> ./test_bandwidth Concurrent
Host-to-Device Bandwidth: 9.4521 GB/s
Device-to-Host Bandwidth: 9.45441 GB/s

Both bandwidths are reduced by roughly 21 to 23%. Where does the contention come from?

Robert_Crovella · April 17, 2023, 6:59pm

Various pieces of information that you have not provided may matter. Here is a recent thread that discusses something similar. You might get better assistance or responses by providing more information.

njuffa · April 17, 2023, 8:36pm

Contention can arise in the PCIe interconnect, any applicable inter-processor interconnect (with a single CPU, this can still apply when the CPU is based on chiplets internally), and the system memory.

In my experience, some throughput reduction is normal when driving PCIe traffic to and from the GPU at maximum rate in both directions at once, but typically it is less than the 22% reduction seen here (18.9 GB/sec combined = 78.2% of the 24.2 GB/sec sum of the uni-directional rates).

Assuming that this is a straightforward single-socket system and that a sufficiently large transfer size was chosen for the bandwidth test, the uni-directional bandwidth already seems a tad on the low side. Ideally this should be closer to 13 GB/sec.

My best guess is that this is a case of a host system with an older CPU and / or relatively low system memory bandwidth (i.e. a small number of DDR4 channels) and / or system memory comprised of a lower speed grade of DRAM (less than DDR4-3200).

What is the CPU model used in this system? What speed grade of DDR4 is used? Are all DIMM slots populated? What transfer size was used in the bandwidth test? You would want to use a transfer size >= 16MB for maximum throughput. Is test_bandwidth your own creation or code taken from NVIDIA’s sample apps or from an open-source repository?
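
For anyone who wants to reproduce the measurement, here is a minimal sketch of a bi-directional copy test. It is not the source of test_bandwidth, which was not posted in this thread. The file name, the 256 MiB transfer size, the repetition count, and all helper names are assumptions made for illustration. It uses pinned host memory and one CUDA stream per direction, and it assumes a GPU with at least two copy engines so that the two directions can actually overlap.

```cuda
// Hypothetical bidir_bw.cu: a sketch, NOT the source of test_bandwidth.
// Build (assumed): nvcc -O2 -o bidir_bw bidir_bw.cu
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s: %s\n", #call,                        \
                         cudaGetErrorString(err_));                        \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

static const size_t kBytes = size_t(256) << 20;  // 256 MiB per copy (assumed size)
static const int kReps = 20;                     // copies per direction (assumed)

// Pinned host buffers and device buffers, one pair per direction.
static void *h_src, *h_dst, *d_src, *d_dst;
static cudaStream_t s_h2d, s_d2h;

// Enqueue kReps copies in each enabled direction, then print GB/s per direction.
static void measure(const char *mode, bool h2d, bool d2h) {
    cudaEvent_t beg[2], end[2];
    for (int i = 0; i < 2; ++i) {
        CHECK(cudaEventCreate(&beg[i]));
        CHECK(cudaEventCreate(&end[i]));
    }
    if (h2d) CHECK(cudaEventRecord(beg[0], s_h2d));
    if (d2h) CHECK(cudaEventRecord(beg[1], s_d2h));
    for (int r = 0; r < kReps; ++r) {  // interleave so both streams start together
        if (h2d) CHECK(cudaMemcpyAsync(d_dst, h_src, kBytes,
                                       cudaMemcpyHostToDevice, s_h2d));
        if (d2h) CHECK(cudaMemcpyAsync(h_dst, d_src, kBytes,
                                       cudaMemcpyDeviceToHost, s_d2h));
    }
    if (h2d) CHECK(cudaEventRecord(end[0], s_h2d));
    if (d2h) CHECK(cudaEventRecord(end[1], s_d2h));
    CHECK(cudaDeviceSynchronize());

    const double gb = double(kBytes) * kReps / 1e9;  // total GB moved per direction
    float ms = 0.0f;
    if (h2d) {
        CHECK(cudaEventElapsedTime(&ms, beg[0], end[0]));
        std::printf("[%s] Host-to-Device Bandwidth: %.2f GB/s\n", mode, gb / (ms / 1e3));
    }
    if (d2h) {
        CHECK(cudaEventElapsedTime(&ms, beg[1], end[1]));
        std::printf("[%s] Device-to-Host Bandwidth: %.2f GB/s\n", mode, gb / (ms / 1e3));
    }
    for (int i = 0; i < 2; ++i) {
        CHECK(cudaEventDestroy(beg[i]));
        CHECK(cudaEventDestroy(end[i]));
    }
}

int main() {
    int engines = 0;
    CHECK(cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, 0));
    std::printf("async copy engines: %d (2 or more needed for overlap)\n", engines);

    CHECK(cudaMallocHost(&h_src, kBytes));  // pinned memory, needed for async overlap
    CHECK(cudaMallocHost(&h_dst, kBytes));
    CHECK(cudaMalloc(&d_src, kBytes));
    CHECK(cudaMalloc(&d_dst, kBytes));
    std::memset(h_src, 1, kBytes);
    CHECK(cudaMemset(d_src, 1, kBytes));
    CHECK(cudaStreamCreate(&s_h2d));
    CHECK(cudaStreamCreate(&s_d2h));

    measure("H2D only", true, false);
    measure("D2H only", false, true);
    measure("concurrent", true, true);

    CHECK(cudaStreamDestroy(s_h2d));
    CHECK(cudaStreamDestroy(s_d2h));
    CHECK(cudaFree(d_src));
    CHECK(cudaFree(d_dst));
    CHECK(cudaFreeHost(h_src));
    CHECK(cudaFreeHost(h_dst));
    return 0;
}
```

Comparing the "concurrent" lines against the two single-direction runs gives the per-direction drop. The sketch only measures the drop; it cannot say whether the contention sits in the PCIe link, the CPU interconnect, or system memory, so the questions above about CPU model, DRAM speed grade, and DIMM population still apply.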
