I found that concurrent H2D and D2H memory copies contend for bandwidth.
For example, in my system (x16 PCIe 3.0 + RTX 3080), the H2D and D2H bandwidth is around 12 GB/s:
> ./test_bandwidth D2H
Device-to-Host Bandwidth: 11.9104 GB/s
> ./test_bandwidth H2D
Host-to-Device Bandwidth: 12.2586 GB/s
But if I run both copies in two concurrent CUDA streams:
> ./test_bandwidth Concurrent
Host-to-Device Bandwidth: 9.4521 GB/s
Device-to-Host Bandwidth: 9.45441 GB/s
Both bandwidths drop by roughly 22%. Where does the contention come from?
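For reference, the concurrent case is roughly structured like this (a sketch only — the actual test_bandwidth source was not posted, so the buffer size, names, and timing approach here are my assumptions):

```cuda
// Sketch of a concurrent H2D+D2H bandwidth test (assumed structure,
// not the actual test_bandwidth source).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256ull << 20;  // 256 MB, well above the ~16 MB needed to saturate PCIe

    // Pinned host buffers are required for cudaMemcpyAsync to actually overlap
    float *h_src, *h_dst, *d_a, *d_b;
    cudaMallocHost(&h_src, bytes);
    cudaMallocHost(&h_dst, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events recorded on the legacy default stream act as a barrier
    // around the work issued in the two blocking streams s1 and s2.
    cudaEventRecord(start);
    // Issue both directions at once; on GPUs with dual copy engines
    // these run simultaneously and share PCIe/system-memory bandwidth.
    cudaMemcpyAsync(d_a, h_src, bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(h_dst, d_b, bytes, cudaMemcpyDeviceToHost, s2);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D+D2H aggregate: %.2f GB/s\n", 2.0 * bytes / (ms * 1e6));
    return 0;
}
```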
Various pieces of information that you have not provided may matter. Here is a recent thread that discusses something similar; you may get better responses by providing more details.
Contention can arise in the PCIe interconnect, any applicable inter-processor interconnect (with a single CPU, this can still apply when the CPU is based on chiplets internally), and the system memory.
In my experience, some throughput reduction is normal when driving maximum bi-directional PCIe traffic to the GPU, but it is typically less than the 22% reduction seen here (18.9 GB/sec = 78.2% of 24.2 GB/sec).
Assuming that this is a straightforward single-socket system and that a sufficiently large transfer size was chosen for the bandwidth test, the uni-directional bandwidth already seems a tad on the low side; ideally it should be closer to 13 GB/sec.
My best guess is that this is a case of a host system with an older CPU and / or relatively low system memory bandwidth (i.e. a small number of DDR4 channels) and / or system memory comprised of a lower speed grade of DRAM (less than DDR4-3200).
What is the CPU model used in this system? What speed grade of DDR4 is used? Are all DIMM slots populated? What transfer size was used in the bandwidth test? You would want a transfer size >= 16 MB for maximum throughput. Is test_bandwidth your own creation, or code taken from NVIDIA's sample apps or an open-source repository?
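If test_bandwidth is home-grown, it may also be worth cross-checking against NVIDIA's bandwidthTest sample from the CUDA samples, using pinned memory and a range of large transfer sizes (flags as I recall them from the samples; verify against your samples version):

```shell
# Uni-directional, pinned memory, 16 MB to 256 MB transfers
./bandwidthTest --memory=pinned --mode=range --start=16777216 --end=268435456 --increment=16777216 --htod
./bandwidthTest --memory=pinned --mode=range --start=16777216 --end=268435456 --increment=16777216 --dtoh
```

If bandwidthTest shows the same ~12 GB/sec uni-directional number, the limit is in the host system rather than in the benchmark code.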