H<->D memcpy bottleneck for multi-thread application

Hello,
I am developing a CUDA application that uses multiple host threads. Several host threads are launched, each of which tries to occupy the single existing CUDA device, performs an H2D transfer, does its computation, and performs a D2H transfer. After the data is copied back, each thread evaluates the results and performs some memory copies on the host side.
There is only one CUDA device in the system. I use several threads to take advantage of this device, in order to hide the cost of the host-side operations. However, I have observed that the H<->D transfer bandwidth decreases as the number of host threads increases. I have also noticed that PCIe 3.0 (x16) has a maximum H<->D bandwidth of about 16 GB/sec, while the host-side DDR3 memory bandwidth is about 12 GB/sec. Does this mean that the bottleneck for H<->D transfers in this case is the host memory bandwidth?
If that is the answer, is there any way to realize my initial design idea?

PCIe uses a packetized transport, and packet-size restrictions on both the host side and the GPU side limit the practically achievable bandwidth to about 12 GB/sec in each direction. Due to the fixed-size overhead of each PCIe transfer, achieving this maximum possible throughput requires fairly large individual transfers, often >= 4 MB.
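For reference, a quick way to see this size dependence is to time cudaMemcpy from pinned memory over a range of transfer sizes. A minimal sketch (buffer sizes and iteration count are arbitrary illustrative choices):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t maxBytes = 64 << 20;     // 64 MB upper bound (illustrative)
    void *hostBuf, *devBuf;
    cudaMallocHost(&hostBuf, maxBytes);   // pinned host memory
    cudaMalloc(&devBuf, maxBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (size_t bytes = 64 << 10; bytes <= maxBytes; bytes <<= 2) {
        cudaEventRecord(start);
        for (int i = 0; i < 20; i++)      // average over 20 transfers
            cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%8zu KB: %7.2f GB/sec\n", bytes >> 10,
               20.0 * bytes / (ms * 1e-3) / 1e9);
    }
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    return 0;
}
```

On a PCIe3 x16 link, the reported bandwidth typically only approaches the ~12 GB/sec plateau once transfers reach the multi-megabyte range.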

PCIe provides a full-duplex link, so for GPUs with two DMA engines (e.g. Tesla) you could have two streams shuffling close to 12 GB/sec of data from host to device and from device to host simultaneously. Obviously the host's system memory serves as the source/sink for this data, and slow host system memory can indeed negatively impact the transfers, even more so when synchronous transfers are used, since those involve an additional intermediate copy inside the host's system memory.
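A minimal sketch of such duplex operation, assuming the device buffers and pinned host buffers have already been allocated (devIn, hostIn, devOut, hostOut are placeholder names):

```
#include <cuda_runtime.h>

// Overlap an H2D copy with a D2H copy by issuing them in two different
// streams. Both host pointers must refer to pinned memory (cudaMallocHost)
// for the copies to be asynchronous and use the two DMA engines concurrently.
void duplexCopy(void *devIn, const void *hostIn,
                void *hostOut, const void *devOut, size_t bytes)
{
    cudaStream_t h2d, d2h;
    cudaStreamCreate(&h2d);
    cudaStreamCreate(&d2h);

    cudaMemcpyAsync(devIn, hostIn, bytes, cudaMemcpyHostToDevice, h2d);
    cudaMemcpyAsync(hostOut, devOut, bytes, cudaMemcpyDeviceToHost, d2h);

    cudaStreamSynchronize(h2d);
    cudaStreamSynchronize(d2h);
    cudaStreamDestroy(h2d);
    cudaStreamDestroy(d2h);
}
```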

However, modern PC platforms use multiple DDR4 channels for system memory, which can easily feed a PCIe3 x16 link. Even before the introduction of DDR4, PC platforms typically had multiple DDR3 channels, which also provided sufficient bandwidth. I recall system memory bandwidth issues impacting the performance of CUDA programs only on very old GPU-accelerated PC platforms (pre-2010? my memory is hazy). Are you sure your host's system memory bandwidth is limited to 12 GB/sec? What CPU does it use? How many channels, and what speed grade of DDR3?

In general you would want to minimize CPU <-> GPU copies. To reduce bandwidth requirements, in some contexts it can make sense to use some cheap on-the-fly compression, or to choose a smaller data type. Much real-life data originates from sensors that provide a resolution of only 11 to 15 bits.
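As an illustration, a hypothetical sketch for 12-bit sensor samples: ship them across PCIe as uint16_t and widen them to float on the device, roughly halving the transfer volume compared to sending float directly. The names and the widening kernel are placeholders, not anything from your application:

```
#include <cuda_runtime.h>

// Widen compact 16-bit sensor samples to float on the device.
__global__ void widenToFloat(const unsigned short *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)in[i];   // apply scale/offset here if needed
}

// Transfer the compact representation, then expand it on the GPU.
void uploadCompact(const unsigned short *hostRaw, unsigned short *devRaw,
                   float *devFloat, int n, cudaStream_t stream)
{
    cudaMemcpyAsync(devRaw, hostRaw, n * sizeof(unsigned short),
                    cudaMemcpyHostToDevice, stream);
    widenToFloat<<<(n + 255) / 256, 256, 0, stream>>>(devRaw, devFloat, n);
}
```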

GPU kernels run asynchronously to the CPU, and you can (and should) use asynchronous copies to overlap GPU work with copy operations. So it is not clear to me why you need multiple host threads interacting with the device. A single host thread using multiple CUDA streams asynchronously is likely closer to what you want, but then I don’t know the details of your use case.
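For illustration only, a sketch of that single-thread, multi-stream pattern; 'process', 'grid', 'block', numChunks, chunkBytes, and the buffer arrays are placeholders for whatever your application actually does:

```
// Each chunk's H2D copy, kernel, and D2H copy go into one stream;
// chunks issued to different streams can overlap with each other.
const int kNumStreams = 4;                 // illustrative choice
cudaStream_t streams[kNumStreams];
for (int i = 0; i < kNumStreams; i++)
    cudaStreamCreate(&streams[i]);

for (int c = 0; c < numChunks; c++) {
    int s = c % kNumStreams;
    cudaMemcpyAsync(devIn[s], hostIn[c], chunkBytes,
                    cudaMemcpyHostToDevice, streams[s]);
    process<<<grid, block, 0, streams[s]>>>(devIn[s], devOut[s]);
    cudaMemcpyAsync(hostOut[c], devOut[s], chunkBytes,
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();                   // wait for all streams to drain
```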

Thank you for the reply, njuffa. ^_^
The application was tested on a system with an Intel Xeon E5 2680 CPU and a single-channel DDR3-1600 memory bank, so I assume the peak memory bandwidth to be 1600 MT/sec * 8 bytes ≈ 12.8 GB/sec.
In my application, multiple host threads consume a work queue. Each thread uses CUDA streams to issue asynchronous copies, kernel launches, and so on. Because each task is large, a host thread needs a rather long time to evaluate and process the results returned from the device. That is why I want to use multiple host threads to occupy the CUDA device (a Tesla P100, in fact) in turn and reduce the GPU's idle time. There will be a lot of memory-movement operations on the host side. I wonder whether these memory operations in each thread limit the memory bandwidth available for host-device transfers (although pinned memory is used).
Could you please point out whether there is something wrong with my idea, or give some suggestions? Thanks again.

Intel lists the Xeon E5 2680 processor as an octa-core CPU with a four-channel DDR3 interface, providing 51.2 GB/sec of system memory throughput when using DDR3-1600:

https://ark.intel.com/products/64583/Intel-Xeon-Processor-E5-2680-20M-Cache-2_70-GHz-8_00-GTs-Intel-QPI

I didn't even know you could run these CPUs with just a single memory channel. You can measure system memory bandwidth with the STREAM benchmark. I would be surprised if STREAM reported more than 10 GB/sec per channel of usable bandwidth.
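If you don't want to set up STREAM itself, even a crude single-threaded triad loop gives a rough lower bound. This is only a sketch with illustrative sizes; the real STREAM benchmark is more careful (OpenMP, repeated timed passes):

```
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 32 << 20;           // three arrays of ~256 MB each
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + 3.0 * b[i];        // triad: two reads, one write
    auto t1 = std::chrono::steady_clock::now();
    double sec = std::chrono::duration<double>(t1 - t0).count();
    // print c[n-1] so the compiler cannot discard the loop
    printf("triad: %.2f GB/sec (check: %f)\n",
           3.0 * n * sizeof(double) / sec / 1e9, c[n - 1]);
    return 0;
}
```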

The solution to any potential performance impact from lack of system memory throughput seems obvious: populate all four memory channels. Or at least two.

This is a red flag. Based on my reasonably long experience as a software engineer, there are exactly two questions that are almost always indicative of a software design problem: (1) How can I copy data faster? (2) What is the fastest way to invert a matrix? In both cases the best top-level answer is: You probably shouldn’t do that (i.e. moving data around, inverting matrices).

Is the application of a streaming nature? If so, where is the source data ultimately coming from? Has the application been carefully profiled? Does profiling indicate that data movement is the primary bottleneck?

The host memory bandwidth was benchmarked and indeed proved to be somewhat low, so testing the application on a system with higher memory bandwidth seems a reasonable next step.
As for the mechanism of the current application: yes, it is true that one should avoid moving data around. But in our case, the work queue contains batches of data generated by an initializer during the startup stage. After that, the host threads continuously consume the batches in the queue and process each batch on the GPU. The returned batches must be re-collected to form new batches, each of which must contain data without divergence so it can be processed efficiently by the GPU again. As a result, the re-collection operation (i.e., a memory-exchange operation on the host side) seems to be necessary; a sketch of what I mean follows.
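To make the discussion concrete, here is a rough sketch of the re-collection step, with a hypothetical Item type and branchKey classifier standing in for the real data:

```
#include <vector>

struct Item { int pathId; /* payload... */ };

// Hypothetical classifier: which execution path an item will take.
int branchKey(const Item &it) { return it.pathId; }

// Bucket returned results by execution path so that each new batch
// is divergence-free when it is sent back to the GPU.
void recollect(const std::vector<Item> &returned,
               std::vector<std::vector<Item>> &batches, int numKeys)
{
    batches.assign(numKeys, {});
    for (const Item &it : returned)
        batches[branchKey(it)].push_back(it);   // one host-side copy per item
}
```

That per-item copy is exactly the host-side memory traffic I am worried might compete with the H<->D transfers.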