Here is an example code snippet:
// s1 and s2 are previously created streams; a should be pinned host memory for the copies to be truly asynchronous
cudaMemcpyAsync(d_a, a, size, cudaMemcpyHostToDevice, s1);
cudaMemcpyAsync(d_a, a, size, cudaMemcpyHostToDevice, s2);
In the above example, two cudaMemcpyHostToDevice transfers are scheduled on different streams. Provided that the GPU architecture has enough copy engines, will the two overlap with each other for better performance?
The PCIe bandwidth is limited. If one transfer uses the full bandwidth, another transfer cannot run concurrently.
It is important to keep in mind that PCIe is a full-duplex interconnect. If there are DMA transfers across a PCIe link in different directions, they can therefore run concurrently. Two transfers in the same direction on the same PCIe link cannot run concurrently for the reason given by @striker159. Sharing the available bandwidth between multiple transfers in the same direction would not improve overall throughput.
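To illustrate the full-duplex point, here is a minimal sketch of transfers in opposite directions issued on two streams so they can run concurrently; the buffer names and sizes are illustrative, and error checking is omitted for brevity:

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t size = 64 << 20;          // 64 MiB per buffer (illustrative)
    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  size);          // pinned host memory, required for
    cudaMallocHost(&h_out, size);          // cudaMemcpyAsync to be truly async
    cudaMalloc(&d_in,  size);
    cudaMalloc(&d_out, size);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // H2D on s1 and D2H on s2 use different directions of the PCIe link
    // (and different DMA engines), so they can overlap.
    cudaMemcpyAsync(d_in,  h_in,  size, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(h_out, d_out, size, cudaMemcpyDeviceToHost, s2);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```

Had both copies gone in the same direction, as in the original snippet, they would share the link's one-way bandwidth and not finish any sooner.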
So, can you please tell me what the copy engines are for? Will more copy engines help improve performance?
A GPU can utilize two DMA units concurrently for a PCIe link, one per direction. Some GPUs may have other links instead of or in addition to PCIe and additional DMA units may be associated with those links.
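As a side note, you can query how many asynchronous copy engines the runtime reports for a device; a sketch, assuming device 0:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Number of async copy (DMA) engines the runtime reports for device 0.
    // A value of 2 typically corresponds to one engine per PCIe direction.
    int engines = 0;
    cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, 0);
    printf("asyncEngineCount: %d\n", engines);
    return 0;
}
```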
From what I have seen in recent years, CUDA reports various numbers of “copy engines” per GPU (I have seen up to six reported), with the count at times varying with the CUDA version. I have no idea what that is about, and I am not aware of relevant documentation. You could try asking NVIDIA to clarify how they count “copy engines”.