Data transfer in/out of CUDA: API time is unstable

Hi Sir,

In our design, we need to transfer a certain amount of data into and out of CUDA within a fixed time period. “figure 1” in the image below shows an example of data being transferred into and out of CUDA with a 5 ms period.

As you can see, the data transfer is fast before 150 ms and after 410 ms, but between 150 ms and 410 ms the transfer efficiency is much lower.

“figure 2” in the image below shows the detailed information before 150 ms,

and “figure 3” shows the detailed information between 150 ms and 410 ms.

It is obvious that between 150 and 410 ms, cudaMemcpyAsync takes much more time than it does before 150 ms.

We would like to know how to improve the performance of the cudaMemcpyAsync calls; any suggestion would be much appreciated, thanks.
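
To give a rough idea of our setup, each period looks roughly like the simplified sketch below (the buffer sizes, names, and kernel are placeholders, not our actual code):

```
// Simplified sketch of the per-period transfer pattern (all names and sizes
// below are placeholders for illustration only).
#include <cuda_runtime.h>

__global__ void process(float* data, size_t n)
{
    // Placeholder for the real per-period processing.
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const size_t N = 1 << 20;                  // placeholder payload per 5 ms period
    float *h_in, *h_out, *d_buf;
    cudaMallocHost(&h_in,  N * sizeof(float)); // pinned host buffers
    cudaMallocHost(&h_out, N * sizeof(float));
    cudaMalloc(&d_buf, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int period = 0; period < 100; ++period) {
        cudaMemcpyAsync(d_buf, h_in, N * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        process<<<(unsigned)((N + 255) / 256), 256, 0, stream>>>(d_buf, N);
        cudaMemcpyAsync(h_out, d_buf, N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);         // must finish before the next 5 ms period
    }

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```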

The CUDA software stack is not designed for hard real-time environments. Actions are performed on a best effort basis. CUDA may be used in environments with soft real-time requirements, which implies that your application’s performance degrades gracefully if deadlines are not met (example: dropping frames in video stream).

The following checklist assumes use of a discrete GPU. If this question pertains to one of NVIDIA’s integrated embedded platforms (Jetson etc), please ask in the sub-forum dedicated to the specific platform.

(1) Make sure that your GPU is linked to the host system via a PCIe 4 x16 link. Ideally the GPU should be coupled to a x16 link provided directly by the host CPU (not by an intermediate PCIe switch). One way to check the current link state programmatically is sketched after this list.

(2) If this is a multi-socket system, or a single-socket system with a CPU internally constructed from multiple clusters, use numactl to control processor and memory affinity such that the GPU communicates with the “near” CPU and “near” system memory.

(3) Make sure your host system uses DDR4-3200 memory, with as many channels populated as possible, ideally in an 8-channel configuration. DDR5 system memory would be a good alternative, but is new and currently quite expensive.

(4) PCIe transfers are packetized, so the total transfer time is minimized when data is transferred with as few individual transactions as are feasible (that is, in as large chunks as possible). Note that this may lead to tradeoffs between latency and throughput.

(5) Make sure the host system is lightly loaded. In a heavily loaded system, system memory can become a bottleneck, as it is a shared resource between software running on the host and the source or destination of CPU<->GPU transfers.

(6) Use a CPU with high single-thread performance to minimize API overhead. To first order, this means choosing a CPU with high base frequency. I suggest using >= 3.5 GHz base clock.
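
Regarding item (1), here is a minimal sketch of how the current PCIe link generation and width can be queried through NVML (link against -lnvidia-ml; device index 0 is assumed). nvidia-smi -q should report similar link information.

```
// Sketch: query the GPU's current PCIe link generation and width via NVML.
// Assumes device index 0; build with: nvcc pcie_check.cpp -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main()
{
    if (nvmlInit_v2() != NVML_SUCCESS) {
        printf("NVML initialization failed\n");
        return 1;
    }

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) == NVML_SUCCESS) {
        unsigned int gen = 0, width = 0;
        nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);
        nvmlDeviceGetCurrPcieLinkWidth(dev, &width);
        printf("Current PCIe link: gen%u x%u\n", gen, width);
    }

    nvmlShutdown();
    return 0;
}
```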

Hi njuffa,

(4) PCIe transfers are packetized, so the total transfer time is minimized when data is transferred with as few individual transactions as are feasible (that is, in as large chunks as possible). Note that this may lead to tradeoffs between latency and throughput.

In the figure below, there are 100 transactions on both host to device (green) and device to host (pink). Do you mean that if we can reduce the number of transactions (e.g. from 100 to 50), the problem may be improved?
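
To make sure I understand, is something like the consolidation below what you have in mind? The sketch packs many small host buffers into one pinned staging buffer so that a single cudaMemcpyAsync replaces the ~100 individual copies (the buffer names and sizes are made up for illustration):

```
// Hypothetical sketch: pack many small host buffers into one pinned staging
// buffer so that one large cudaMemcpyAsync replaces many small ones.
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const int    kNumBuffers = 100;        // ~100 transactions per period today
    const size_t kBufBytes   = 16 * 1024;  // made-up size of each small buffer

    // The application's existing small buffers (placeholders).
    std::vector<std::vector<char>> small(kNumBuffers, std::vector<char>(kBufBytes));

    char *h_staging, *d_staging;
    cudaMallocHost(&h_staging, kNumBuffers * kBufBytes);  // pinned staging buffer
    cudaMalloc(&d_staging, kNumBuffers * kBufBytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pack on the host, then issue a single large transfer.
    for (int i = 0; i < kNumBuffers; ++i)
        std::memcpy(h_staging + i * kBufBytes, small[i].data(), kBufBytes);

    cudaMemcpyAsync(d_staging, h_staging, kNumBuffers * kBufBytes,
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_staging);
    cudaFreeHost(h_staging);
    return 0;
}
```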

Thanks.

Generally speaking, PCIe throughput increases with transfer size and closely approaches the maximum only when the size of individual transfers reaches 10+ MB. You can easily measure this for yourself.

Assuming a constant volume of data, higher throughput means that the total time spent transferring data across PCIe is reduced, so usually it is a good idea to transfer the necessary data volume between CPU and GPU in as large chunks as possible. However, the latency of each data transfer also increases with its size, and this might have (potentially undesirable) knock-on effects in your software stack. The best way to find out how it all plays out is to run some experiments within your actual hardware/software context.
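
As a starting point for such an experiment, here is a minimal sketch that sweeps the host-to-device transfer size and reports the achieved throughput (pinned host memory, CUDA event timing; the sizes and repetition count are arbitrary choices). The bandwidthTest sample shipped with the CUDA samples does something similar.

```
// Sketch: measure host-to-device throughput as a function of transfer size.
// Uses pinned host memory and CUDA events; sizes/iterations are arbitrary.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t kMaxBytes = 64ULL << 20;   // sweep up to 64 MB
    const int    kReps     = 20;

    char *h_buf, *d_buf;
    cudaMallocHost(&h_buf, kMaxBytes);      // pinned host buffer
    cudaMalloc(&d_buf, kMaxBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (size_t bytes = 4096; bytes <= kMaxBytes; bytes *= 4) {
        cudaEventRecord(start);
        for (int i = 0; i < kReps; ++i)
            cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, 0);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gbps = (double)bytes * kReps / (ms * 1.0e-3) / 1.0e9;
        printf("%10zu bytes : %8.2f GB/s\n", bytes, gbps);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```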