cudaMemcpyAsync waits a long time to launch

My kernels and CUDA API calls seem to wait a long time before launching, and I don't know why. It seems to happen when my GPU is heavily used, at around 80% utilization.

cudaMemcpyAsync obeys stream semantics. That means that regardless of when you launch it, it will not begin until the previous activity in the stream has completed. (Other factors may also introduce additional delays, such as multiple “competing” cudaMemcpyAsync requests for transfers in the same direction). This “gap” between when your code issues the request and when it actually runs and completes shows up in the “API” section of the profiler timelines.
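A minimal sketch of those stream semantics (the kernel and buffer sizes here are hypothetical, not from the original question): both API calls below return to the host almost immediately, but the copy cannot begin until the kernel issued before it in the same stream finishes, and a profiler shows that gap in the API row.

```cpp
#include <cuda_runtime.h>

// Hypothetical long-running kernel that keeps the stream busy.
__global__ void busyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; ++k)
            d[i] += 1.0f;
}

int main()
{
    const int N = 1 << 20;
    float *d_buf = nullptr, *h_buf = nullptr;
    cudaMalloc(&d_buf, N * sizeof(float));
    cudaMallocHost(&h_buf, N * sizeof(float)); // pinned memory, needed for a truly async copy

    cudaStream_t s;
    cudaStreamCreate(&s);

    busyKernel<<<(N + 255) / 256, 256, 0, s>>>(d_buf, N);

    // Returns to the host right away, but per stream semantics the transfer
    // itself cannot start on the copy engine until busyKernel has completed.
    cudaMemcpyAsync(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost, s);

    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```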

When the GPU is “used heavily”, other activity in the stream, or other “competing” requests, may cause the execution to take place “later”. This also applies in an identical fashion to kernel launches.

There is not enough of a profiler excerpt here to make any specific statements, and I generally find working with the profiler through a forum thread to be difficult anyway, but careful study of the profiler timelines should reveal the specific reasons your kernel launches and copy operations execute when they do.

How long exactly is “a long time”? Microseconds, milliseconds, seconds?

It is not clear what the concern is here. In any system where multiple flows of execution (e.g. CUDA streams or CPU threads) share physical resources (e.g. memory, communication links, or computational engines), the latency experienced by each of the flows tends to increase as total throughput is maximized. The question mentions 80% GPU utilization, so this scenario seems applicable here. Generally speaking, GPUs are designed for throughput maximization.

If latency is indeed a concern here and the motivation for the question: Latency issues are sometimes addressable by breaking the demands on physical resources into smaller chunks (e.g. reduction in the runtime of kernels, or the size of data copies). However, such a move may also be counterproductive if the use of physical resources involves fixed startup costs or synchronization overhead. Latency can sometimes also be improved incrementally by minimizing synchronization between execution flows.
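As one illustration of the "smaller chunks" idea above (a sketch only; the buffer names, sizes, and helper function are hypothetical): a single large transfer can be issued as several smaller `cudaMemcpyAsync` calls. Each chunk occupies the copy engine for a shorter interval, which can reduce the latency seen by other work waiting for the engine, at the cost of more per-call overhead.

```cpp
#include <algorithm>
#include <cuda_runtime.h>

// Split one large device-to-host transfer into smaller requests.
// h_dst must be pinned host memory for the copies to be truly asynchronous.
void copyInChunks(char *h_dst, const char *d_src, size_t total, cudaStream_t stream)
{
    const size_t chunk = 4 * 1024 * 1024; // 4 MiB per request; tune for your workload
    for (size_t off = 0; off < total; off += chunk) {
        size_t bytes = std::min(chunk, total - off);
        cudaMemcpyAsync(h_dst + off, d_src + off, bytes,
                        cudaMemcpyDeviceToHost, stream);
    }
}
```

Whether this helps depends on the fixed per-call cost mentioned above; if the chunks become too small, the launch overhead of the extra calls can outweigh the latency benefit.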

Thanks for your reply!

My GPU code is called frequently. When the GPU is under 10% utilization, most latencies are around 0.05 ms. But when GPU utilization is around 80%, some calls have latencies of around 1~2 ms or even longer, which is not acceptable.
Are there any suggestions to fix this? I would greatly appreciate it.

I made some suggestions of things to try in my previous post. They may or may not be applicable to your application, about which I know nothing.

I will try that, thank you anyway.

Thanks for your reply!

So if “competing” requests delay the kernel launch time, is there any way to avoid this and let these few kernels launch immediately?

A few options:

  1. Get more GPUs.
  2. For kernels launched from the same host process, you could investigate stream priorities.
  3. For kernels launched from separate host processes, you could investigate the resource reservation/provisioning features available in CUDA MPS.

You may also find some other useful ideas in the following publication (I have only skimmed through it, so this is not an endorsement):

Ming Yang, et al., “Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems.” In Proceedings of the 30th Euromicro Conference on Real-Time Systems (ECRTS), July 2018.
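A brief sketch of the stream-priority option (runtime API calls as documented; the stream names are illustrative):

```cpp
#include <cuda_runtime.h>

int main()
{
    // Query the supported priority range. A numerically lower value
    // means a higher priority.
    int leastPrio = 0, greatestPrio = 0;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    cudaStream_t hiPrio, loPrio;
    cudaStreamCreateWithPriority(&hiPrio, cudaStreamNonBlocking, greatestPrio);
    cudaStreamCreateWithPriority(&loPrio, cudaStreamNonBlocking, leastPrio);

    // Launch latency-sensitive kernels into hiPrio and background work into
    // loPrio; the scheduler favors pending work in the higher-priority stream.
    // Note: stream priorities affect kernel scheduling, not memory copies.

    cudaStreamDestroy(hiPrio);
    cudaStreamDestroy(loPrio);
    return 0;
}
```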

Depending on your exact usage, it’s possible that none of those methods will be particularly useful for getting “right now” behavior from cudaMemcpyAsync. That is largely going to be a function of the work enqueued and the order in which it is enqueued. Given an arbitrary backlog of work in a queue that is delivering data to or from the GPU using cudaMemcpyAsync, there isn’t anything I can think of to make it go “right now”. And in a stream-oriented workflow, such a desire could be illogical: you don’t want stream-oriented work to begin until the stream is finished with previously issued work.