cudaMemcpyAsync waits a long time to launch

My kernels and CUDA API calls seem to wait a long time before launching, and I don't know why. It seems to happen when my GPU is heavily used, at around 80% utilization.

cudaMemcpyAsync obeys stream semantics. That means that regardless of when you launch it, it will not begin until the previous activity in the stream has completed. (Other factors may also introduce additional delays, such as multiple “competing” cudaMemcpyAsync requests for transfers in the same direction). This “gap” between when your code issues the request and when it actually runs and completes shows up in the “API” section of the profiler timelines.
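A minimal sketch of those stream semantics (the kernel and buffer sizes here are hypothetical, not from the original question): both API calls below return to the host almost immediately, but the copy cannot begin until the kernel issued before it in the same stream finishes, and a profiler shows that gap in the API row.

```cpp
#include <cuda_runtime.h>

// Hypothetical long-running kernel that keeps the stream busy.
__global__ void busyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; ++k)
            d[i] += 1.0f;
}

int main()
{
    const int N = 1 << 20;
    float *d_buf = nullptr, *h_buf = nullptr;
    cudaMalloc(&d_buf, N * sizeof(float));
    cudaMallocHost(&h_buf, N * sizeof(float)); // pinned memory, needed for a truly async copy

    cudaStream_t s;
    cudaStreamCreate(&s);

    busyKernel<<<(N + 255) / 256, 256, 0, s>>>(d_buf, N);

    // Returns to the host right away, but per stream semantics the transfer
    // itself cannot start on the copy engine until busyKernel has completed.
    cudaMemcpyAsync(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost, s);

    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```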

When the GPU is “used heavily”, other activity in the stream, or other “competing” requests, may cause the execution to take place “later”. This also applies in an identical fashion to kernel launches.

There is not enough of a profiler excerpt here to make any specific statements, and I generally find working with the profiler through a forum thread to be difficult anyway, but careful study of the profiler timelines should reveal the specific reasons your kernel launches and copy operations execute when they do.

How long exactly is “a long time”? Microseconds, milliseconds, seconds?

It is not clear what the concern is here. In any system where multiple flows of execution (e.g. CUDA streams or CPU threads) share physical resources (e.g. memory, communication links, or computational engines), the latency experienced by each of the flows tends to increase as total throughput is maximized. The question mentions 80% GPU utilization, so this scenario seems applicable here. Generally speaking, GPUs are designed for throughput maximization.

If latency is indeed a concern here and the motivation for the question: Latency issues are sometimes addressable by breaking the demands on physical resources into smaller chunks (e.g. reduction in the runtime of kernels, or the size of data copies). However, such a move may also be counterproductive if the use of physical resources involves fixed startup costs or synchronization overhead. Latency can sometimes also be improved incrementally by minimizing synchronization between execution flows.
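As one illustration of the "smaller chunks" idea above (a sketch only; the buffer names, sizes, and helper function are hypothetical): a single large transfer can be issued as several smaller `cudaMemcpyAsync` calls. Each chunk occupies the copy engine for a shorter interval, which can reduce the latency seen by other work waiting for the engine, at the cost of more per-call overhead.

```cpp
#include <algorithm>
#include <cuda_runtime.h>

// Split one large device-to-host transfer into smaller requests.
// h_dst must be pinned host memory for the copies to be truly asynchronous.
void copyInChunks(char *h_dst, const char *d_src, size_t total, cudaStream_t stream)
{
    const size_t chunk = 4 * 1024 * 1024; // 4 MiB per request; tune for your workload
    for (size_t off = 0; off < total; off += chunk) {
        size_t bytes = std::min(chunk, total - off);
        cudaMemcpyAsync(h_dst + off, d_src + off, bytes,
                        cudaMemcpyDeviceToHost, stream);
    }
}
```

Whether this helps depends on the fixed per-call cost mentioned above; if the chunks become too small, the launch overhead of the extra calls can outweigh the latency benefit.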

Thanks for your reply!

My GPU code is called frequently. When the GPU is under 10% utilization, most latencies are around 0.05 ms. But when GPU utilization is around 80%, some calls have latencies of around 1~2 ms or even longer, which is not acceptable.
Are there any suggestions to fix this? I would greatly appreciate it.

I made some suggestions of things to try in my previous post. They may or may not be applicable to your application, about which I know nothing.

I will try that, thank you anyway.

Thanks for your reply!

So if “competing” requests delay the kernel launch time, is there any way to avoid this and let these few kernels launch immediately?

A few options:

  1. Get more GPUs.
  2. For kernels launched from the same host process, you could investigate stream priorities.
  3. For kernels launched from separate host processes, you could investigate the resource reservation/provisioning features available in CUDA MPS.

You may also find some other useful ideas in the following publication (I have only skimmed through it, so this is not an endorsement):

Ming Yang, et al., “Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems.” In Proceedings of the 30th Euromicro Conference on Real-Time Systems (ECRTS), July 2018.
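A brief sketch of the stream-priority option (runtime API calls as documented; the stream names are illustrative):

```cpp
#include <cuda_runtime.h>

int main()
{
    // Query the supported priority range. A numerically lower value
    // means a higher priority.
    int leastPrio = 0, greatestPrio = 0;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    cudaStream_t hiPrio, loPrio;
    cudaStreamCreateWithPriority(&hiPrio, cudaStreamNonBlocking, greatestPrio);
    cudaStreamCreateWithPriority(&loPrio, cudaStreamNonBlocking, leastPrio);

    // Launch latency-sensitive kernels into hiPrio and background work into
    // loPrio; the scheduler favors pending work in the higher-priority stream.
    // Note: stream priorities affect kernel scheduling, not memory copies.

    cudaStreamDestroy(hiPrio);
    cudaStreamDestroy(loPrio);
    return 0;
}
```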

Depending on your exact usage, it’s possible that none of those methods will be particularly useful for getting “right now” behavior from cudaMemcpyAsync. That is largely going to be a function of the work enqueued and the order in which it is enqueued. Given an arbitrary backlog of work in a queue that is delivering data to or from the GPU using cudaMemcpyAsync, there isn’t anything I can think of to make it go “right now”. And in a stream-oriented workflow, such a desire could be illogical: you don’t want stream-oriented work to begin until the stream is finished with previously issued work.