I’ve observed that cudaMemcpyAsync, as well as many other theoretically non-blocking CUDA APIs, actually pauses the calling thread for a few microseconds (~2 us). Is it possible to get a general idea of what happens at the CUDA driver level? Does the thread remain blocked, waiting for the command to be accepted by the GPU?
If multiple threads within the same CUDA context are launching commands in succession, does the CUDA driver accept them in parallel, or is there a mechanism that serializes them (at the CPU thread level, I mean)?
In our application all these pauses affect the system more than the data transfers and the actual GPU work, which is usually extremely fast (congratulations!).
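For reference, a minimal sketch of how the host-side cost of the call itself can be measured (assumptions: device 0, pinned host memory, a small transfer, and a dedicated stream; exact numbers will vary by system):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 4096;
    void *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, bytes);   // pinned memory, so the copy is truly async
    cudaMalloc(&d, bytes);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm up so one-time driver initialization does not skew the numbers.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // Time only the host-side duration of the API calls, not the copies.
    const int iters = 1000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    auto t1 = std::chrono::steady_clock::now();
    cudaStreamSynchronize(stream);

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("avg host-side cost of cudaMemcpyAsync: %.2f us\n", us / iters);

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```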
Thanks
Any CUDA API call may block or synchronize for various reasons, such as contention for or unavailability of internal resources. Such behavior is subject to change, and undocumented behavior should not be relied upon.
There are many questions like this on these forums; here is one recent one. Some of the questions you can find show attempts to use debuggers to figure out what is going on at a lower level, and some of them show indications of lock negotiation, consistent with the statement in the documentation.
No, I don’t have amazing suggestions for how to make this go away. Issuing all work from a single thread should certainly mitigate the variation in latency. Yes, I understand that is not a very palatable suggestion. I have no others.
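Purely as an illustration of that suggestion, here is a minimal sketch of funneling all CUDA work through one dedicated submission thread (the SubmitThread class is hypothetical scaffolding, not a CUDA API; the CUDA calls would go inside the closures the worker executes):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// All CUDA runtime calls are made by a single worker thread; other threads
// only enqueue closures, so they never contend for driver-internal locks.
class SubmitThread {
public:
    SubmitThread() : worker_([this] { run(); }) {}
    ~SubmitThread() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    // Callable from any thread; the closure should contain the CUDA calls,
    // e.g. a cudaMemcpyAsync or a kernel launch on some stream.
    void enqueue(std::function<void()> work) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(work)); }
        cv_.notify_one();
    }
private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            if (done_ && q_.empty()) return;
            auto work = std::move(q_.front());
            q_.pop();
            lk.unlock();
            work();              // the only place CUDA APIs are invoked
            lk.lock();
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    bool done_ = false;
    std::thread worker_;
};
```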
Thank you very much for your fast reply.
Do you think using a combined CPU+GPU (Grace) would reduce those latencies?
I see that NVIDIA Aerial (which is a low-latency application with many very small and fast kernels) is trying to reduce CPU/GPU interactions to zero, so I think this is still the bottleneck of the system.
You may want to try a scaling experiment using host CPUs of different single-thread performance. To first order, single-thread performance will be dominated by CPU clock frequency.
The 2 microsecond delay mentioned in the original post seems indicative of basic host / device communication overhead. Given that, minimizing CPU-GPU interactions of this sort seems like a sensible optimization strategy.
My (limited) understanding is that this overhead is largely attributable to PCIe hardware mechanisms, and only the smaller portion of it to software overhead on the host. The scaling experiment I suggested would explore the impact of the software component.
My usual recommendation is to use CPUs running at > 3.5 GHz to minimize impacts from driver overhead. SPEC CPU2017 Int Speed results suggest that, among CPUs for which data was submitted, the highest single-thread performance is offered by the AMD EPYC 4564P and the AMD Ryzen 9 7950X. According to their specifications, these processors have a nominal clock of 4.5 GHz and a maximum boost clock of 5.7 GHz.
I do not know whether newer PCIe standards offer lower latencies; improvements in PCIe are typically focused on increasing bandwidth. As far as I know, there are no GPUs yet that offer PCIe5 support, but once there are, you could compare PCIe5 with PCIe4 to see whether you observe any improvement.
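On the software side, one mechanism for reducing per-call CPU/GPU interactions in workloads with many very small kernels (the Aerial-style scenario mentioned above) is CUDA graphs: record the launch sequence once via stream capture, then replay it with a single cudaGraphLaunch per iteration. A minimal sketch, where smallKernel is a hypothetical placeholder and cudaGraphInstantiate uses the CUDA 12 signature:

```cpp
#include <cuda_runtime.h>

__global__ void smallKernel(float *x) { x[threadIdx.x] += 1.0f; }

int main() {
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    cudaMemset(d, 0, 256 * sizeof(float));
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Record 20 tiny launches into a graph via stream capture.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 20; ++i)
        smallKernel<<<1, 256, 0, s>>>(d);
    cudaStreamEndCapture(s, &graph);
    cudaGraphInstantiate(&exec, graph, 0);   // CUDA 12 signature

    // Each iteration now costs one host-side launch instead of twenty.
    for (int iter = 0; iter < 1000; ++iter)
        cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```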