We have a single-process, multi-threaded application with two threads running on isolated cores, creating/updating/launching CUDA graphs on a single GPU.
Both threads run in parallel and can launch graphs at the same time. My understanding is that the GPU can schedule kernels from both threads concurrently (whereas a multi-process setup would need MPS, or the kernels would be serialized), which is great. Is this the best solution for a real-time, low-latency application, or do you suggest having a single thread launch the graphs? Is there some lock mechanism when multiple CPU threads submit work to the GPU? If having multiple CPU threads forces the CUDA driver to take locks, we can try to change our architecture.
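For reference, here is a minimal, compilable sketch of the pattern we are describing; the kernel, the graph contents, and the iteration counts are placeholders, not our actual workload:

```cpp
#include <cuda_runtime.h>
#include <thread>

__global__ void shortKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;          // stand-in for real work
}

// Capture several short kernels into one graph and instantiate it.
static cudaGraphExec_t buildGraphExec(cudaStream_t s, float *d, int n)
{
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeThreadLocal);
    for (int k = 0; k < 8; ++k)                    // "many short kernels"
        shortKernel<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaStreamEndCapture(s, &graph);
    cudaGraphInstantiate(&exec, graph, 0);         // CUDA 12-style signature
    cudaGraphDestroy(graph);
    return exec;
}

// Each worker owns its stream and its exec; nothing is shared across threads.
static void workerLoop(cudaStream_t s, cudaGraphExec_t exec, int iters)
{
    for (int i = 0; i < iters; ++i) {
        cudaGraphLaunch(exec, s);   // one API call enqueues the whole graph
        cudaStreamSynchronize(s);   // or poll with cudaStreamQuery()
    }
}

int main()
{
    const int n = 1 << 16;
    float *d0, *d1;
    cudaMalloc(&d0, n * sizeof(float));
    cudaMalloc(&d1, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreateWithFlags(&s0, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);

    // Two host threads launching graphs to one GPU in parallel.
    std::thread t0(workerLoop, s0, buildGraphExec(s0, d0, n), 100);
    std::thread t1(workerLoop, s1, buildGraphExec(s1, d1, n), 100);
    t0.join();
    t1.join();
    return 0;
}
```

Each thread owns its own stream and its own `cudaGraphExec_t`, so no graph object is shared between the threads.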
How stringent are your real-time requirements? I assume this is an application with soft real-time requirements, i.e. it can tolerate occasionally missing a deadline (for example, by dropping a frame if the context is video processing)? The entire CUDA software stack is based on a “best effort” model: there are no latency guarantees of the kind that would be needed for a hard real-time use case.
In terms of application-level latency, I would be most concerned about the run time of the launched kernels across the entire range of GPU hardware you intend to support.
An alternative to your current equal-worker design would be a master-worker design, in which one thread communicates with the GPU and all other threads communicate with that master thread.
The reason this might be worth exploring is that latency-sensitive work, such as coordinating access to a shared resource, may be better targeted at the latency-optimized portion of the overall system, which is generally the CPU (provided you use a fast CPU; my recommendation would be one with a base clock >= 3.5 GHz). But I think it is hard to predict total system behavior knowing nothing about your setup. I would therefore advocate experimentation within the context of your specific use case.
Our application has hard real-time requirements: it is 5G software that needs to meet timing deadlines on the order of a few microseconds.
This is why we are trying to find the best possible solution: less jitter on isolated CPU cores and lower latencies between CPU and GPU.
Our CUDA kernels are many, but each of them runs only for a short time (we use CUDA graphs, of course).
I think we would need the master-worker design you mention, so that only one thread is connected to the GPU. In this design, should all other threads NEVER call the CUDA driver, or just reduce GPU calls as much as possible? We would like to minimize CUDA locks as much as possible; it would be great to have some guidelines on this topic.
As far as I know, CUDA is not supported on top of any RTOS, and as I said, CUDA itself makes no guarantees regarding hard time limits on any particular operation. You are free to do as you wish, of course, but I do not think whatever is being contemplated is going to consistently meet hard real-time requirements.
Correct. One thread, and one thread only, would be responsible for communicating with the GPU. All other threads would need to communicate with this master thread. Any overhead for synchronization and locking would be completely under your control. As I stated, I can provide no assurances that this will lead to superior results compared to your currently envisioned design. I am suggesting it is worth examining this alternative as part of exploratory design work early in a project.
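To make the suggestion concrete, here is a minimal sketch of that structure, assuming a plain mutex/condition-variable queue; in a latency-critical build one would presumably substitute a lock-free ring buffer, pinned threads, and busy-wait polling. The kernel and the graph contents are again placeholders:

```cpp
#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

__global__ void shortKernel(float *d) { d[threadIdx.x] += 1.0f; }

// Host-side request queue; the locking here is ordinary C++ code that the
// application fully controls (and can replace with a lock-free structure).
struct WorkQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> q;                    // request = index of graph to launch
    void push(int id) {
        { std::lock_guard<std::mutex> lk(m); q.push(id); }
        cv.notify_one();
    }
    int pop() {                           // blocks until a request arrives
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });
        int id = q.front(); q.pop();
        return id;
    }
};

int main()
{
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    // One executable graph per logical worker (construction reduced to a
    // single captured kernel for brevity).
    std::vector<cudaGraphExec_t> execs(2);
    for (auto &e : execs) {
        cudaGraph_t g;
        cudaStreamBeginCapture(s, cudaStreamCaptureModeThreadLocal);
        shortKernel<<<1, 256, 0, s>>>(d);
        cudaStreamEndCapture(s, &g);
        cudaGraphInstantiate(&e, g, 0);   // CUDA 12-style signature
        cudaGraphDestroy(g);
    }

    WorkQueue queue;

    // Master thread: the one and only thread that calls into CUDA.
    std::thread master([&] {
        for (;;) {
            int id = queue.pop();
            if (id < 0) break;            // sentinel: shut down
            cudaGraphLaunch(execs[id], s);
            cudaStreamSynchronize(s);
        }
    });

    // Worker threads: never touch the CUDA driver, only enqueue requests.
    std::thread w0([&] { for (int i = 0; i < 100; ++i) queue.push(0); });
    std::thread w1([&] { for (int i = 0; i < 100; ++i) queue.push(1); });

    w0.join();
    w1.join();
    queue.push(-1);                       // stop the master
    master.join();
    return 0;
}
```

The salient property is that every lock in this design lives in host code you wrote and can measure and tune, while CUDA API traffic remains strictly single-threaded.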
Yes, I think there may be some reasons to use a single thread for launching graphs in certain cases. Certain graph usages have explicit warnings about (lack of) thread safety.
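For example, the CUDA programming guide states that graph objects (cudaGraph_t, cudaGraphExec_t) are not internally synchronized and must not be accessed concurrently from multiple threads; API calls touching the same object must be serialized externally. A minimal sketch of such external serialization, assuming a shared executable graph that one thread updates while another launches it (the names and the CUDA 12-style cudaGraphExecUpdate signature are for illustration):

```cpp
#include <cuda_runtime.h>
#include <mutex>

// A shared executable graph (built elsewhere) plus a host mutex guarding it.
std::mutex g_execMutex;
cudaGraphExec_t g_exec;

void launchShared(cudaStream_t s)
{
    std::lock_guard<std::mutex> lk(g_execMutex);   // serialize vs. updates
    cudaGraphLaunch(g_exec, s);
}

void updateShared(cudaGraph_t newGraph)
{
    std::lock_guard<std::mutex> lk(g_execMutex);   // serialize vs. launches
    cudaGraphExecUpdateResultInfo info;
    cudaGraphExecUpdate(g_exec, newGraph, &info);  // CUDA 12-style signature
}
```

With a single launching thread, that external serialization comes for free, which is one argument in favor of the single-thread design.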
Yes, the CUDA runtime may use locks internally for (host) threading purposes. This is not well documented, AFAIK, because I believe it is considered an implementation detail, subject to change in form and function from one CUDA version to the next. Nevertheless, it is observable in some cases, and you can find posts on these forums that mention such observations.
Whether changing an application architecture to issue work to the same GPU from multiple threads vs. a single thread would make a performance difference in your application is something I would not be able to address. You would probably need to run test cases to see if there is any difference. It is not normally something I would recommend a CUDA programmer worry about unless they had already indicated or discovered a specific performance issue.