We have a single-process, multi-threaded application with two threads running on isolated cores, creating/updating/launching CUDA graphs on a single GPU.
Both threads run in parallel and can launch graphs at the same time. My understanding is that the GPU can schedule kernels from both threads concurrently (whereas a multi-process setup would need MPS, or the kernels would be serialized), which is great. Is this the best solution for a real-time, low-latency application, or do you suggest having a single thread launch the graphs? Is there some lock mechanism when multiple CPU threads submit work to the GPU? If having multiple CPU threads forces the CUDA driver to take locks, we can try to change our architecture.
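For reference, here is a minimal, compilable sketch of the pattern we are describing; the kernel, the graph contents, and the iteration counts are placeholders, not our actual workload:

```cpp
#include <cuda_runtime.h>
#include <thread>

__global__ void shortKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;          // stand-in for real work
}

// Capture several short kernels into one graph and instantiate it.
static cudaGraphExec_t buildGraphExec(cudaStream_t s, float *d, int n)
{
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeThreadLocal);
    for (int k = 0; k < 8; ++k)                    // "many short kernels"
        shortKernel<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaStreamEndCapture(s, &graph);
    cudaGraphInstantiate(&exec, graph, 0);         // CUDA 12-style signature
    cudaGraphDestroy(graph);
    return exec;
}

// Each worker owns its stream and its exec; nothing is shared across threads.
static void workerLoop(cudaStream_t s, cudaGraphExec_t exec, int iters)
{
    for (int i = 0; i < iters; ++i) {
        cudaGraphLaunch(exec, s);   // one API call enqueues the whole graph
        cudaStreamSynchronize(s);   // or poll with cudaStreamQuery()
    }
}

int main()
{
    const int n = 1 << 16;
    float *d0, *d1;
    cudaMalloc(&d0, n * sizeof(float));
    cudaMalloc(&d1, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreateWithFlags(&s0, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);

    // Two host threads launching graphs to one GPU in parallel.
    std::thread t0(workerLoop, s0, buildGraphExec(s0, d0, n), 100);
    std::thread t1(workerLoop, s1, buildGraphExec(s1, d1, n), 100);
    t0.join();
    t1.join();
    return 0;
}
```

Each thread owns its own stream and its own `cudaGraphExec_t`, so no graph object is shared between the threads.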
How stringent are your real-time requirements? I assume this is an application with soft real-time requirements, i.e. it can tolerate occasionally missing a deadline (for example, by dropping a frame if the context is video processing)? The entire CUDA software stack is based on a “best effort” model: there are no latency guarantees of the kind that would be needed for a hard real-time use case.
In terms of application-level latency, I would be most concerned about the run time of the launched kernels across the entire range of GPU hardware you intend to support.
An alternative to your current equal-worker design would be a master-worker design, in which one thread communicates with the GPU and all other threads communicate with that master thread.
The reason this might be worth exploring is that latency-sensitive work, such as coordinating access to a shared resource, may be better targeted at the latency-optimized portion of the overall system, which is generally the CPU (provided you use a fast CPU; my recommendation would be one with a base clock >= 3.5 GHz). But I think it is hard to predict total system behavior knowing nothing about your setup. I would therefore advocate experimentation within the context of your specific use case.
Our application has hard real-time requirements: it is 5G software that needs to meet timing deadlines on the order of a few microseconds.
This is why we are trying to find the best possible solution: less jitter on isolated CPU cores and lower latencies between CPU and GPU.
Our CUDA kernels are many, but each of them runs only for a short time (we use CUDA graphs, of course).
I think we would need the master-worker design you mention, so that only one thread is connected to the GPU. In this design, should all other threads NEVER call the CUDA driver, or just reduce GPU calls as much as possible? We would like to minimize CUDA locks as much as possible; it would be great to have some guidelines on this topic.
As far as I know, CUDA is not supported on top of any RTOS, and as I said, CUDA itself makes no guarantees regarding hard time limits on any particular operation. You are free to do as you wish, of course, but I do not think whatever is being contemplated is going to consistently meet hard real-time requirements.
Correct. One thread, and one thread only, would be responsible for communicating with the GPU. All other threads would need to communicate with this master thread. Any overhead for synchronization and locking would be completely under your control. As I stated, I can provide no assurances that this will lead to superior results compared to your currently envisioned design. I am suggesting it is worth examining this alternative as part of exploratory design work early in a project.
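To make the suggestion concrete, here is a minimal sketch of that structure, assuming a plain mutex/condition-variable queue; in a latency-critical build one would presumably substitute a lock-free ring buffer, pinned threads, and busy-wait polling. The kernel and the graph contents are again placeholders:

```cpp
#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

__global__ void shortKernel(float *d) { d[threadIdx.x] += 1.0f; }

// Host-side request queue; the locking here is ordinary C++ code that the
// application fully controls (and can replace with a lock-free structure).
struct WorkQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> q;                    // request = index of graph to launch
    void push(int id) {
        { std::lock_guard<std::mutex> lk(m); q.push(id); }
        cv.notify_one();
    }
    int pop() {                           // blocks until a request arrives
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });
        int id = q.front(); q.pop();
        return id;
    }
};

int main()
{
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    // One executable graph per logical worker (construction reduced to a
    // single captured kernel for brevity).
    std::vector<cudaGraphExec_t> execs(2);
    for (auto &e : execs) {
        cudaGraph_t g;
        cudaStreamBeginCapture(s, cudaStreamCaptureModeThreadLocal);
        shortKernel<<<1, 256, 0, s>>>(d);
        cudaStreamEndCapture(s, &g);
        cudaGraphInstantiate(&e, g, 0);   // CUDA 12-style signature
        cudaGraphDestroy(g);
    }

    WorkQueue queue;

    // Master thread: the one and only thread that calls into CUDA.
    std::thread master([&] {
        for (;;) {
            int id = queue.pop();
            if (id < 0) break;            // sentinel: shut down
            cudaGraphLaunch(execs[id], s);
            cudaStreamSynchronize(s);
        }
    });

    // Worker threads: never touch the CUDA driver, only enqueue requests.
    std::thread w0([&] { for (int i = 0; i < 100; ++i) queue.push(0); });
    std::thread w1([&] { for (int i = 0; i < 100; ++i) queue.push(1); });

    w0.join();
    w1.join();
    queue.push(-1);                       // stop the master
    master.join();
    return 0;
}
```

The salient property is that every lock in this design lives in host code you wrote and can measure and tune, while CUDA API traffic remains strictly single-threaded.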
Yes, I think there may be some reasons to use a single thread for launching graphs in certain cases. Certain graph usages have explicit warnings about (lack of) thread safety.
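For example, the CUDA programming guide states that graph objects (cudaGraph_t, cudaGraphExec_t) are not internally synchronized and must not be accessed concurrently from multiple threads; API calls touching the same object must be serialized externally. A minimal sketch of such external serialization, assuming a shared executable graph that one thread updates while another launches it (the names and the CUDA 12-style cudaGraphExecUpdate signature are for illustration):

```cpp
#include <cuda_runtime.h>
#include <mutex>

// A shared executable graph (built elsewhere) plus a host mutex guarding it.
std::mutex g_execMutex;
cudaGraphExec_t g_exec;

void launchShared(cudaStream_t s)
{
    std::lock_guard<std::mutex> lk(g_execMutex);   // serialize vs. updates
    cudaGraphLaunch(g_exec, s);
}

void updateShared(cudaGraph_t newGraph)
{
    std::lock_guard<std::mutex> lk(g_execMutex);   // serialize vs. launches
    cudaGraphExecUpdateResultInfo info;
    cudaGraphExecUpdate(g_exec, newGraph, &info);  // CUDA 12-style signature
}
```

With a single launching thread, that external serialization comes for free, which is one argument in favor of the single-thread design.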
Yes, the CUDA runtime may use locks internally for (host) threading purposes. This is not well documented, AFAIK, because I believe it is considered an implementation detail, subject to change in form and function from one CUDA version to the next. Nevertheless, it is observable in some cases, and you can find posts on these forums that mention such observations.
Whether changing an application architecture to issue work to the same GPU from multiple threads vs. a single thread would make a performance difference in your application is something I would not be able to address. You would probably need to run test cases to see if there is any difference. It is not normally something I would recommend a CUDA programmer worry about unless they had already indicated or discovered a specific performance issue.