Are persistent kernels supported (now and in the future)?

Hi,
For a soft real-time application I would like to use a persistent kernel. I know I have to be careful not to stall a complete SM with this, but I think that, done right, it will make low-latency soft real-time kernels possible.

Now I have heard that there is a time limit on kernel execution, but I cannot find it in the CUDA documentation.
I also heard someone at GTC 2024 ask an NVIDIA engineer whether persistent kernels will remain supported in future versions of CUDA.

So my question is:
Can we use persistent kernels? And will that still be possible in newer versions?


It’s reported in deviceQuery, which means it is a queryable property of the device you are running on. For GPUs in WDDM mode, I believe it will typically (currently) be reported as runtime limited. You’ll note the property is listed as deprecated. I believe the reason for this deprecation is that more recent GPUs (Pascal and newer) have the ability to switch workloads (referred to as preemption). A specific reason given for preemption is to “harmonize” with GUIs. Therefore, in some circumstances, even under WDDM or a Linux GUI, it’s possible that kernels are no longer “runtime limited”.

People have been using persistent kernels for quite a while. I never heard that they were “not supported”. There is nothing (AFAIK) in the CUDA programming model that says that a kernel must eventually exit, nor anything in C++ that I am aware of (CUDA claims compliance with a particular C++ standard, subject to various limitations and restrictions, none of which say anything about kernel duration).
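For concreteness, here is a minimal sketch of what a persistent kernel looks like (the flag-based shutdown and all names are just my illustration, not a prescribed pattern; it assumes a 64-bit platform with unified addressing, so the mapped host pointers can be used on the device directly):

```cpp
// Minimal persistent-kernel sketch (illustrative only): a single block
// spins until the host sets a flag in mapped (zero-copy) host memory.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void persistentKernel(volatile int* quit, volatile int* progress)
{
    __shared__ int done;
    int count = 0;
    for (;;) {
        if (threadIdx.x == 0) done = *quit;  // one thread samples the flag
        __syncthreads();                     // all threads see the same decision
        if (done) break;
        // ... poll for work and process it here ...
        ++count;
        if (threadIdx.x == 0) *progress = count;  // visible to the host
        __syncthreads();                          // before 'done' is overwritten
    }
}

int main()
{
    int *quit, *progress;
    // With unified addressing, cudaHostAllocMapped memory is automatically
    // mapped into the device address space.
    cudaHostAlloc(&quit,     sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc(&progress, sizeof(int), cudaHostAllocMapped);
    *quit = 0; *progress = 0;

    persistentKernel<<<1, 128>>>(quit, progress);  // runs "forever"

    // ... host feeds work, does other things ...
    *quit = 1;                                     // request shutdown
    cudaDeviceSynchronize();
    printf("kernel loop iterations: %d\n", *progress);
    cudaFreeHost(quit);
    cudaFreeHost(progress);
    return 0;
}
```

The point is simply that the kernel loops until told otherwise; all the real questions are about what happens around it (timeouts, preemption, occupancy).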

Speaking personally, I wouldn’t try to get a persistent kernel working under WDDM, for a number of reasons: (1) in some settings you may still run into a kernel timeout; (2) the preemption mechanism is going to put a crater in the CUDA processing, which means that fine-grained continuous processing is still not possible; there will be gaps where the GPU is busy doing something else. There are lots of caveats to using persistent kernels, and you can find discussions of them in various forum postings and GTC presentations.

Regarding the future, you’ll have to draw your own conclusions. It’s not my role to share future plans, and AFAIK, NVIDIA generally doesn’t use these forums to make any sorts of guarantees about the future.


On a Windows platform, the deviceQuery sample app of CUDA 12.3 reports a GPU under control of WDDM as Run time limit on kernels: Yes, based on kernelExecTimeoutEnabled of cudaDeviceProp. A GPU under control of the TCC driver instead shows Run time limit on kernels: No.

There is a separate device property in cudaDeviceProp called computePreemptionSupported, and this seems to be always true on Pascal and later architectures. The deviceQuery sample app displays this as Device Supports Compute Preemption: Yes.
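Both fields can also be queried directly from application code, the same way deviceQuery does it; a minimal sketch:

```cpp
// Query the runtime-limit and compute-preemption properties for each GPU
// (the same cudaDeviceProp fields that deviceQuery prints).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Run time limit on kernels:          %s\n",
               prop.kernelExecTimeoutEnabled ? "Yes" : "No");
        printf("  Device supports Compute Preemption: %s\n",
               prop.computePreemptionSupported ? "Yes" : "No");
    }
    return 0;
}
```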

But how does a company like GPUAudio get their audio system working? If a kernel can suddenly be preempted because of things like a GUI, they could never get an audio result without glitches.
It also seems strange to me that a soft real-time system could be interrupted by a GUI. Is there a priority system so that we can set our priority higher than the GUI’s?

I am not familiar with GPUAudio. It seems to me that a GUI task pre-empting a compute kernel does not automatically result in an audio glitch. Glitches would be likely if such interruptions are too lengthy or too frequent, or both.

Not every GPU services a GUI. This is something users can control provided they have more than one GPU in the system. For example, on a Windows platform, one can select a professional GPU that is supported by the TCC driver, and therefore is never accessed by the GUI.

Unless advanced visualization is required, one can use a low-end cheap GPU to service the GUI, and a high-end professional GPU under control of the TCC driver to service CUDA compute kernels, then block CUDA from using the “GUI GPU” via CUDA_VISIBLE_DEVICES. I configured Windows-based workstations in exactly this way for many years.
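If one wants to enforce this from inside the application, rather than relying only on CUDA_VISIBLE_DEVICES, cudaDeviceProp also reports whether a device runs under the TCC driver, so a program can restrict itself to such devices. A sketch (the selection policy shown is just an illustration):

```cpp
// Pick the first GPU running under the TCC driver (Windows), i.e. a GPU
// that the GUI cannot touch. Falls back to device 0 if none is found.
#include <cstdio>
#include <cuda_runtime.h>

int pickComputeDevice()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        if (prop.tccDriver) {  // 1 if the device uses the TCC driver
            printf("Using TCC device %d: %s\n", dev, prop.name);
            return dev;
        }
    }
    printf("No TCC device found, falling back to device 0\n");
    return 0;
}

int main()
{
    cudaSetDevice(pickComputeDevice());
    // ... launch compute kernels on the selected device ...
    return 0;
}
```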

When a single GPU is used for both GUI and compute kernels, it becomes a game of chance. The higher the performance of the GPU (and in some cases, the higher the performance of the CPU whose task it is to “feed” the GPU), the lower the chance of missing a deadline in a GPU-accelerated soft real-time application. If the probability of a glitch is low enough, such an application will appear to always work as desired, even though the absence of glitches is not guaranteed and cannot be guaranteed.

I do not know what kind of GPU-accelerated system / application you are contemplating, but I would encourage you to experiment to get a better feel for the likelihood of glitches under different load scenarios. I think this will yield more usable insights than the gedankenexperiments you are going through now.

Yes we will do some experiments soon.

One question though: I would like it to be easy for customers to use our system. An extra standard GPU is an option, but not professional cards. Is it possible to free this extra GPU from other operating system tasks (like the GUI)?
If the customer does not connect a monitor to that card, will that be sufficient? Can our software claim the extra GPU card, in a sense?

To my knowledge, and generally speaking, no. There are diverse configuration options for multiple GPUs under diverse operating systems. I am not able to enumerate them or provide a comprehensive overview. Maybe someone else can.

I cannot perceive any additional information being conveyed through this statement. It highlights a general problem with questions in this forum in that some askers are looking for specific, personalized advice on projects that they are willing to divulge almost nothing about. The better approach for such scenarios would likely be (IMHO) to engage a consultant with the right background at $200 an hour or whatever the going rate is right now.

Well, it is not that I do not want to tell you what our plans are. I just don’t want to make the questions more complex than necessary, and I try to get to the point.
And we are in a kind of investigation phase. So plans and specs may change.

We are trying to find out if we can add GPU support to an audio plugin that we plan to make. We would like a low-latency audio plugin. Our minimum sample block size is about 48 samples. At a 48 kHz sample rate this corresponds to 1 ms.
We can overlap transfers and processing. But both transfer and processing need to be done within this 1 ms.
So far we have figured out that a continuously running kernel and zero-copy (mapped) memory can result in this kind of low latency.
But a GUI task that preempts our kernels would definitely make this impossible.
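The zero-copy part of what we have in mind looks roughly like this (just a sketch; the buffer names and the block size are placeholders, and it assumes unified addressing so the host pointers can be passed to the kernel directly):

```cpp
// Zero-copy (mapped) host buffers for one small audio block: the kernel
// reads and writes this host memory directly, so no per-block cudaMemcpy
// is needed.
#include <cuda_runtime.h>

struct AudioBuffers {
    float* in;    // host pointers; with unified addressing they are also
    float* out;   // valid on the device
};

AudioBuffers allocAudioBuffers(size_t blockSamples)   // e.g. 48
{
    AudioBuffers buf{};
    cudaHostAlloc(&buf.in,  blockSamples * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&buf.out, blockSamples * sizeof(float), cudaHostAllocMapped);
    // Without unified addressing, fetch explicit device pointers instead:
    //   float* dIn;  cudaHostGetDevicePointer(&dIn,  buf.in,  0);
    //   float* dOut; cudaHostGetDevicePointer(&dOut, buf.out, 0);
    return buf;
}
```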

Last I checked, NVIDIA does not produce any GUI. You can pull up the documentation of the operating system(s) of your choice and research what methods of controlling GUI behavior they offer. One relevant search term would be “watchdog timer”.

I already pointed out one way in which GUI interference can be avoided altogether on Windows. If you are looking at Linux instead: running a system without any GUI such as X is certainly possible and I have used such systems for CUDA-accelerated applications more than once.

Regarding the first conclusion (that a continuously running kernel plus zero-copy memory can deliver this latency): I’m not sure how you reached that conclusion.

Regarding the second conclusion (that a preempting GUI task would definitely make this impossible): I’m not sure how you reached that conclusion.

If you reached the first conclusion by testing to some degree, then what environment did you test in? Was it on a Windows WDDM GPU? If so, then you have already proven (under some conditions) that the second conclusion is false, or at least not supported in all cases.

It’s almost certain that any sort of interactive system on Windows (especially one involving sharing a WDDM GPU) will require some testing within some fences/boundaries, in order to have a sense that things may be generally workable (or not). I think you can probably take almost any application that runs on Windows and demonstrate a machine configuration or setup that appears to make that application run miserably. Heck, the Windows GUI itself can pretty much fall over when it gets starved due to excessive application resource usage. It’s the nature of the beast: an application platform that you can load almost any software on, that runs on a huge range of underlying hardware, without hard real-time constructs in the OS. Windows is, in practice, a cooperative multi-tasking environment, and to expect that a rogue actor can be tolerated without any impact to QoS for other users stretches plausibility unbearably, in my view.

I was doing some streaming video work (OBS) recently, and some of our efforts were disrupted by Chrome, of all things. Chrome is a pig. (Probably OBS is a pig, too.) And so we learned: “don’t run Chrome, at all, while we are streaming”. Not because Chrome is always bad, but because sometimes, under some circumstances, Chrome “misbehaves”, according to my definition. You might have to instruct your users, to some degree, about what else can be done in this environment alongside your application, whether you use a GPU or not. Or they will find out on their own.

Please don’t construe anything I say to mean “you cannot possibly do what you want” unless I say that in so many words. You ask questions; I try to help by telling you where the landmines are. With some effort, you may be able to avoid those landmines, knowing where they are.


Well, I did ask questions on forums and Discord, and I found other threads on this forum, blogs, and articles, from which I concluded that what we want to do can probably be done if (and only if) we can run some kernel persistently.

So persistence is my main concern.

I hope NVIDIA will release some functionality that supports this kind of soft real-time system. But I guess this is wishful thinking.

But thanks a lot for telling me about the landmines. I hope to start testing soon.

I don’t understand the specifics of the run time limit on kernels. If it applies only to the time a kernel is resident in an SM, then a workaround may be to have your kernel tail-launch itself indefinitely.

It seems like that approach would be easy to test.

On the other hand, the limit might apply to the host launch and all of its effects on the stream. I don’t know.
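Mechanically, the tail-launch idea would look something like the sketch below (untested; it requires relocatable device code, nvcc -rdc=true, and a CUDA 12+ toolkit for the cudaStreamTailLaunch named stream). Whether the chain of launches actually avoids the runtime limit is exactly the open question above.

```cpp
// Untested sketch of a kernel that re-launches itself via dynamic
// parallelism. Compile with: nvcc -rdc=true ... (CUDA 12+ for
// cudaStreamTailLaunch).
#include <cuda_runtime.h>

__global__ void worker(volatile int* quit)
{
    // ... process one batch of work ...

    // Tail-launch the next generation after this grid finishes, unless
    // the host has asked us to stop. Only one thread issues the launch.
    if (blockIdx.x == 0 && threadIdx.x == 0 && *quit == 0) {
        worker<<<gridDim.x, blockDim.x, 0, cudaStreamTailLaunch>>>(quit);
    }
}

int main()
{
    int* quit = nullptr;
    cudaHostAlloc(&quit, sizeof(int), cudaHostAllocMapped);
    *quit = 0;

    worker<<<1, 128>>>(quit);   // first generation launched from the host

    // ... run for as long as needed ...
    *quit = 1;                  // let the chain of launches end
    cudaDeviceSynchronize();
    cudaFreeHost(quit);
    return 0;
}
```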