Would this mechanism work for real-time audio processing?

I need low-latency, real-time processing of audio data. The kernels need to work on a new block of audio samples every 1 ms! Therefore I would like to launch a block of kernels once and then make them wait for new data to process.
The host should then write new data to memory shared with the GPU, and the kernels should detect this and start processing it. When they are done, the host should notice this, and the kernels should go back to waiting for new data.

To make kernels wait for new data I was thinking of polling some address. When the host writes some boolean flag, the kernels should detect this. Maybe I can use atomics for this. When a kernel reads global memory it will stall for a few hundred cycles so other kernels can go in between. I think that is a reasonable polling mechanism. But is it possible?

Can I somehow make sure a kernel actually tries to read the global memory each time and not the cache (during the polling)?

I do not want to use command queues with memory transfers and kernel invocations because that is probably not real-time safe.

First of all, CUDA does not provide any strong real-time guarantees. If you need them, CUDA may not be suitable for you.

The concept you are looking for is called persistent kernels. This may help you find other resources.
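To avoid reading stale data from cache while polling, you can access the flag through a volatile pointer, which makes the compiler issue an actual memory read on every access instead of reusing an earlier value.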
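To make the polling idea concrete, here is a minimal sketch of a persistent kernel that waits on a flag in mapped pinned host memory. Everything named here (`Mailbox`, `persistent_gain`, `run_rounds`, `kBlockSize`, `kQuit`, the sequence-number handshake, and the trivial gain "processing") is hypothetical and chosen only for illustration. It assumes a single thread block, a GPU that supports mapped pinned memory, and an x86-style host where a volatile store after a release fence is good enough to publish the data. It shows the handshake only and says nothing about latency guarantees.

```cuda
#include <atomic>
#include <cuda_runtime.h>

constexpr int kBlockSize = 64;  // samples per audio block (hypothetical size)
constexpr int kQuit = -1;       // hypothetical sentinel: tells the kernel to exit

// Lives in mapped pinned host memory, visible to both host and device.
struct Mailbox {
    int request;                // host writes the sequence number of new data
    int response;               // device echoes the number once it is done
    float samples[kBlockSize];  // processed in place
};

// Launched once; loops until the host writes kQuit into request.
__global__ void persistent_gain(volatile Mailbox* mb, float gain) {
    __shared__ int seq;
    int last = 0;  // last sequence number this block has handled
    for (;;) {
        if (threadIdx.x == 0) {
            int r;
            do { r = mb->request; } while (r == last);  // volatile read: real load each time
            seq = r;
        }
        __syncthreads();            // publish seq to the whole block
        int cur = seq;
        if (cur == kQuit) return;   // every thread sees the same value and exits
        __threadfence_system();     // order the flag read before the sample reads
        float x = mb->samples[threadIdx.x];
        mb->samples[threadIdx.x] = x * gain;
        __threadfence_system();     // make results visible to the host
        __syncthreads();            // all threads have finished and fenced
        if (threadIdx.x == 0) mb->response = cur;  // signal completion
        last = cur;
    }
}

// Feeds `rounds` blocks through the kernel; returns true if all results match.
// Note: the busy-wait has no timeout; real code would need one.
bool run_rounds(int rounds) {
    Mailbox* h_mb = nullptr;
    if (cudaHostAlloc((void**)&h_mb, sizeof(Mailbox), cudaHostAllocMapped) != cudaSuccess)
        return false;
    Mailbox* d_mb = nullptr;
    if (cudaHostGetDevicePointer((void**)&d_mb, h_mb, 0) != cudaSuccess) {
        cudaFreeHost(h_mb);
        return false;
    }
    volatile Mailbox* mb = h_mb;
    mb->request = 0;
    mb->response = 0;

    persistent_gain<<<1, kBlockSize>>>(d_mb, 0.5f);
    if (cudaGetLastError() != cudaSuccess) {
        cudaFreeHost(h_mb);
        return false;
    }

    bool ok = true;
    for (int round = 1; ok && round <= rounds; ++round) {
        for (int i = 0; i < kBlockSize; ++i) mb->samples[i] = float(round + i);
        std::atomic_thread_fence(std::memory_order_release);  // data before flag
        mb->request = round;
        while (mb->response != round) { }                     // host-side polling
        std::atomic_thread_fence(std::memory_order_acquire);  // flag before data
        for (int i = 0; i < kBlockSize; ++i)
            if (mb->samples[i] != 0.5f * float(round + i)) ok = false;
    }
    mb->request = kQuit;  // let the kernel terminate
    ok = (cudaDeviceSynchronize() == cudaSuccess) && ok;
    cudaFreeHost(h_mb);
    return ok;
}
```

Calling `run_rounds(4)` from a `main` compiled with `nvcc` should push four blocks through the same kernel launch. Only thread 0 polls, so the flag generates one stream of memory reads rather than one per thread; the two `__threadfence_system()` calls are what order the flag against the sample data across the host/device boundary.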

Completely agree.

In addition, I don’t think there is any actual guarantee that a persistent kernel remains “persistent”. Specifically, my understanding of how things work in an environment where you are also using the graphics engine is that the compute kernel may get pre-empted (moved out, so to speak) for execution of graphics workload, from time to time. There is a work scheduler on the GPU that will attempt to make forward progress both on the graphics task as well as the compute task, but this will almost certainly introduce further variability in the “response time” to new data being delivered to the compute kernel, as compared to the case where the persistent kernel is running on an otherwise idle GPU. AFAIK there is no detailed specification for this scheduler behavior.

The GPU control panels on Windows or Linux may offer some high-level settings for ordering or prioritizing work, but this stuff tends to change from time to time, and I have not looked at it recently.