I have built a complex audio synthesiser that does its major processing on the GPU using OpenCL.
Audio programs typically run with buffers of 1024 audio samples at 44100 samples/second, i.e. about 23 ms (1024 / 44100 ≈ 23.2 ms) is available to process each 1024-sample audio buffer in real time.
The synth features:
- Several small float/int memory updates to the GPU with new data (insignificant).
- A Kernel to process the audio block.
- A few small reads back from the GPU afterwards (i.e. each of the six workgroups' 1024-sample outputs).
The Kernel typically runs with up to 6 workgroups of 256 workitems (threads). It contains barrier calls so that each workgroup stays synchronised within itself, but it still processes reasonably quickly. For each buffer I enqueue everything and then call finish, as I have found that to be faster.
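To make the per-buffer sequence concrete, here is a minimal sketch of the host-side calls (the names process_block, paramsBuf, outputBuf, audioKernel and hostOutput are placeholders rather than my actual code, and error checking is omitted):

```c
#include <CL/cl.h>

// Hypothetical sketch of one synth instance's per-buffer GPU work.
void process_block(cl_command_queue queue, cl_kernel audioKernel,
                   cl_mem paramsBuf, cl_mem outputBuf,
                   const float *params, size_t paramsBytes,
                   float *hostOutput)
{
    const size_t local_size  = 256;            // work-items per work-group
    const size_t global_size = 6 * local_size; // up to 6 work-groups

    // 1. Small float/int parameter updates (non-blocking).
    clEnqueueWriteBuffer(queue, paramsBuf, CL_FALSE, 0,
                         paramsBytes, params, 0, NULL, NULL);

    // 2. One kernel launch to process the whole audio block.
    clEnqueueNDRangeKernel(queue, audioKernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);

    // 3. Read back each work-group's 1024-sample output (non-blocking).
    clEnqueueReadBuffer(queue, outputBuf, CL_FALSE, 0,
                        6 * 1024 * sizeof(float), hostOutput, 0, NULL, NULL);

    // 4. Block until the whole enqueued sequence has completed.
    clFinish(queue);
}
```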
The full OpenCL sequence takes about 4-5 ms per audio block per synth (i.e. well within the 23 ms budget). That time appears to be spent almost entirely in the kernel; the memory transfers seem insignificant. This is great, but I am running into bottlenecks as soon as I have multiple synths running in parallel, and I am not sure why.
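For what it's worth, the kernel-versus-transfer split could also be double-checked with OpenCL event profiling; a minimal sketch, assuming the queue was created with CL_QUEUE_PROFILING_ENABLE (this is not my actual measurement code):

```c
#include <CL/cl.h>
#include <stdio.h>

// Sketch: time one kernel launch on its own with event profiling.
void time_kernel(cl_command_queue queue, cl_kernel audioKernel)
{
    const size_t local_size  = 256;
    const size_t global_size = 6 * local_size;
    cl_event evt;

    clEnqueueNDRangeKernel(queue, audioKernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    printf("kernel time: %.3f ms\n", (end - start) * 1e-6);

    clReleaseEvent(evt);
}
```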
The synthesiser is used in a typical audio recording program such as Cubase or Reaper, where the user may set up many synthesiser tracks. Each track is a separate instance of the synthesiser, so each has its own GPU command queue, and the host program typically runs them on different CPU threads.
What is interesting to me is that I have now tested two NVIDIA GPUs, and the results do not seem proportional to the hardware, so I am not sure where the bottleneck is.
According to the CUDA C++ Programming Guide and the CUDA GPUs - Compute Capability page on NVIDIA Developer (some of these figures can also be queried at runtime; see the sketch after the list):
- RTX 3070 - 5888 cores, 1500 MHz, 46 streaming multiprocessors, compute capability 8.6 = 128 concurrent kernels.
- RTX 4090 - 16384 cores, 2235 MHz, 128 streaming multiprocessors, compute capability 8.9 = 128 concurrent kernels.
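A minimal OpenCL sketch for querying the compute-unit count and clock at runtime (assuming a cl_device_id has already been obtained; on NVIDIA the compute-unit count should correspond to the SM count):

```c
#include <CL/cl.h>
#include <stdio.h>

// Sketch: report compute-unit (SM) count and max clock for a device.
void print_device_limits(cl_device_id device)
{
    cl_uint computeUnits = 0, clockMHz = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(computeUnits), &computeUnits, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY,
                    sizeof(clockMHz), &clockMHz, NULL);
    printf("compute units (SMs): %u, max clock: %u MHz\n",
           computeUnits, clockMHz);
}
```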
Roughly (from preliminary real-world testing) I can run around 2 instances of the synth on the RTX 3070, and up to around 5 on the RTX 4090, before playback starts stuttering.
I am not sure why this is happening and would appreciate any speculation about where the roadblock is. Perhaps I don't understand how NVIDIA GPUs work.
If both can handle up to 128 concurrent kernels, and each full kernel plus memory commands takes only 4-5 ms in total, then there should certainly be no lag on either card from running just 2-5 kernels concurrently.
Additionally, with at most 4-5 instances of 1536 threads per kernel, I am using at most 5 × 1536 = 7680 cores on the 4090 before I start stuttering. With 16384 cores available, I am certainly not running out of cores yet.
I feel like the main benefit I got going from the RTX 3070 to the 4090 is just the clock-speed boost, plus maybe some other minor hardware improvements. I think I am hitting some kind of early bottleneck on both (and certainly on the 4090 given its power), but I'm not sure what. I feel like I should be getting at least double the synthesiser instance count on the 4090 before stuttering.
Just based on what I have described, is there any obvious barrier I could be running into on the NVIDIA GPUs? I suspect I am not getting truly parallel execution of the synth instances beyond the first 2 instances on the 3070 and 4-5 on the 4090, and that is why playback stutters.
If they were being executed in parallel, with each full queue taking only 4-5 ms to complete per audio buffer, there should be no GPU-related reason to stutter or lag. But that is not the result.
As far as I can tell, I am not running out of cores or hitting the concurrent-kernel limit, so what limit or bottleneck might it be?
Any thoughts, speculation, or ideas? Thanks for any help or clues.