What might be my primary GPU processing bottleneck in this scenario? (Why is the RTX 4090 not performing well?)

I have built a complex audio synthesiser that does its major processing on the GPU using OpenCL.

Audio programs typically run with buffers of 1024 audio samples at 44100 samples/second, i.e. about 23 ms is available to process each 1024-sample buffer in real time.
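That budget follows directly from the buffer size and sample rate:

$$\frac{1024\ \text{samples}}{44100\ \text{samples/s}} \approx 23.2\ \text{ms per buffer}$$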

The synth features:

  1. Several small float/int memory updates to the GPU with new data (insignificant).
  2. A Kernel to process the audio block.
  3. A few small reads from the GPU afterwards (i.e. each of the six workgroups’ 1024-sample outputs).

The kernel is typically run with up to 6 workgroups of 256 workitems (threads) each. The kernel contains barrier calls so that each workgroup stays synchronized within itself, but it still processes reasonably quickly. I queue everything for each buffer and then call finish, as I have found that to be faster.
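Roughly, and with illustrative names rather than my real code, the per-buffer host sequence looks like this:

```cpp
// Simplified per-buffer sequence (names are illustrative): small upload,
// one kernel launch, small readback, then a single blocking finish.

// Non-blocking upload of the per-buffer control data (a few floats/ints).
clEnqueueWriteBuffer(queue, paramsBuf, CL_FALSE, 0,
                     sizeof(params), &params, 0, NULL, NULL);

// 6 workgroups x 256 workitems = 1536 workitems in total.
size_t global = 6 * 256;
size_t local  = 256;
clEnqueueNDRangeKernel(queue, synthKernel, 1, NULL, &global, &local,
                       0, NULL, NULL);

// Non-blocking read of each workgroup's 1024-sample output block.
clEnqueueReadBuffer(queue, outputBuf, CL_FALSE, 0,
                    6 * 1024 * sizeof(float), hostOutput, 0, NULL, NULL);

// Block once per audio buffer until everything queued above has completed.
clFinish(queue);
```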

It takes about 4-5 ms to run the full OpenCL sequence on each audio block per synth (i.e. well within the 23 ms budget). This time appears to be spent almost entirely in the kernel, as the memory transfers seem insignificant. This is great, but I am running into bottlenecks when I have multiple synths running in parallel, and I am not sure why.

The synthesiser is used in a typical audio recording program such as Cubase or Reaper, where the user may set up many different synthesiser tracks. Each track is a separate instance of the synthesiser, with its own GPU command queue, and the host program typically runs them on different CPU threads.
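For concreteness (a sketch only, not my actual class layout), each track instance ends up owning something like:

```cpp
// Hypothetical per-track state; every instance has its own queue, kernel
// object, and buffers, and is driven from whichever CPU thread the host
// audio application assigns to that track.
struct SynthInstance {
    cl_command_queue queue;      // one in-order command queue per instance
    cl_kernel        kernel;     // this instance's copy of the audio kernel
    cl_mem           paramsBuf;  // small per-buffer control data
    cl_mem           outputBuf;  // 6 workgroups x 1024 samples of output
};
```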

What is interesting to me is that I have now tested two NVIDIA GPUs, and I am not sure the results are proportional or where the bottleneck is.

According to the CUDA C++ Programming Guide and CUDA GPUs - Compute Capability | NVIDIA Developer:

  • RTX 3070 - 5888 cores, 1500 MHz, ??? streaming multiprocessors, compute capability 8.6 = 128 concurrent kernels.
  • RTX 4090 - 16384 cores, 2235 MHz, 128 streaming multiprocessors, compute capability 8.9 = 128 concurrent kernels.

Roughly (from preliminary real-world testing), I am getting around 2 instances of the synth on the RTX 3070, and up to around 5 on the RTX 4090, before playback starts stuttering.

I am not sure why this is happening, and would appreciate any speculation on where the roadblock might be. Perhaps I don’t understand how NVIDIA GPUs work.

If both can handle up to 128 concurrent kernels, and each full kernel + memory command sequence takes only 4-5 ms, then there should certainly be no lag on either card from running just 2-5+ kernels concurrently.

Additionally, with at most 4-5 instances of 1536 threads per kernel, I am using at most 7680 cores on the 4090 before I start stuttering. With 16384 cores available, I am certainly not running out of cores yet.

I feel like the main benefit I got going from the RTX 3070 to the 4090 is just the clock speed boost, plus maybe some other minor hardware optimizations. I think I am hitting some type of early bottleneck on both cards (and certainly on the 4090, given its power), but I’m not sure what. I feel like I should be getting double the synthesiser instance count before stuttering on the 4090.

Just based on what I have said, is there any obvious barrier I could be running into on the NVIDIA GPUs? I suspect I am not getting truly parallel execution of the synth instances past the first 2 instances on the 3070 and 4-5 instances on the 4090, and that is why I am stuttering.

If they were being executed in parallel, with each full queue only taking 4-5 ms to complete per audio buffer, there should be no GPU-related reason to stutter or lag. But that is not the result.

I am not running out of cores or hitting parallel kernel limits. So what limit or bottleneck might it be?

Any thoughts or speculation, ideas? Thanks for any help or clues.

According to the TechPowerUp database, an RTX 4090 provides about 2.4x the performance of an RTX 3070, so 2 instances on the RTX 3070 vs 5 on the RTX 4090 (2 x 2.4 ≈ 5) seems plausible.

Does stuttering imply violating a latency constraint? GPUs are designed for high throughput, not low latency, so latency-constrained use cases may not be a good match for the GPU.

Are you able to use NVIDIA’s profilers (Nsight Systems, Nsight Compute) with OpenCL? My knowledge of OpenCL is that it exists. These profilers should allow you to identify the bottleneck(s). Any particular reason you chose OpenCL instead of CUDA for developing this code?
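If Nsight does not give you OpenCL coverage, OpenCL itself has event-based profiling that should at least separate the time a command spends waiting in the queue from the time it spends executing. A rough sketch based on the spec (untested; error checking omitted; variable names made up):

```cpp
// Create the command queue with profiling enabled (assumes an existing
// context and device; clCreateCommandQueueWithProperties also works).
cl_int err;
cl_command_queue queue =
    clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

// Attach an event to the kernel launch and wait for completion.
cl_event kev;
clEnqueueNDRangeKernel(queue, synthKernel, 1, NULL, &global, &local,
                       0, NULL, &kev);
clFinish(queue);

// Device timestamps in nanoseconds: queued, submitted, started, ended.
cl_ulong tq, ts, t0, t1;
clGetEventProfilingInfo(kev, CL_PROFILING_COMMAND_QUEUED, sizeof(tq), &tq, NULL);
clGetEventProfilingInfo(kev, CL_PROFILING_COMMAND_SUBMIT, sizeof(ts), &ts, NULL);
clGetEventProfilingInfo(kev, CL_PROFILING_COMMAND_START,  sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(kev, CL_PROFILING_COMMAND_END,    sizeof(t1), &t1, NULL);

// With several instances running, a growing queued-to-start gap would point
// at scheduling/serialization rather than kernel execution time itself.
double waitMs = (t0 - tq) * 1e-6;
double execMs = (t1 - t0) * 1e-6;
clReleaseEvent(kev);
```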

Speaking in generalities, any system that involves (1) shared resources and (2) queueing / buffering will show a noticeable increase in latency as throughput approaches the maximum.

One example is highways. U.S. 101 through central San Jose sports 4 lanes per direction. If I recall the numbers correctly, the theoretical throughput of these four lanes is 14,000 vehicles per hour. One could presumably get very close to this by forming a convoy in which every vehicle drives exactly the same speed, with no vehicles entering or exiting the highway. The highest throughput measured by the DOT is a bit under 12,000 vehicles per hour. From personal observation, the average speed at times of maximum achieved throughput is in the 20 mph to 25 mph range. Similar latency effects are seen in network routers, and they could affect this use case involving GPU acceleration.
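To put a number on that intuition (purely as an illustration; the GPU scheduler is not literally this model): in the textbook M/M/1 queueing model, the mean time a job spends in the system grows without bound as utilization approaches 1:

$$W = \frac{1}{\mu - \lambda} = \frac{1/\mu}{1 - \rho}, \qquad \rho = \frac{\lambda}{\mu}$$

where λ is the arrival rate, μ the service rate, and ρ the utilization. At ρ = 0.5 the latency is only 2x the bare service time; at ρ = 0.9 it is already 10x.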

What data is this assessment based on?