I created an application that processes audio data in real time. To keep the latency until you hear the result low, I process in blocks as small as 64 or 128 samples. With 128 samples of stereo audio at 44100 Hz, that means 1024 bytes to transfer about 344 times a second. This amounts to roughly 345 KB per second, which is not much in terms of data, but a lot in terms of the number of transfer calls.
The problem is that the transfers take a lot of time. Moving a single block to the GPU and back to the CPU takes about one millisecond, which in my opinion is far too much.
I’m already doing it the way NVIDIA suggests in their paper, with alloc_host_ptr (pinned host memory) and the like, and I don’t really expect a solution to be found, because I suspect these are hardware limits.
To me it seems that current GPUs aren’t well suited to receiving many small blocks of data; they perform much better with larger blocks. But for real-time audio, that is a no-go.
I was wondering why dedicated DSP solutions for real-time audio don’t have this problem. Products like the UAD or TC PowerCore even work on the slower PCI bus without any problems, yet the faster PCIe bus together with a GPU simply doesn’t perform as expected.
This seems to be a general problem, since even NVIDIA suggests reducing data transfers whenever possible. But why? What is the bottleneck in the signal path?
Hoping for a nice discussion.
PS: GF 9800 GT (PCIe), Core 2 Quad Q6600 @ 2.71 GHz, Asus P5Q, 4 GB RAM