Why is PCIe data transfer so absolutely slow? Discuss please!

Hello,

I created an application that processes audio data in real time. To keep the latency until you can hear the result low, I process in blocks of as few as 64 or 128 samples. For 128 samples of stereo audio at 44100 Hz, one block is 128 samples × 2 channels × 4 bytes per float = 1024 bytes, transferred about 344 times a second (44100 / 128 ≈ 344). That comes to only about 345 KB per second, which is not much in terms of data, but a lot in terms of the number of transfer calls.

The problem is that each transfer takes a lot of time. Moving a single block to the GPU and back to the CPU takes about one millisecond, which is really too much in my opinion.
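Roughly, the measurement looks like this (a simplified sketch rather than my exact code; the context, queue and buffer setup is omitted, and now_ms is just a helper for illustration):

```c
#include <CL/cl.h>
#include <stdio.h>
#include <time.h>

/* Helper: wall-clock time in milliseconds. */
static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* Time one 1024-byte block to the GPU and straight back, using
 * blocking calls so the whole round trip is captured. */
static void time_round_trip(cl_command_queue queue, cl_mem buf, float *block) {
    double t0 = now_ms();
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, 1024, block, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, 1024, block, 0, NULL, NULL);
    printf("round trip: %.3f ms\n", now_ms() - t0);
}
```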

I'm already doing it the way NVIDIA suggests in their paper, with ALLOC_HOST_PTR and so on. And I don't really expect a solution to be found, because I suspect these are hardware limits.
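For reference, the buffer setup follows that pattern; here is a minimal sketch (not my exact code: error handling is stripped and the names are simplified):

```c
#include <CL/cl.h>
#include <string.h>

/* Create a 1024-byte buffer backed by driver-allocated (pinned) host
 * memory, map it, copy one audio block in, and unmap it for the GPU. */
static cl_mem upload_block_pinned(cl_context ctx, cl_command_queue queue,
                                  const float *audio_block) {
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                1024, NULL, &err);
    float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, 1024, 0, NULL, NULL, &err);
    memcpy(p, audio_block, 1024);
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    return buf;
}
```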

It seems to me that current GPUs aren't well suited to receiving many small blocks of data; they perform much better with larger blocks. But for real-time audio, large blocks are a no-go.

I was wondering why dedicated DSP solutions for real-time audio don't have this problem. Products like UAD or TC Powercore work fine even on the slower PCI bus, yet the faster PCIe bus together with a GPU just doesn't perform as expected.

This seems to be a general problem, since even NVIDIA suggests reducing data transfers whenever possible. But WHY? What is the bottleneck in the signal path?

Hoping for a nice discussion.

Regards,

Nils

PS: GF 9800 GT PCIe, Core 2 Quad Q6600 @ 2.71 GHz, Asus P5Q, 4 GB RAM

This gets into a bit of a fun discussion, but frankly, it comes down to how your PCIe bus behaves. PCIe can reach around 8 GB/s, but that figure assumes one long, uninterrupted stream of data. You are doing lots of small individual transfers to and from the GPU, and each transfer pays a fixed latency overhead regardless of its size. If you were to profile your code, you would probably see lots of time in which your GPU is doing nothing.

My master's thesis was on using GPUs to accelerate computational fluid dynamics calculations. What I found was that by minimizing data transfers, I was able to cut the solution time of my algorithm from 10-20 minutes (depending on the case) to 40-80 seconds.

The GeForce graphics cards have only a single memory controller. That means the data gets sent from system RAM to the GPU, the memory controller on the GPU stores it, the GPU processors then do the work and store the results, the memory controller sends the data back to RAM, and the whole process repeats. I believe the new Tesla architecture has two memory controllers (so it can transfer and compute simultaneously), but your algorithm is really trying to use current graphics hardware in a way it wasn't designed for.
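Just to illustrate what "transfer and compute simultaneously" would mean in OpenCL terms, here is a rough double-buffering sketch. It is purely illustrative: the names are made up, event cleanup is omitted, and whether the upload actually overlaps the kernel depends entirely on the hardware and driver.

```c
#include <CL/cl.h>

/* Ping-pong between two device buffers: upload block i+1 on one queue
 * while block i is processed on another. On hardware with a single
 * memory controller the two operations simply serialize anyway. */
static void process_stream(cl_command_queue copy_q, cl_command_queue compute_q,
                           cl_kernel kernel, cl_mem buf[2],
                           float **host_blocks, int n, size_t block_bytes) {
    size_t gws = 256;                       /* illustrative work size */
    cl_event uploaded[2] = {0}, computed[2] = {0};

    /* Prime the pipeline with the first block. */
    clEnqueueWriteBuffer(copy_q, buf[0], CL_FALSE, 0, block_bytes,
                         host_blocks[0], 0, NULL, &uploaded[0]);
    for (int i = 0; i < n; ++i) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < n)
            /* Overwrite buf[nxt] only after the kernel that last
             * read it has finished. */
            clEnqueueWriteBuffer(copy_q, buf[nxt], CL_FALSE, 0, block_bytes,
                                 host_blocks[i + 1], computed[nxt] ? 1 : 0,
                                 computed[nxt] ? &computed[nxt] : NULL,
                                 &uploaded[nxt]);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[cur]);
        clEnqueueNDRangeKernel(compute_q, kernel, 1, NULL, &gws, NULL,
                               1, &uploaded[cur], &computed[cur]);
    }
    clFinish(compute_q);
}
```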

You could probably double the performance of your code by "blocking up" your transfers: group several sample blocks together (say, 16 at a time) and send them in one go, as in the sketch below. I don't know how that will affect the "real-time-ness" of your application, but you could try it out.
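Something like this, with made-up names and a batch of 16 (BATCH, BLOCK_FLOATS and the callback are illustrative, not from your code):

```c
#include <CL/cl.h>
#include <string.h>

#define BATCH        16
#define BLOCK_FLOATS 256   /* 128 samples x 2 channels */

static float staging[BATCH * BLOCK_FLOATS];  /* host-side accumulator */
static int   filled = 0;

/* Collect incoming audio blocks and ship them as one big transfer:
 * one call moving 16 KB instead of 16 calls moving 1 KB each. The
 * price is up to 16 blocks (~46 ms here) of added latency. */
static void on_audio_block(const float *block,
                           cl_command_queue queue, cl_mem dev_buf) {
    memcpy(staging + filled * BLOCK_FLOATS, block,
           BLOCK_FLOATS * sizeof(float));
    if (++filled == BATCH) {
        clEnqueueWriteBuffer(queue, dev_buf, CL_TRUE, 0, sizeof(staging),
                             staging, 0, NULL, NULL);
        filled = 0;
    }
}
```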

If you want the best performance, you'll need to upgrade your hardware (Xeon and i5/i7 chips have much better bandwidth on their PCIe lanes than the Core 2 chips).

I found that GPU Impulse Reverb does use OpenCL for real-time processing. However, this area is not my speciality, so I can't say exactly why dedicated DSPs handle your problem so well, other than that their code is probably optimized to death and they may have multiple memory controllers (one per channel, perhaps). My feeling is that real-time audio signal processing on the GPU is possible, but it requires a bit more tweaking of your algorithm.

In summary: current GPU hardware is limited by the memory controller. NVIDIA and AMD both know about this limitation and are likely working to resolve it (though NVIDIA will probably reserve the fix for their Tesla products for a while, for one simple reason: money).

Now, if you have a multi-core CPU, OpenCL can target that too, and running there would eliminate the transfer bottleneck entirely. As I've told my co-workers several times: the GPU is not the ready-made solution to every computing problem in the world. It has its pros and cons, just like everything else.

Hoping for some more feedback :)

For me there is no way to reduce the transfers any further. I send and receive exactly one block of audio data at a time, always just the amount I need to process, so the listener gets the result back immediately. For a block size of 128 samples, which corresponds to about 3 ms of latency until you hear something (128 / 44100 ≈ 2.9 ms), I send 1024 bytes each way (two channels of 128 four-byte floats), i.e. 2048 bytes for the full round trip.

Performance-wise, a convolution algorithm fits a GPU very well, but the latency is there even when I do nothing on the GPU. That's exactly why I'm posting: for testing, I measured pure data transfers with no processing at all, so the GPU was idle, and when I do add a computation, the latency is nearly unchanged. The round trip itself is the cost.

It sounds like your problem is just hardware, then. You're running up against the limits of a single memory controller handling both sending and receiving data.

You could try using pinned memory (I haven't used it myself, but I hear it can improve transfer speeds), as discussed in the last post here: MAGMA forum

This was also discussed previously: http://forums.nvidia.com/index.php?showtopic=167659

There’s a small paper about this here: http://www.idi.ntnu.no/~elster/master-studs/runejoho/ms-proj-gpgpu-latency-bandwidth.pdf

This is also a good PDF discussing more general applications of GPUs outside of graphics: http://www.cs.utah.edu/~wbsun/kgpu.pdf

I wish we had something more fun to talk about, but we’re talking hardware issues, not software issues, and aside from using pinned memory, there’s not much we can do that is easy and clean :)