jma, are you talking about implementing an audio host, i.e. a sequencer, using CUDA? That is, integrating CUDA at the most intimate level of the audio processing engine in order to get parallel processing across the mixer? Doing this, you're getting into Apple, Steinberg and Ableton's arena; I don't think it's worth trying to compete with a proper sequencer like Cubase, Logic or Live, because you'd need the product to appeal to a broad cross-section of musicians. To do that you need to provide plugins that work in the existing hosts.
However, “Something like a multi channel mixer with basic filtering, fx send and an all singing/all dancing compressor/limiter/de-esser on each channel perhaps even a phaser/flanger/short delay unit for individual coloration”; this sounds like a channel strip from a mixer. And you could do this as a plugin for various hosts on various systems (like Jack, AU on Mac and VST on Mac/PC, see [url=“http://www.uaudio.com/products/software/cs1/index.html”]http://www.uaudio.com/products/software/cs1/index.html[/url] for the kind of thing already done on audio DSP cards). Writing a VST for Mac and PC to implement this wouldn’t be too hard.
Maybe a CUDA plugin-chainer VST could be devised, if everything wasn't combined in a channel strip: one host plugin that routes between multiple CUDA plugins (compressor, EQ etc.) so that audio only goes to/from the card once for the whole chain rather than once per plugin.
The problem is, I fear, that many audio DSP processes are quite difficult to do in parallel when you are taking in a block of samples from the host, processing them and putting them out again (having ruled out working across the channel mixer above). Example: an IIR filter has feedback elements, so each sample has to be processed in order; (I think) the opportunities to block-process samples here are very limited, essentially a single kernel with a loop in it. If I am wrong here, somebody please correct me :) Processes involving decent flange/chorus typically use IIR filter structures to implement interpolation (as linear interpolation discolours the sound due to a high-cut filtering effect). I don't yet know how digital compressors work, but they're very difficult to get sounding good without analogue modelling of the circuits of classic gear.
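To make the feedback problem concrete, here's a minimal sketch in plain Python (names and the filter choice are mine, purely illustrative) of a one-pole IIR low-pass. Each output sample needs the previous output, so the inner loop is inherently serial no matter how wide your device is:

```python
def one_pole_lowpass(x, a):
    """y[n] = (1 - a) * x[n] + a * y[n-1]; a in [0, 1) sets the cutoff."""
    y = []
    prev = 0.0
    for sample in x:  # strictly ordered: y[n] cannot start until y[n-1] exists
        prev = (1.0 - a) * sample + a * prev
        y.append(prev)
    return y

# A unit step settles toward 1.0, but only one sample at a time.
out = one_pole_lowpass([1.0] * 8, 0.5)
```

This is why handing such a filter a block of 512 samples doesn't buy you 512-way parallelism; the dependency chain runs the whole length of the block.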
Any FIR-based approach (linear-phase EQ, reverb impulse-response convolution), however, is a good bet, as there is no feedback and blocks of samples can be processed in parallel.
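Contrast this with a direct-form FIR, sketched the same way (again illustrative names): every output sample is a dot product over past *inputs* only, never past outputs, so all the samples in a block could in principle be computed simultaneously, one GPU thread per output sample:

```python
def fir(x, h):
    """y[n] = sum_k h[k] * x[n - k], with zeros assumed before the start of x."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, coeff in enumerate(h):
            if n - k >= 0:
                acc += coeff * x[n - k]
        y.append(acc)  # depends only on x, never on an earlier y
    return y

# 2-tap moving average of an impulse: the impulse response falls out.
out = fir([1.0, 0.0, 0.0], [0.5, 0.5])
```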
OK, turning to convolution reverb, you were discussing this plugin, which is for Windows VST: [url=“Convolution Reverb for NVidia and ATI GPUs - saving CPU time - Effects Forum - KVR Audio”]Convolution Reverb for NVidia and ATI GPUs - saving CPU time - Effects Forum - KVR Audio[/url]. The main problems with it are that it isn't terribly reliable (at least when I first tried it; some crashes when you overloaded the GPU memory, not a show-stopper), its audio file format support is limited (an easy fix using libsndfile), its GUI isn't something you could sell as it can't modify the sound very much (decent GUIs and supporting features take months to write, as we all know), and it has a massive input/output latency (8192 samples, 185 ms at 44.1 kHz).
Getting low latency without making CUDA eat up the CPU may be the big issue really; everything else is conceptually simple to solve. A decent software convolution reverb has zero input/output latency (using Gardner's well-known non-uniform partition algorithm, to which you refer). Any DSP-card-based solution has to block stuff up, send it to the card, process it and get it back. The smaller the block, the less efficient the transfer, and the less time you have to process the data (this is why I've been asking in other posts for people's bandwidth tests at low block sizes).
Getting the latency down to acceptable levels for musicians (perhaps 256 samples) is a hurdle, but I don't believe it's insurmountable, as I've got some of the way there already. One problem is that as latency goes down you transfer smaller and smaller blocks, so the fixed transfer overhead takes a progressively larger proportion of a total time budget that is itself shrinking with the number of samples per block.
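The shrinking budget is easy to put numbers on. A quick sketch (my own helper name; the per-transfer overhead figure would come from the bandwidth tests mentioned above):

```python
def block_budget_ms(block_size, sample_rate=44100):
    """Real-time budget: you must finish one block before the next arrives."""
    return 1000.0 * block_size / sample_rate

# The plugin's 8192-sample latency works out to roughly 185 ms of budget,
# while a musician-friendly 256 samples leaves under 6 ms for the round trip
# to the card, the kernel launches and the processing itself.
budgets = {n: block_budget_ms(n) for n in (8192, 1024, 256)}
```

Since the PCIe transfer cost per block is closer to fixed than proportional, its share of that budget grows rapidly as the block shrinks.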
So the CUDA resource is being used quite inefficiently; as long as it doesn't affect the CPU, though, the customer doesn't care, and will buy a faster and faster card if they like the end result. Maybe they will increase their latency to get more out of the card, maybe they won't (depending on their personal preference and usage scenario: a live user will want minimal latency, while a user with plugins on an audio track that can be latency-compensated by the host will opt for a big block size).
Now, the merits of using uniform vs non-uniform partitions for convolution reverb. I think using fixed partitions is best. Let's consider what you do with fixed partitions (assume we are given a known number of samples every time and that the length of the impulse response has been padded as necessary).
For fixed blocks: take the input block xn and FFT it into Xn. Multiply Xn by H0, Xn-1 by H1, and so on for every block (you must retain a history of X for as many blocks as there are in H). Sum all of these and take the IFFT to get the output y. Add the head of y to the tail of the previous y (if using overlap/add) and you have an output block. Now, the process of multiplying all past X blocks by every block of H can be done in a one-shot parallel operation; highly efficient. The summation/reduction of all blocks into one output block can be done in log2(num blocks) parallel operations using a standard reduction, which is also fairly efficient.
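That fixed-partition scheme can be sketched in plain Python (illustrative and unoptimised; the function names, the toy recursive FFT and the tiny block size are my own choices, and a real implementation would use cuFFT or similar with the multiply/accumulate as one wide kernel):

```python
import cmath

def fft(a, inverse=False):
    """Toy radix-2 Cooley-Tukey FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    sign = 1 if inverse else -1
    even, odd = fft(a[0::2], inverse), fft(a[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def ifft(a):
    return [v / len(a) for v in fft(a, inverse=True)]

def partitioned_convolve(x, h, block=4):
    fft_size = 2 * block                    # room for linear, not circular, results
    h = h + [0.0] * (-len(h) % block)       # pad h to whole partitions
    H = [fft(h[i:i + block] + [0.0] * block) for i in range(0, len(h), block)]
    x = x + [0.0] * (-len(x) % block)
    X_hist, tail, y = [], [0.0] * block, []
    for i in range(0, len(x), block):
        Xn = fft(x[i:i + block] + [0.0] * block)
        X_hist.insert(0, Xn)                # X_hist[j] lines up with H[j]
        X_hist = X_hist[:len(H)]            # keep only as much history as H needs
        acc = [0j] * fft_size
        for Xj, Hj in zip(X_hist, H):       # the one-shot multiply/sum stage
            for k in range(fft_size):
                acc[k] += Xj[k] * Hj[k]
        block_out = [v.real for v in ifft(acc)]
        y.extend(block_out[m] + tail[m] for m in range(block))  # overlap/add
        tail = block_out[block:]
    return y
```

The nested multiply/accumulate loop is exactly the part that collapses to one parallel pass on the GPU: every (Xj, Hj, bin) triple is independent.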
Variable blocks: this is basically the same as with many fixed blocks, but you have fewer blocks of H in each block group; the size of the blocks doubles with every group. This means more FFT sizes but far fewer multiplication and addition stages, hence the computational saving on a CPU.
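One possible doubling schedule, sketched below (this is my own illustrative layout; real schemes like Gardner's keep a couple of partitions per size so each larger result is ready in time):

```python
def partition_sizes(ir_len, base=64):
    """Partition sizes base, base, 2*base, 4*base, ... until ir_len is covered."""
    sizes, covered, size = [], 0, base
    while covered < ir_len:
        sizes.append(size)
        covered += size
        if len(sizes) >= 2:   # first two stay at the base size,
            size *= 2         # then each group doubles
    return sizes

# e.g. a 1000-sample IR with 64-sample base blocks needs only 5 partitions,
# versus 16 uniform 64-sample partitions.
sizes = partition_sizes(1000, base=64)
```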
In a uniform-block CPU implementation the inefficiency arises from summing the multiplications of Xs by Hs as many times as there are blocks; the FFT and IFFT are quite cheap in comparison. So in the GPU version I describe above, where you do the multiplications in one parallel pass and the additions in log2(num blocks) passes, this is no longer an especially expensive step, because of the parallel nature of the device.
So, if you did a GPU solution using non-uniform partitions, you would end up doing log2(num blocks) FFTs and IFFTs, log2(num blocks) multiplication steps and log2(num blocks) addition steps. There's no saving, so I think it would actually be slower (I'm making some assumptions about how long the very large multiplication and summation stages take). Again, this is an untested theory; smarter people, please correct me.
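Here's the back-of-envelope version of that argument (my own rough model; it counts parallel *steps*, not total flops, the assumption being that on a wide-enough device a step costs roughly the same whether it touches one partition or all of them):

```python
import math

def uniform_steps(num_partitions):
    # 1 FFT + 1 IFFT per audio block, one bulk multiply over all partitions,
    # then a log2(N)-deep parallel reduction for the additions.
    return {"ffts": 2,
            "multiply_steps": 1,
            "add_steps": math.ceil(math.log2(num_partitions))}

def nonuniform_steps(num_partitions):
    # Roughly one FFT/IFFT pair, one multiply step and one add step per
    # doubling group, and the group count grows like log2(N).
    groups = math.ceil(math.log2(num_partitions))
    return {"ffts": 2 * groups,
            "multiply_steps": groups,
            "add_steps": groups}

# With 64 partitions: uniform does 1 multiply step and 6 add steps;
# non-uniform does ~6 of each, plus six times as many transforms.
u, n = uniform_steps(64), nonuniform_steps(64)
```

Under this crude model the non-uniform scheme buys nothing on the GPU: the step counts match on the additions and are worse everywhere else.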
Plus, the computation is now unevenly loaded, which means every so often you have to do buckets of work at once; that could end up causing problems for the early blocks, which need to complete quickly to achieve low latency.
So, your reference implementation can just be a standard overlap/add convolution algorithm unless I’m overlooking something.
For those interested in this topic, what environments would you want to work in? If this is going to be a non-free effort I would recommend VST for Windows and AU for Mac and to forget about Linux for now.