Establishing GPU processor and memory usage

This will work as long as you are in control of the source, which may very well be the case since most topics have already been beaten to death on various DSP forums, and open-source reference implementations exist of everything I can think of right now …

So why not turn the problem upside down: implement the core application, with all the bread-and-butter algorithms one would need most of the time, in CUDA, and let the CPU handle all the odd plugins that you have bought, found or written? I am undecided on whether the magic number is 64 threads like you suggest, or whether it would be more convenient to go for 4 warps and 128 threads. It depends on the data layout, where we could be talking either channels or samples, and on what the sample rate is in the latter case. For now, let’s just say “multi”.

Something like a multi-channel mixer with basic filtering, FX send and an all-singing/all-dancing compressor/limiter/de-esser on each channel, perhaps even a phaser/flanger/short-delay unit for individual coloration. Complement that with a (smallish) handful of convolution reverbs and perhaps three 64 … 128 voice multi-channel synthesizers (say one analog-modeling, one sample-based and something from the FM and/or additive school) and you are set to go.

I believe a CUDA project of that size would fit quite comfortably on the latest batch of Apple portables, with room for blinking lights and real-time waveform displays as well.

Who is in on a joint venture? I am working on the three synthesizers mentioned.

Actually, I’m working in POSIX so that’s not a problem (though that looks nasty… hopefully there’s a fix).

Come to think of it, that’s the only way it would work. You can’t really have plugins if you run a single kernel. (Although… perhaps you could have PTX plugins that you integrate through recompilation.)

You seem to know what effects are needed and where to get reference code. I know how to get them to perform optimally in CUDA. We need someone who knows the applications to do the interfacing via APIs. And I can help on the business side too.

Yeah, the business side would be nice :-D

I am not sure how that is going to happen, considering that all the quality implementations I know of (those with some sweat and blood in them) are mostly under the viral GPL, or alternatively under BSD with attribution.

OTOH, the GPL covers the implementation of an idea, not the idea itself like a patent does, so in principle we may be in the clear if the adaptation requires a rewrite. I’ll have to ask some of the key authors about their stance on this issue so as not to make myself a “persona non grata” within the community.

Personally I see this project as a way to let off some steam and do things one would not normally be asked to do in one’s average day job. The ultimate reward for me would be to say: "Yes, an application resembling the one I outlined (with some changes?) now exists and runs on any platform where there is an Nvidia driver."

There are some clock cycles to fill before we get there, though.

I could use two collaborators on each of the three synthesizer branches without risking anyone’s ideas being canceled out by anyone else’s (since the three principles are so very different).

I could use one or two guys traversing the net once more, collecting all that has been said about compression and about predicting that a signal is about to “go through the roof”, splitting the signal to do de-essing as well as magnetic tape saturation and perhaps a passable tube overdrive. These effects are essentially all one and the same, only with slightly different parameters, and we have the computing power to do the unified version.
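
To make that a bit more concrete, here is a minimal sketch of the shared structure I have in mind (my own naming and parameter choices, not taken from any reference implementation): an envelope follower feeding a gain computer. A compressor/limiter just changes threshold and ratio, a de-esser runs the same detector on a high-passed sidechain, and tape/tube saturation swaps the level-dependent gain for a static waveshaper.

```
#include <cmath>

struct DynParams {
    float thresholdDb;   // level above which gain reduction starts
    float ratio;         // 1 = bypass, very large = limiter
    float attackCoef;    // one-pole smoothing coefficients
    float releaseCoef;
};

// "in" is the signal we actually process, "sidechain" is what we listen to
// (the same signal for a compressor, a high-passed copy for a de-esser).
float processSample(float in, float sidechain, float &env, const DynParams &p)
{
    // envelope follower on the sidechain signal
    float level = std::fabs(sidechain);
    float coef  = (level > env) ? p.attackCoef : p.releaseCoef;
    env += coef * (level - env);

    // static gain computer in dB: reduce everything above the threshold
    float levelDb = 20.f * std::log10(env + 1e-9f);
    float overDb  = levelDb - p.thresholdDb;
    float gainDb  = (overDb > 0.f) ? -overDb * (1.f - 1.f / p.ratio) : 0.f;

    return in * std::pow(10.f, gainDb / 20.f);
}
```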

For basic filters I would like a bland setup of bass/lo-mid/hi-mid/treble to describe the apparent distance of the sound source as well as the angle as heard in headphones. Now I immediately imagine several people objecting, saying that parametric EQ is needed to cancel out bands from microphones that are destructively out of phase - and they would not be wrong. What I would do is use the built-in delay line - assuming we get it - to cancel out those bands, but I realize that this is a different workflow, and as of yet we may very well have clock cycles to burn to do both. So bring it on …

Features not covered yet:

Delay line, FX send, convolution reverb, group mixdown, whether the synthesizers are layered or have separate MIDI interfaces … and probably a few other things I forgot.

It is late night in this part of the world and I need to sleep.

[continued]

There already exists one implementation of a CUDA VST convolution reverb, but it has some major drawbacks: it is Windows-only, there is no source available, and it is stuck in the middle of development using - as far as I can judge from other users’ comments - a slightly suboptimal algorithm with uniformly sized FFTs rather than FFTs of increasing length like the one used in Fons Adriaensen’s JConv:

Jconv is a Convolution Engine for JACK, based on FFT convolution and using non-uniform partition sizes: small ones at the start of the IR and building up to the most efficient size further on. It can perform zero-delay processing with moderate CPU load. …
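
For illustration, here is a rough sketch (my own, not JConv’s actual code) of how such a non-uniform partition schedule could be generated; the “two partitions of each size” rule is an assumption, and real schedulers tune the counts per size differently.

```
#include <vector>
#include <cstdio>

// Gardner-style schedule: start with partitions the size of the audio block
// and double the partition size every few partitions until the impulse
// response (IR) is covered.
std::vector<int> partitionSchedule(int blockSize, int irLength)
{
    std::vector<int> parts;
    int covered = 0, size = blockSize;
    while (covered < irLength) {
        for (int i = 0; i < 2 && covered < irLength; ++i) {
            parts.push_back(size);
            covered += size;
        }
        size *= 2;   // partitions grow further into the IR
    }
    return parts;
}

int main()
{
    // e.g. 256-sample blocks against a 2-second IR at 44.1 kHz
    for (int p : partitionSchedule(256, 2 * 44100))
        std::printf("%d ", p);   // prints 256 256 512 512 1024 1024 ...
    std::printf("\n");
}
```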

So there you have your first reference implementation. Are you still with me?

jma, are you talking about implementing an audio host, i.e. a sequencer, using CUDA? So integrating CUDA at the most intimate level of the audio processing engine in order to get parallel processing across the mixer? Doing this, you’re getting into Apple, Steinberg and Ableton’s arena; I don’t think it’s worth trying to compete with a proper sequencer like Cubase, Logic or Live; you’d need this to appeal to a broad cross-section of musicians. To do that you need to provide plugins that work in the existing hosts.

However, “Something like a multi-channel mixer with basic filtering, FX send and an all-singing/all-dancing compressor/limiter/de-esser on each channel, perhaps even a phaser/flanger/short-delay unit for individual coloration”; this sounds like a channel strip from a mixer. And you could do this as a plugin for various hosts on various systems (like JACK, AU on Mac and VST on Mac/PC; see [url=“http://www.uaudio.com/products/software/cs1/index.html”]http://www.uaudio.com/products/software/cs1/index.html[/url] for the kind of thing already done on audio DSP cards). Writing a VST for Mac and PC to implement this wouldn’t be too hard.

Maybe a CUDA plugin-chainer VST plugin could be devised if everything wasn’t combined in a channel strip: one host plugin that routes audio between multiple CUDA plugins (compressor, EQ etc.), so the audio only travels to and from the card once per processing block instead of once per plugin.
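
A minimal sketch of that chainer idea (the kernels here are trivial placeholders and the VST plumbing is omitted; devBuf is assumed to have been allocated with cudaMalloc elsewhere):

```
#include <cuda_runtime.h>

// Placeholder effect kernels -- real EQ/compressor code would go here.
__global__ void eqKernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 1.0f;           // no-op stand-in for an EQ
}
__global__ void compressorKernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 0.9f;           // stand-in gain stage
}

// One upload, the whole chain on the card, one download per audio block.
void processBlock(const float *hostIn, float *hostOut, float *devBuf, int n)
{
    cudaMemcpy(devBuf, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128, blocks = (n + threads - 1) / threads;
    eqKernel<<<blocks, threads>>>(devBuf, n);
    compressorKernel<<<blocks, threads>>>(devBuf, n);

    cudaMemcpy(hostOut, devBuf, n * sizeof(float), cudaMemcpyDeviceToHost);
}
```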

The problem is I fear many audio DSP processes are quite difficult to do in parallel when you are taking in a block of samples from the host, processing them and putting them out again (having ruled out working across the channel mixer above). Example: an IIR filter has many feedback elements, so each sample has to be processed in order; (I think) the opportunities to block-process samples here are very limited - essentially a single kernel with a loop in it. If I am wrong here, somebody please correct me :) Processes involving decent flange/chorus typically use IIR filter structures to implement interpolation (as linear interpolation discolours the sound due to a high-cut filtering effect). I don’t yet know how digital compressors work, but they’re very difficult to get sounding good without analogue modelling of the circuits of classic gear.
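
Here is a minimal sketch of the dependency I mean, assuming a standard direct-form biquad (coefficients and the planar channel layout are placeholder choices); the best you can do is give every channel its own thread, but the sample loop itself stays serial:

```
// Each y[n] depends on y[n-1] and y[n-2], so no thread can split the
// sample loop; parallelism only comes from processing channels side by side.
__global__ void biquadPerChannel(const float *x, float *y,
                                 int nChannels, int nSamples,
                                 float b0, float b1, float b2,
                                 float a1, float a2)
{
    int ch = blockIdx.x * blockDim.x + threadIdx.x;
    if (ch >= nChannels) return;

    const float *xi = x + ch * nSamples;   // planar layout: channel-major
    float *yo       = y + ch * nSamples;

    float x1 = 0.f, x2 = 0.f, y1 = 0.f, y2 = 0.f;
    for (int n = 0; n < nSamples; ++n) {   // this loop is inherently serial
        float out = b0 * xi[n] + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
        x2 = x1; x1 = xi[n];
        y2 = y1; y1 = out;
        yo[n] = out;
    }
}
```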

Any FIR-based approach (linear-phase EQ, reverb impulse-response convolution), however, is a good bet, as there is no feedback and blocks of samples can be processed in parallel.
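
For example, a direct-form FIR can hand one output sample to each thread, since no output depends on any other output (this sketch ignores the history tail from the previous block for brevity):

```
__global__ void firBlock(const float *x, const float *h, float *y,
                         int nSamples, int nTaps)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= nSamples) return;

    float acc = 0.f;
    for (int k = 0; k < nTaps; ++k) {
        int idx = n - k;
        if (idx >= 0)              // previous block's tail ignored here
            acc += h[k] * x[idx];
    }
    y[n] = acc;                    // every y[n] is independent of the others
}
```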

OK, turning to convolution reverb, you were discussing this plugin, which is a Windows VST: the KVR thread “Convolution Reverb for NVidia and ATI GPUs - saving CPU time” (Effects Forum, KVR Audio). The problems with it are mainly that it isn’t terribly reliable (at least when I first tried it - just some crashes etc. when you overloaded the GPU memory, not a show-stopper), the audio file format support is limited (easy fix by using libsndfile), it has a GUI which you couldn’t sell, as it can’t modify the sound very well (decent GUIs and supporting features take months to write, as we all know), and that it has a massive input/output latency (8192 samples, 185 ms at 44.1 kHz).

Getting low latency without making CUDA eat up CPU may be the big issue really; everything else is conceptually simple to solve. A decent software convolution reverb has zero input/output latency (using Gardner’s well-known non-uniform-length partition algorithm, to which you refer). Any DSP-card-based solution has to block stuff up, send it to the card, process it and get it back. It’s less efficient to transfer data the smaller the block, and you have less time to process the data the smaller the block (this is why I’ve been asking in other posts for people’s bandwidth tests on low block sizes).

Getting the latency down to acceptable levels for musicians (perhaps 256 samples, about 5.8 ms at 44.1 kHz) is a hurdle, but I don’t believe it’s insurmountable, as I’ve got some of the way there already. One problem is that as latency goes down you transfer smaller and smaller blocks, and the transfers take a progressively larger proportion of your total execution time - which is itself also going down as the number of samples per block shrinks.

So it means the CUDA resource is being used quite inefficiently - but as long as it doesn’t affect the CPU, the customer doesn’t care and will buy a faster and faster card if they like the end result. Maybe they will increase their latency to get more out of the card, maybe they won’t (depending on their personal preference and usage scenario: a live user will want minimal latency, while a user with plugins on an audio track that can be latency-compensated by the host will opt for a big block size).

Now, the merits of using uniform vs non-uniform partitions for convolution reverb. I think using fixed partitions is best. Let’s consider what you do with fixed partitions (assume we are given a known number of samples every time and that the length of the impulse response has been padded as necessary).

For fixed blocks: take the input block x_n and FFT it into X_n. Multiply X_n with H_0, X_{n-1} with H_1, and so on for every block (you must retain a history of X for as many blocks as there are in H). Add all these up and take the IFFT to get the output y. Add the head of y to the tail of the previous y (if using overlap/add) and you have an output block. Now, the process of multiplying all past X blocks with every block of H can be done in a one-shot parallel operation. Highly efficient. The summation/reduction of all blocks into one output block can be done in log2(num blocks) parallel operations using a standard reduction, also fairly efficient.
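
As a rough sketch of that multiply/sum stage (the names, memory layout and the flat inner loop are my own assumptions; the log2(num blocks) reduction described above would replace the loop to expose more parallelism):

```
#include <cuComplex.h>

// Xhist holds the spectra of the last numPart input blocks (most recent
// first), H holds the spectra of the impulse-response partitions, both
// stored partition-major. One thread per frequency bin sums
// Xhist[p][bin] * H[p][bin] over all partitions.
__global__ void partitionedMAC(const cuFloatComplex *Xhist,  // numPart * fftBins
                               const cuFloatComplex *H,      // numPart * fftBins
                               cuFloatComplex *Y,            // fftBins, fed to the IFFT
                               int numPart, int fftBins)
{
    int bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin >= fftBins) return;

    cuFloatComplex acc = make_cuFloatComplex(0.f, 0.f);
    for (int p = 0; p < numPart; ++p)
        acc = cuCaddf(acc, cuCmulf(Xhist[p * fftBins + bin],
                                   H[p * fftBins + bin]));

    Y[bin] = acc;   // IFFT of Y plus overlap/add gives the output block
}
```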

Variable blocks: this is basically the same as for many fixed blocks, but you have fewer blocks of H in each block group, and the size of the blocks doubles every group. This means more FFTs but far fewer multiplication and addition stages, hence the computational saving.

In a uniform-length-block, CPU-based implementation the inefficiency arises from the part where you sum the multiplications of Xs by Hs as many times as there are blocks. Doing the FFT and IFFT is quite cheap in comparison. So in the GPU-based approach I mention above, where you do the multiplications in one parallel pass and the additions in log2(num blocks) steps, this is no longer an especially expensive step, because of the parallel nature of the device.

So, if you did a GPU solution using non-uniform-length partitions, you would end up doing log2(num blocks) FFTs and IFFTs, log2(num blocks) multiplication steps and log2(num blocks) addition steps. There’s no saving, and I think it would actually be slower (I am making some assumptions about the time the very large multiplication and sum stages take). Again, this is an untested theory; smarter people, please correct me.

Plus the computation is now unevenly loaded, which means every so often you have to do buckets of work, and that could end up causing problems for the early blocks, which need to complete quickly to achieve low latency.

So, your reference implementation can just be a standard overlap/add convolution algorithm unless I’m overlooking something.

For those interested in this topic, what environments would you want to work in? If this is going to be a non-free effort I would recommend VST for Windows and AU for Mac and to forget about Linux for now.

Free or non-free is all the same to me, but … no Linux? What am I to do now? :o

Anyway, regarding parallel IIR: suppose we have the data of a single channel/plugin laid out sideways across all threads. Then we could read in new samples from one end - possibly using a cached area like __constant__ for the argument from the host - and pass the result on to the neighboring thread, which does the same until we reach some endpoint, which places the final result in a second __shared__ array until it is filled, whereafter we fire a coalesced save to __global__ memory. At the end we will have an array of partially unfinished results, which we save until later - again coalesced - and load upon the next invocation of the plug.

That should work, I think?

edit: No, wait a second - that is either a FIR or some monster multi-node IIR? Perhaps I’d better go back and continue picking the low-hanging fruit I have already started on, and let you guys handle the commercial stuff.

Um, the idea here wasn’t to discourage you!

Before I potentially start a flame war: I was just recognising that most musicians currently choose the commercial Windows- and Mac-based sequencers, as the Linux options are not as advanced yet (so to me they are of less interest; I am sure others would disagree and encourage me to contribute to those projects).