Are you sure you need a GPU for this job? From your specs it sounds like you'd be better served by a shared-memory architecture with caching, i.e. a real DSP (see the TI DaVinci family for an idea), since the main challenge with CUDA is minimizing memcpys to and from the GPU. I'd expect you to have a lot of memory transfers back and forth between host and device, because working in real time means you can't do much buffering at all; that's a common constraint of RT apps.
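To illustrate the transfer-overhead point, here is a minimal sketch of the pattern that makes CUDA worthwhile: upload once, run many kernels on device-resident data, and download only the final result, instead of a memcpy pair per frame. The kernel `process` and the sizes are hypothetical stand-ins, not from your application.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-sample DSP stage; replace with your real kernel.
__global__ void process(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 0.5f;   // placeholder work
}

int main() {
    const int N = 1 << 20;               // example buffer size
    float *h_buf = new float[N]();
    float *d_buf;
    cudaMalloc(&d_buf, N * sizeof(float));

    // One upload, many kernel launches on device-resident data,
    // one download: two memcpys total instead of two per frame.
    cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);
    for (int frame = 0; frame < 100; ++frame)
        process<<<(N + 255) / 256, 256>>>(d_buf, N);
    cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    delete[] h_buf;
    return 0;
}
```

If your real-time loop forces a host round trip every frame, the PCIe transfers dominate and the GPU advantage evaporates, which is exactly why a DSP may fit better.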
In such a scheme it's more useful to port generative processes to the GPU (like ODE/PDE integrators for physics simulations), or streaming computations (like large transforms: FFTs, DCT/DWT) over large data sets. In the former case you can limit the data flow to the few coefficients you actually need, reducing memory overhead and the critical issues around memory coalescing and bank conflicts. In the latter, if your pipeline is well designed, you can get the best performance by launching far more threads per second than any CPU on the market could handle.
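As a sketch of the streaming-transform case, NVIDIA's cuFFT library runs large batched FFTs entirely on the device. The sizes below are illustrative, and the upload step is elided since it depends on your data source.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int N = 4096, BATCH = 256;     // example transform size and batch count
    cufftComplex *d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * N * BATCH);

    // One plan reused for a whole batch of 1-D transforms.
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, BATCH);

    // ... upload your signal into d_data here ...

    // In-place forward FFT over all BATCH transforms in one call;
    // the data never leaves the device between stages.
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```

Because the whole batch stays on the GPU, you amortize the transfer cost over many transforms, which is exactly the regime where the GPU beats the CPU.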
CUDA coding issues are the reason we're all on this forum >.< , so feel free to ask anything you need. Bye!