You need to drive each card from a separate host thread and split the data to be processed between those threads. You will also probably need a step that reduces the outputs from the various threads into a final result, unless each thread can be constructed to work on an independent problem.
The CUDA side of this is actually pretty easy: you just have each thread set up a different card and then call your kernel as normal. The hard part is setting up the threads and the data synchronization. That is a pretty deep subject, so if you aren’t already comfortable with it I would suggest looking at some good books. I’m primarily an Ada programmer when it comes to real-time and concurrent processing issues, so I found the following to be a great reference:
Concurrent and Realtime Programming in Ada by Alan Burns and Andy Wellings
Since I know the chances of you being an Ada programmer are about nil unless you are in the Aerospace or Defense fields, here are a couple of works referenced in this book that might be a bit more generic:
Concurrent Programming: Principles and Practice by Greg Andrews
Real Time Computing by Wolfgang A. Halang and Alexander D. Stoyenko
and of course there is also the Dinosaur book on OSes, which, if you’ve gotten this far, you would do well to look into. It covers basic usage of semaphores and threads among a host of other OS topics. I’ve never met a CmpE or CS major who didn’t use some edition of this book:
Operating System Concepts by Abraham Silberschatz, Peter B. Galvin and Greg Gagne
There are about 100 editions of this book; this happens to be the latest (and obviously most expensive), but if you are using any Unix-based system (including newer Macs) then practically any edition should be fine for the basic concepts. Newer editions also cover some aspects of Windows.
These are professional-level, foundational books. If you want something more shake-and-bake, I’m sure there are tons of other books out there, and there is a sticky thread about parallel programming references.
There is also a Multi-GPU example in the SDK that is worth looking at, but I caution that race conditions and corrupt data are just about guaranteed if you try to modify examples too much without really understanding the basic principles involved.
I hope I haven’t made this seem overwhelming, because honestly I don’t think it is. Of course maybe I’m strange in that I find this type of programming to be pretty interesting.
Also, if you are looking into CUDA to speed things up, the concepts in the above books will let you scale up not just to multiple GPUs in a single machine, but to entire clusters of machines, so the rewards are well worth the investment. And while CUDA is likely to stick around for a few more years and then be supplanted, sockets and thread synchronization haven’t changed significantly in ages.
Best of luck,