Testing app on multiple GPUs

I have written a CUDA-based program to calculate the efficient frontier (optimal portfolio) based on Markowitz theory.
http://en.wikipedia.org/wiki/Capital_asset_pricing_model

It is already damn fast, but I cannot test how it scales across several GPUs (which it is written for), since I only have a single GPU and cannot afford a new motherboard right now.
So I am looking for people who can test this for me on computers with multiple GPUs: the more the merrier, and the more powerful the better :)

I can send the program; it produces console output with timings. By default it uses all the GPUs present, but it has a flag for running on 1 GPU only.
What I hope to get back is the console output with timings for 1 GPU and for all available GPUs.
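
For anyone sending timings back, the two numbers (1 GPU and all GPUs) are enough to judge scaling. A minimal sketch of that arithmetic follows; the function name `scaling_report` and the sample timings are made up for illustration and are not part of the program or its output.

```python
def scaling_report(t_single, t_multi, n_gpus):
    """Return (speedup, efficiency) from two wall-clock timings in seconds.

    t_single: time with the 1-GPU flag
    t_multi:  time using all n_gpus devices
    """
    if t_single <= 0 or t_multi <= 0 or n_gpus < 1:
        raise ValueError("timings must be positive and n_gpus at least 1")
    speedup = t_single / t_multi      # how many times faster the multi-GPU run is
    efficiency = speedup / n_gpus     # 1.0 would mean perfect linear scaling
    return speedup, efficiency


if __name__ == "__main__":
    # Hypothetical timings, only to show the call.
    s, e = scaling_report(t_single=12.0, t_multi=2.0, n_gpus=7)
    print(f"speedup {s:.2f}x, efficiency {e:.0%}")
```

An efficiency well below 1.0 would suggest that something other than GPU compute (host-side work, transfers) is limiting the multi-GPU run.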

Crossing my fingers for a kind soul somewhere out there.

Kind regards
Peter

As long as it is easy to compile and run on Linux, I should be able to help you out. I have a Core i7 2.66 GHz sitting on my desk with three GTX 295 cards and a GTX 470 installed, for a total of 7 CUDA devices.

To give you some idea of the I/O topology: the motherboard uses the X58 chipset and two NF200 chips to multiplex 32 PCI-Express 2.0 lanes over four PCI-E 2.0 x16 slots. Internally, the GTX 295 cards also use an NF200 to multiplex two devices onto each PCI-Express slot. So every CUDA device on its own has access to the full host<->device bandwidth of PCI-E 2.0 (about 6 GB/sec), but simultaneous access from multiple devices will eventually saturate the QPI link to the CPU.
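
To put rough numbers on the saturation point, here is a back-of-the-envelope sketch. It assumes a simple model where all active devices share one host link equally and ignores the NF200 switch hierarchy. The 6 GB/sec per-device figure is the one quoted above; the 12.8 GB/sec host link figure is an assumption (the commonly quoted per-direction rate for a 6.4 GT/s QPI link), not a measurement of this machine.

```python
def per_device_bandwidth(n_active, pcie_gb_s=6.0, host_link_gb_s=12.8):
    """Estimate host<->device bandwidth per GPU when n_active GPUs transfer at once.

    Each device is capped by its own PCI-E rate and by an equal share
    of the (assumed) host link bandwidth, whichever is lower.
    """
    if n_active < 1:
        raise ValueError("n_active must be at least 1")
    return min(pcie_gb_s, host_link_gb_s / n_active)


if __name__ == "__main__":
    for n in range(1, 8):  # 1 to 7 CUDA devices
        bw = per_device_bandwidth(n)
        print(f"{n} active device(s): about {bw:.2f} GB/sec each")
```

Under these assumptions, the per-device rate starts dropping once three or more devices transfer at the same time, so an application that moves a lot of data per kernel launch would scale worse than one that keeps its data resident on the GPUs.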

Seibert, thanks for your kind offer.
However, it is a Windows 32/64-bit version at present, so no Unix. Too bad, I would have loved to see the results on your computer with the three GTX 295 cards.

Cheers,
Peter
