Looking for CUDA apps that can use more than 1 GPU.

I am looking for CUDA apps that can use more than one GPU. I have a GTX 280 and a GTX 295. I would love to find some apps that will use both halves of my 295 and my 280, all at the same time.

If you know of any, please post up a link. :thumbup:

This is the only one I can find so far, and it is still in development.

The CUDA Factorial Benchmark from Total-OC.ru

FYI - If you are working on one, and could use a BETA Tester, I stand ready to serve. I am a CUDA and PhysX fan!


Still on the hunt for an OpenCL or CUDA app that will use all 3 of my GPUs at the same time.

I am getting EVGA to help on the hunt too.


"Where are the OpenCL Benchmarks?

I keep seeing more and more videos, but nothing I can download to check my FPS.

Can we get these to try out?


Both Nvidia and ATI can now run OpenCL on the latest drivers, correct?

We need a benchmark bad.

I still hold out hope that OpenCL will use all 3 of my GPUs.

CUDA can, but no app has yet been produced that will use my 295 and 280 at the same time…

Or if an app/benchmark was made that will do so, I have never found it."

#11 from the CUDA 2.1 FAQ.

Does CUDA support multiple graphics cards in one system?

Yes. Applications can distribute work across multiple GPUs. This is not done automatically, however, so the application has complete control.

The Tesla S1070 Computing System supports 4 GPUs in an external enclosure:


There are also motherboards available that support up to 4 PCI-Express cards.

It is also possible to buy pre-built Tesla “Personal SuperComputer” systems off the shelf:


See the “multiGPU” example in the CUDA SDK for an example of programming multiple GPUs.

I wish a programmer would offer up a download link to a multiGPU app that runs on Windows 7.

You can’t have been looking too hard. HOOMD is a serious molecular dynamics simulation environment with excellent multi-GPU support. But I doubt that is what you really want.

There is also another issue: running a display on a GPU imposes hard kernel execution time limits on CUDA which don’t exist for dedicated compute cards. As a result, most codes are intended to run on dedicated compute cards which are not under the control of an active display driver. That would probably mean running on both halves of your GTX 295 only, with no display driven by that card.

CUDA and OpenCL are fundamentally for computation, not graphics. The majority of applications don’t do anything dynamically visual and don’t have an intrinsic “FPS” number. Performance analysis is usually considerably more sophisticated and subtle than that.

I don’t think so. Both NVIDIA and ATI are supplying special beta drivers with OpenCL support, but I don’t believe either company has OpenCL support in its current release drivers. Further to that, I don’t believe it is possible to build a single application that can run on either flavour of hardware. You can take OpenCL code and compile it with either vendor’s SDK and it will probably work, but the back-end code and support libraries are completely different and incompatible. Further to that, both flavours of OpenCL are really in beta, and the performance and capabilities of both appear to be inferior to either vendor’s proprietary GPU compute environment (i.e. CUDA or Stream).

Is this need driven by a potential application you have in mind to develop yourself and want some indicative numbers before you start, or is this purely an exercise in phallus metrology?

There is another problem. Right now, at least in CUDA, the combination of NVIDIA’s drivers and WDDM imposes a number of pretty serious limitations that other platforms (Windows XP and Linux) don’t have. As a result, I don’t think there are really that many developers seriously working with Vista or Windows 7 for multi-GPU or large memory applications. There is rumoured to be a fix for this on the way, but it will probably involve having to install a special “compute only” driver for non-display cards. Which, again, isn’t what you are asking for.

I know about one graphical app that does multi-GPU: Flam4 - a fractal flame renderer for Windows. It renders .flame files that you create with another app called Apophysis, several hundred times faster than on the CPU. And on my 4-GPU rig, which I built for Folding@Home, this app really smokes ;)

The Mandelbrot SDK sample is desperately asking for someone to make it multi-GPU capable. I was imagining that one could split up the rectangular rendering area into smaller rectangles with a treemapping algorithm, where each GPU gets a share of the total that is proportional to (number of shaders × shader clock).

I see one issue with splitting up graphical apps like this. The CUDA OpenGL interop is slow when used on non-display cards. All rendered data has to be sent back to the host and transmitted to the displaying GPU, which adds extra overhead and potentially kills frame rates.

I would also love Multi-GPU support for G80/G92 cards to be enabled in the Optix raytracing SDK. Otherwise I have to work around it and split up my screen area as described above, effectively running 4 renders in parallel.


Thanks so much for your very informative posts. B)

BTW, here is that link again…


The reason I want any app that will show off the value of the GPU as a computational device is that I believe in the GPU revolution.
I think it’s time the CPU takes a back seat to the GPU for animal speed.
I think CUDA and Fermi are the way of the future, and want numbers to prove it.

I also think running multiple GPUs, like my 295 and 280 with only one display attached, should have at least one benchmark that looks for our extra GPUs.
It seems to be a strong value-add for CUDA… Give us a benchmark that stresses this important point.
I have seen it posted that CUDA can now use both GPUs while still in SLI mode, and I also know that having a GPU operating in dedicated PhysX mode is no problem for CUDA.
I am getting frustrated that we have yet to see the app or benchmark that will use all 3.

If I could give Nvidia marketing a few words they would be…

1 - First, I would like to say how well the opening keynote video with Jen-Hsun Huang was done. I enjoyed seeing the progress through the years. The dramatic music when the birth of CUDA was shown gave me goosebumps.
We need more posted on the net!!

2 - Go out of your way to get an OpenCL and CUDA benchmark made that has some nice graphics involved, if possible. There are many like me who do want the GPU to get the respect it deserves. I favor a CPU-vs-GPU comparison of single- and double-precision calculation performance. I would also like to see it use all the GPUs in my system, and have the benchmark reflect that. It’s smart business to make us look at snapping a second or third GPU into our systems as a valid method to upgrade our performance.
(And not just for games, that is short sighted IMO.)

3 - Have a contest for complete CUDA apps for Windows 7 with download links.
Let the people submit their apps…
The winner gets Fermi… (Do you have any idea how much traction that would get on the boards?) :woot:
It would be a total win!!
Let us vote on the winner. Put it where everyone would have access, to be able to vote and download the apps.
You never know what you just might get, and the publicity would be good for all involved.

I have faith that, in time, more apps will be made for us commoners and our Windows 7 systems that we will want to run. The army of CUDA programmers being raised up will prove invaluable.

4 - On the OpenCL front, ensure you have something for us guys with 200 Series cards and OpenCL WHQL drivers to download.
I want to see an OpenCL app run on my system. It will be fun to see the ATI boys run it too. :P
I want the demos I have seen in the videos ready to run without having to install an entire developer package.
Needless to say, have it also sniff out whether we have more GPUs just sitting there that want to join in on the calculating party.
Mine do!

The Internet is literally full of conference papers, journal papers, presentations and theses which describe real applications, performance numbers, and comparisons to other architectures (CPUs, Cell, FPGAs,…). NVIDIA even has a number of them for download on its CUDA website. What you are asking for is there for the taking, if you care to look for it.

So phallus metrology it is.

Conference papers, journal papers, and presentations are OK, but I kind of wanted to generate performance numbers on my system, and my buddies on the board. (More fun.)

I keep hoping they make some progress here…

The CUDA Factorial Benchmark from Total-OC.ru

I guess using all the GPUs in a Windows 7 system must be tricky business?

I also like CUDA-Z…
It does have a nice performance measurement in it, but it’s also 1 GPU at a time.

I think the first app or benchmark to use all GPUs at the same time will be very popular. Let it be by Nvidia’s hand, and done well.

I am guilty of thinking guys that run more GPUs should get better performance numbers on their systems running CUDA apps. :wave:
The brutal truth is that in the CUDA benchmarks released so far, a guy running a single 285 will produce better numbers than me running both a 295 and a 280.
It would be in Nvidia’s best interest to have a sweet benchmark out that gives a better ‘Total System’ performance number. :))
Let me see what numbers might be generated on my system when both halves of my 295 and my 280 all think about the same thing. :devil:


That was just wrong… :haha:

Funny, yes!!

But still wrong. :no:

Looking into Sandra Light now:


Initial impression is the CUDA benchmark looks odd, with a 5870 being over twice as fast as a 280, and faster than both halves of my 295…

And it is missing the ‘Use all GPUs in my System’ button.

One more benchmark, IMO, that won’t measure your Total System Performance.

Question: In the Sandra test, both the OpenCL and the Compute Shader test only used 1/2 of my 295.

What controls whether 1 or all 3 of my GPUs get used for the task?

If it’s strictly an app thing, and the way it’s written, I just need to hope for the best.

If it’s Microsoft’s DirectCompute that decides (in the case of the Compute Shader test), Nvidia should make every effort to have Microsoft look for more idle GPUs in your system to load up too.

Best case, it’s driver-controlled, and Nvidia could make sure that happens in the near future.

Probably app though I’m guessing.

I was pleased to see that the CUDA test did use both halves of my 295.

I still am mulling over this post by avidday: “Further to that, I don’t believe it is possible to build a single application that can run on either flavour of hardware. You can take OpenCL code and compile it with either vendor’s SDK and it will probably work, but the back end code and support libraries are completely different and incompatible. Further to that, both flavours of OpenCL are really in beta and the performance and capabilities of both appears to be inferior to either vendors proprietary GPU compute environments (ie. CUDA or Stream).”

Kind of shocked about the OpenCL answer.

Not that it’s way behind CUDA in performance, development, and tools… (I agree, and think it will be for some time.)

But it makes it sound rather bleak to develop in OpenCL. If you have to code it a special way for each GPU vendor for it to work properly, you might as well have gone straight for CUDA or Stream right out of the gate.

I guess it’s CUDA FTW?



Folding@Home and FahMon may be the tools to do just that. And the Points per Day (PPD) metric is the one to brag with on the forums.

I get about 17000 PPD with my Quad GPU system - and the heat of the exhaust fans does about compare to my blowdryer.

This could be because the CUDA forums here are quite developer-centric and not targeted much at end-users. Also most people here seem to be professionally or academically involved with CUDA. So the idea of coming up with such benchmark tools doesn’t really resonate much here. When we do care about benchmarks, it’s usually how fast we can get our own apps to run… ;)


About Folding@Home, Yep…

I have generated over a million points for the mighty team EVGA. B)

Too bad that with folding, you need to run 3 instances of GPU folding to use 3 GPUs.
Note that with VMware and CPU folding, the BETA version lets all 4 cores of your CPU fold in one instance.
Your GPUs should too.

I would rather have one program running, with a graphic representation of the GPUs installed in my system in a GUI, where I just click to activate however many I want to load up.

That would be outstanding.

Well then from the perspective of an end user, if any of the developers here ever make a CUDA app for Windows 7…

Do the best you can to ensure that both GPUs in SLI, and dedicated PhysX GPUs, get loaded up when you task our systems with work.

We need to better show how running (3) Nvidia 200 Series GPUs and CUDA apps opens the door to new possibilities that the CPU couldn’t give us.

Sorry if you took that as an insult, it wasn’t intended to be - it was intended to be a pointed critique of what you seem to be asking for.

My problem is that you keep talking about an “app” - but you haven’t mentioned once what that application might actually do. So I can only conclude that you don’t actually care much what it does, just so long as it does it on 3 GPUs. Most people here do “real” stuff - image processing, computational physics, things like that. So what is it you are really asking for?

What I am asking for is the best possible future for the GPU Revolution.

I am looking/begging for developers to produce apps/benchmarks that use all our GPUs at once.

I still would love to find one.

As far as what the apps will do for us, who knows?

That is why I encourage the idea of contests for complete CUDA apps.

Can CUDA be used for retinal security scanners, voice recognition, computer speech, photo/video enhancement, ray tracers, PhysX, folding…

Who knows?

Like I have posted, I have faith that, down the line, CUDA apps will be produced that I will want to run.
I have no worries about that.

What alarms me currently is the fact that CUDA can access all of the GPUs in my system, but I can’t find many apps that actually do.
Even system performance benchmarking programs…
I simply think we need that to change.

I also think it’s good for GPU computing in general to produce GPU-vs-CPU performance numbers generated on your own system.
It will put it right in your face that the real action is processing on your GPU(s). With CUDA, I just think it’s a crime to only run on 1/2 of my 295.

You can understand that… :)

Nvidia hasn’t made this part very easy. They could have made automatic work splitting across several GPUs part of their programming API. That would have taken most of the workload off the programmer, enabling automatic scaling with the available resources.

Instead, they make the programmer responsible for checking how many GPUs there are in a system; then he has to create multiple threads and initialize a CUDA context for each device. In the end, the partial results of all GPUs have to be aggregated into a final result. Out of 30 or so SDK samples, only one or two demonstrate how to do this.

I am not surprised most programmers don’t bother adding multi-GPU support to their apps, as it adds a lot of extra burdens and pitfalls (multithreaded programming being one of them).


Actually I think people tend to make this more complex than it should be. You’re running thousands of threads on the GPU and you can’t manage X (X being 1-8) CPU threads to connect to each GPU? Come on…

You have samples in the SDK, GPUWorker from Mr Anderson, and you can basically use any CPU threading model/sample from the internet; just add cudaSetDevice and a kernel call and you’re done.

Regarding this thread (pun intended) - I think avidday is correct - most of us write applications specific to what we need/do, and every one of us has our own benchmarking - otherwise I wouldn’t know if I indeed got a boost using the GPU or not.

My application, for example, has a ~60x performance boost over the CPU code.

More specifically - if you go to http://www.nvidia.com/object/cuda_home.html# you’ll see some of the performance boosts people got.


That might be true for “embarrassingly parallel” problems, but there are whole classes of applications where distributed memory/out-of-core implementations are really difficult to do efficiently. In distributed memory computing, minimising communication overhead is key (in the multi-GPU single-host model that means PCI-e bus speed/latency; in a distributed memory cluster it would be network speed/latency). Getting a scalable, fast implementation of even rather standard problems (linear algebra solvers, graph partitioners and the like) for distributed memory systems is still an active research topic.

Our friend Talonman seems to think that his dream multi-GPU benchmark doesn’t exist because of developer recalcitrance. But the truth is that many types of problems are really hard to do well in distributed memory systems, which is what multi-GPU effectively is. And his argument about why such a benchmark should exist seems to boil down to “because it would be awesome” or maybe “because it would be a fair reward to those who have bought multi-GPU rigs”. Perhaps I don’t get it, but neither of those seems like a particularly compelling development reason to me.