Looking for CUDA apps that can use more than 1 GPU.

Raw performance…

I want to see it in the benchmarks, and I believe it will carry over into the numerous new CUDA apps that I am running or will be running.

I just want CUDA apps to shine when you have 720 stream processors available on your system, waiting for work.

I would be glad for a benchmark, even one working on an “embarrassingly parallel” problem, to load up all 720 as best it could.

Is benchmarking a noble cause? No. But it’s not evil either, nor is wanting to load up your system close to its max potential.

As to how efficiently actual apps may be able to use 3 GPUs, as you can tell, I can’t speak to that.

I’m just tired of running on 1/2 a 295, most of the time.

When I learned that CUDA could now hit both sides of my 295 while in SLI mode, and already knew it could access my PhysX GPU, I was hoping that would solve most issues.

Guess not! :">

You see the pictures of racks and racks of GPUs, all working together, and I just want to keep 3 busy…

Now I feel like I’m asking for the world.

Rotten End Users!!

Against all odds, I am going to hold on to my dream that such an app may some day be produced.

(And Nvidia should give them big kudos…)

From a Windows 7 perspective, I believe it will be the first of its kind. (At least the first that’s free to download!)

It would instantly be installed by every Nvidia user in the known universe.

If it had your company logo on it, the advertising and chatter in the forums would be priceless.

I think there is a very real incentive for whoever produces one, and it would be good for CUDA in general.

I think this cuts to the core of some of the “understanding barrier” that seems to be at work here. CUDA (and OpenCL) applications don’t see 720 stream processors. They don’t see any stream processors. They see three discrete GPUs, each with separate memory. The GPUs can’t “see” one another and they can’t “talk” to one another. Each GPU is totally independent. If the application is to work with multiple GPUs, the programmer has to devise a way for that to happen. CUDA itself and the driver can’t do it for you. Depending on what the application is trying to do, it can be very difficult to do efficiently. It can even be slower than just using a single GPU, because the PCI Express bus is 20-30 times slower than GPU memory, and anything that has to be passed between GPUs has to traverse the PCI Express bus twice to get from one GPU’s memory to another’s.
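
To make that concrete, here is a minimal sketch of what “three independent GPUs” looks like to a program, using the standard CUDA runtime API (sizes made up, error checking omitted, and written against a current CUDA runtime, where one host thread may switch between devices):

```cpp
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);            // e.g. 3 for a GTX295 + GTX280 system
    printf("CUDA sees %d separate devices\n", count);

    const size_t bytes = 1 << 20;          // 1MB of data, purely for illustration
    float *host = (float *)malloc(bytes);  // staging buffer in system RAM
    float *d0 = NULL, *d1 = NULL;

    cudaSetDevice(0);                      // an allocation lives on one GPU only
    cudaMalloc((void **)&d0, bytes);

    cudaSetDevice(1);
    cudaMalloc((void **)&d1, bytes);

    // "Sending" data from GPU 0 to GPU 1 means two trips across PCI Express:
    cudaSetDevice(0);
    cudaMemcpy(host, d0, bytes, cudaMemcpyDeviceToHost);  // GPU 0 -> host
    cudaSetDevice(1);
    cudaMemcpy(d1, host, bytes, cudaMemcpyHostToDevice);  // host -> GPU 1

    free(host);
    return 0;
}
```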

The changes you are talking about only mean you don’t have to turn off SLI to use CUDA. They change nothing on the CUDA side at all and have no effect on the multi-GPU capability of CUDA. CUDA has nothing to do with SLI; they are completely orthogonal. The SLI link can’t be used by CUDA applications (it is actually slower than the PCI Express bus anyway).

Another misconception. Those “racks and racks of GPUs” aren’t all “working together”. They are plugged into racks and racks of totally independent compute nodes, and each compute node is using maybe one or two GPUs simultaneously, not more. When a whole cluster is running one application (which in my experience isn’t that often), those racks and racks of CPUs are doing all the thinking and inter-node communication required to glue the whole contraption together. The cluster nodes chug along in unison at Ethernet or InfiniBand speed, bringing their GPUs with them. And they almost universally run Linux, not Windows. It isn’t remotely like what you seem to be imagining, and it really has no bearing on your dream benchmark application.

In my first reply to you, I gave you a link to HOOMD, which is a real, live high performance multi-GPU CUDA code for simulating molecular dynamics. It contains an outstanding example of a multi-GPU master-slave implementation for letting one CPU drive many GPUs simultaneously using CUDA. The guy who develops it posts here quite often. I believe it formed the core of his PhD thesis. It isn’t my area of expertise, but from what I understand MD problems probably could be classified as embarrassingly parallel problems. Roughly speaking, you can chop up the total amount of work in a given step of the application into fairly independent parcels and process each parcel separately, and HOOMD does exactly that. Molecular dynamics codes have been something of a poster child for how good GPU computing can be compared to CPUs. Speed-ups of many tens of times over multicore CPUs are normal.

The HOOMD distribution contains a demo called micelle. If you run the micelle demo on the CPU version of HOOMD, it would probably take a week to finish. If you ran it on one half of your GTX295, you would be waiting about 18 hours for it to finish, or about 9 or 10 hours for both halves. Maybe 14 on your GTX280. If you ran Linux and could use all three of your GPUs, it might take 6 or 7 hours to finish. For the guys who do computational chemistry that is a huge deal. To get that full set of numbers you would need to turn over your PC to running HOOMD exclusively for a couple of weeks, 24/7. As a fanboi benchmark it is probably about as interesting as watching paint dry.

And that, in a nutshell, is the dichotomy between what people are really using CUDA for and what you want it to do for you.

Talonman,

Among other embarrassingly parallel problems, there are brute-forcers in cryptography. They’re debatable in terms of usefulness, but they’re a good example of a real-world problem that can hit almost the theoretical peak of GPU performance. For nVidia’s GPUs it’s 95%+, which is quite good I think. Also, SHA1 hashing was the main thing in the (nice and funny, btw) Engine Yard contest which was widely discussed on this forum.

I’m kinda bored of “advertising” my hash cracker here again, so just search for “hash gpu cracker” with Google; there are plenty of them around, as it’s really easy to write such a program and it’s kinda fun to see millions and even billions of hashes processed per second. Most well-programmed hash crackers can utilize every GPU in the system, and that means as much as 8x for nVidia.
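
The multi-GPU part of such crackers is also about as simple as it gets. A rough sketch of the idea (hypothetical kernel and made-up sizes, error checking omitted, written against a current CUDA runtime): give every GPU its own slice of the candidate keyspace and let it grind away, with no communication between GPUs at all.

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: each thread tests one candidate from its assigned range.
// (A real cracker would build the password for 'candidate', hash it, and
// compare against the target hash here.)
__global__ void crackRange(unsigned long long first, unsigned long long count,
                           int *found)
{
    unsigned long long idx = blockIdx.x * (unsigned long long)blockDim.x + threadIdx.x;
    if (idx >= count) return;
    unsigned long long candidate = first + idx;
    (void)candidate; (void)found;          // hashing and comparison omitted
}

int main(void)
{
    const unsigned long long keyspace = 1ULL << 24;  // made-up candidate count
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu == 0) return 1;

    unsigned long long perGpu = keyspace / ngpu;     // remainder handling omitted

    // Each GPU grinds through its own slice; nothing ever moves between GPUs,
    // which is why throughput scales almost linearly with GPU count.
    for (int dev = 0; dev < ngpu; ++dev) {
        cudaSetDevice(dev);
        crackRange<<<(unsigned)(perGpu / 256), 256>>>(dev * perGpu, perGpu, nullptr);
    }

    for (int dev = 0; dev < ngpu; ++dev) {           // wait for every slice
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }
    return 0;
}
```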

With your GTX295 + GTX280 config you’ll see around 3*700M = 2100M MD5 hashes/sec. Which is, I guess, erhm (well, OK, I won’t use the MMO word here as it’s a different kind of forum).

And now the main question – so what? What do that 2100M and 3x GPUs running at 100% mean to you? Just pretty-looking numbers without any knowledge of what lies behind them? No offense, but it’s pretty pointless imho.

I’ve seen on some forums that people try to use my hash cracker as a benchmarking tool. And the results, tbh, are just boring – it scales almost linearly with GPU count and shader speed. 4x GTX295? OK, it’ll be around 5600M MD5 hashes/sec, that’s all, end of story. Also, the ATI vs nVidia comparison for integer math doesn’t look good at all; better to just skip it.

Certainly multi-GPU is anything but trivial, since it’s really equivalent to traditional supercomputing - multiple isolated systems (each GPU) that can only share data over a slowish network (PCIe bus). That being said, right now, it’s harder than it could be.

First off, the requirement of separate host threads for each GPU baffles me. While I certainly think it would be a good option, a lot of the time a model like that will actually crush performance. A case in point is any sort of master/slave system. If there aren’t sufficient CPU cores for all threads to be active at once, you get truly horrible latency between the master dispatching a command and the slave receiving it. As in truly horrible: I’ve measured on the order of tens of ms. This is because the program is completely at the mercy of the OS thread scheduler for getting the master thread to run when there’s an idle slave, and for getting the slave thread to run in order to invoke its latest kernel. Now, you can improve this a fair bit using aggressive thread priority juggling and switching, but it still suffers from the underlying highly variable latency from master to slave. I haven’t been able to measure this accurately, since the profiler appears to suppress the yield-on-sync flag and thus returns results of 50% idle time, but I estimate that I still get 5-20% GPU idle time with flam4 on my system, which varies wildly with system activity as well as causes unknown. This is anything but ideal. The point is that if I could simply drive all GPUs directly from the master thread, this inefficiency would vanish.
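
For reference, the pattern I’m describing looks roughly like this (a simplified sketch, not flam4’s actual code; the kernel and sizes are placeholders): one persistent worker thread per GPU, blocking on a queue of commands from the master. Every hand-off through that queue is a spot where the OS scheduler can simply decide not to run the thread that just became ready.

```cpp
#include <cuda_runtime.h>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Placeholder kernel standing in for the real per-frame work.
__global__ void doChunk(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// One command queue per GPU worker: the master pushes jobs, the worker pops them.
struct WorkQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void()>> jobs;
    bool done = false;
};

// Worker thread: owns its device and waits for commands from the master.
// The wait/notify hop is exactly where scheduler latency creeps in when there
// are fewer idle CPU cores than runnable threads.
static void worker(int device, WorkQueue *wq)
{
    cudaSetDevice(device);
    for (;;) {
        std::function<void()> job;
        {
            std::unique_lock<std::mutex> lk(wq->m);
            wq->cv.wait(lk, [&] { return wq->done || !wq->jobs.empty(); });
            if (wq->jobs.empty()) return;          // finished and drained
            job = std::move(wq->jobs.front());
            wq->jobs.pop();
        }
        job();                                     // launch kernel, sync, report
    }
}

int main()
{
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);

    std::vector<WorkQueue> queues(ngpu);
    std::vector<std::thread> workers;
    for (int d = 0; d < ngpu; ++d)
        workers.emplace_back(worker, d, &queues[d]);

    // Master: dispatch one command per GPU per "frame"; a real app would then
    // wait for the results -- another scheduler round-trip per GPU per frame.
    for (int frame = 0; frame < 100; ++frame) {
        for (int d = 0; d < ngpu; ++d) {
            std::lock_guard<std::mutex> lk(queues[d].m);
            queues[d].jobs.push([] {
                doChunk<<<256, 256>>>(nullptr, 0);  // placeholder launch
                cudaDeviceSynchronize();
            });
            queues[d].cv.notify_one();
        }
    }

    for (int d = 0; d < ngpu; ++d) {               // shut the workers down
        std::lock_guard<std::mutex> lk(queues[d].m);
        queues[d].done = true;
        queues[d].cv.notify_one();
    }
    for (auto &t : workers) t.join();
    return 0;
}
```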

A second, but lesser, problem is that there is no way to query which device is physically connected to the monitor that the CUDA program window is on. This is quite relevant when there’s OpenGL or DirectX interop, since if you pick the wrong device to merge results on, then the results have to go back over the PCIe bus a second time to reach the device that can actually display them.

I totally agree that things could have been made easier, and I think nVidia added a few features to assist with this in the latest versions.

Nevertheless, I think sometimes people focus on the host threads too much (or even don’t know how to write a proper multi-threaded CPU application) and then try to run thousands of threads on the GPU.

From my experience, the ratio of host threads to CPU cores is not an issue. First because I try to have a 1:1 ratio, but even with twice as many GPUs as CPU cores I still don’t see any problem - mostly, I guess, because the GPU work takes a few seconds to complete a single task and there are a lot of such tasks, so the network/CPU threading/… and even the PCI overhead is masked. I guess it’s an application issue.

One thing that came to mind while reading your and avidday’s responses was that maybe the GPUs in one system can indeed be seen as one logical “big” GPU, much like you can do with CPUs today - with IB (InfiniBand) and products like ScaleMP (http://www.scalemp.com/).

When you run your task on one GPU you hopefully open as many blocks/threads as you can and use as much memory as you can - what if you could see 4 * 4GB of GPU RAM as a flat 16GB and 4 * 240 streaming cores as 960 cores, and have the CUDA driver/ScaleMP/… worry about how to distribute the task for you?

What do you think?

eyal


Thanks once again for your response. It is crystal clear in my mind now that the system sees three discrete GPUs, each with separate memory…

I simply was not giving much thought to how a single app’s workload would be distributed among the 3.

I was thinking that loading up 4 cores on a CPU is not black magic, so 3 GPUs should not be a major issue…

I forgot, or wasn’t thinking about, the fact that they don’t have shared memory and have to communicate over PCI Express.

The PCI thing to me would again depend on what the app was doing, and exactly how much data was required to travel over the bus per frame.

If we’re talking about just (or mostly) results traveling back, I would think having three discrete GPUs, each with separate memory, would still help performance a lot?

(Providing it’s used, and the work could be divided.)

I simply knew all 3 of my GPUs were now accessible to CUDA, with me keeping my 295 in SLI mode and my 280 in PhysX mode, which is correct…

I just didn’t give the problems with work distribution enough thought.

Good trivia on PCI being faster than the SLI link.

All good info, thanks again.

Yep, as I have posted, I just want to process on more than 1/2 a 295: in my dream benchmark or application, in folding, PhysX, ray tracing, and whatever else might come down the road…

“And that, in a nutshell, is the dichotomy between what people are really using CUDA for and what you want it to do for you.”

I hope I am not the only Windows 7 user who would like CUDA apps to use all 3 of the CUDA-ready GPUs installed in his system…

Benchmarking apps can be valuable tools. They do give us an indication of how well some apps may perform.

I would like to see how much faster 720 stream processors running at 1512MHz are, in both single- and double-precision calculation speed, than my Q6600 @ 3.81GHz running on all 4 cores.

For a GPU fan like me, that’s enjoyable…

I like to try to wrap my mind around what running my GPU config with CUDA could give me over what my CPU could provide alone.

I still hope the answer is: Animal speed in the apps that can put it to use. If that is the case… Then Go CUDA!

A few performance numbers on what your system can generate when running on all GPUs wouldn’t hurt the ‘GPU Revolution’ idea either…

I feel like I need to keep stressing that I want CUDA apps and benchmarks. I think CUDA has a bright future right along with GPU computing.

I’m not a bad person because I want to see all 3 of my GPUs get used…

If there is anything Nvidia can do to help an app’s workload be distributed over your GPUs, I do hope they make every effort to do so.

It would need to be a front-end app, before CUDA, correct?

A CUDA Workload Manager Program… ?

The idea that ‘snapping another CUDA-ready GPU into the system will give you added performance in GPU-accelerated CUDA apps’ works so much better if we have a valid/workable way to load them puppies up.

There must be a way!

I wonder if it would help any if Nvidia’s driver had a built-in GPU sniffer that always reported how many CUDA-ready GPUs were installed in the system, and their status as to whether they were just waiting for work or busy. This information could be written to a fixed register that anybody’s CUDA app could read, to instantly know what’s available to send work to.
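
(As far as I can tell, part of that already exists: any CUDA app can ask the runtime how many CUDA devices are installed and what they are; what it doesn’t report is whether a device is currently busy. A minimal sketch of that query:)

```cpp
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);   // how many CUDA-capable GPUs the driver sees

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, %d multiprocessors, %.0f MB, compute %d.%d\n",
               dev, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0),
               prop.major, prop.minor);
    }
    return 0;
}
```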

Thanks for the tip, I will give it a look.

And systems with racks and racks of GPUs are shared between lots and lots of people. Every GPU is usually doing something different. And when a researcher like me goes to submit jobs to run, we typically submit dozens of independent jobs which then wait in the queue for free nodes to run on :) If watching a job run for 6 days is as fun as watching paint dry, where does that put watching your 6-day job while it sits in the queue?

MD isn’t completely embarrassingly parallel. A majority of the inner loop can be computed in an embarrassingly parallel way across many GPUs. But at every iteration (there are 1000+ iterations per second in a fast simulation), data needs to be shared between the GPUs. The size of the data is small (maybe only 1MB, depending on the system size), but that is enough to bring HOOMD on many GPUs way down in speed; hence the benchmarks on the hoomd webpage get worse and worse for systems with narrow PCIe lanes. Someday, I need to redesign the way hoomd runs on multiple GPUs to limit that communication. But that is low on my priority list now. There is just so much science we can do with single-GPU performance alone!
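
As a rough back-of-the-envelope illustration (my approximate numbers): 1MB shared per step at 1000 steps per second is on the order of 1 GB/s of traffic per GPU, before counting per-transfer latency. A PCIe 2.0 x16 slot delivers maybe 5-6 GB/s in practice, while an x4 slot is closer to 1-1.5 GB/s, so on narrow lanes that exchange alone can eat most of a step’s time budget.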

And to the OP, I do know of one benchmark that will do what you want it to :)

VMD (http://www.ks.uiuc.edu/Research/vmd/) has an ion placement module. I don’t know how to run it, but there must be an example or demo script somewhere on the net, maybe in the VMD documentation. It performs a huge number of embarrassingly parallel computations. Its inner loop is so tight that it will nearly max out the GPU’s peak GFLOPS. And it automatically uses multiple GPUs!

Been there, bought the t-shirt :) Probably the only thing better than having to wait 6 days for your job to percolate to the top of the queue is then discovering that the submission script has a typo, or that the directories holding the files the job needs are temporarily offline, or the multitude of other things that can go wrong. It was all of that stuff that prompted our group to put together a cluster for our own exclusive use. It is humble by most computer centre standards, but it is ours to do with as we please…

I presumed I would get called on that. I don’t really know anything about MD, but I ran HOOMD quite a bit during our cluster commissioning as a burn-in job, so thanks for making the code available, even for laymen like myself to misuse.

Thanks!

Checking out: VMD (Windows XP/Vista/7 (32-bit) with OpenGL and CUDA)

So far so good…

Looks like it found all 3 GPUs.

Now I just need to figure out how to load it up and see if my clock speeds jump in Precision.

UPDATE: This app does indeed use all 3 of my GPUs by default.

Impressive.

All 3 of my GPUs were at their 3D performance clock speeds until I shut the app down.

Detected 3 available CUDA accelerators…

Multithreading available, 4 CPUs detected…

Now that’s what I’m hoping to see more of.

Well done!!

Thanks again. :)

That’s all I want to see happen: guys with more GPUs getting better performance in CUDA GPU-accelerated apps than guys running 1.

It’s only just, and good for Nvidia.

It may not be practical in some cases (I can understand that, due to the nature of the app), but it is exciting for me to see that it can be practical in others.

It will help to move the GPU Revolution forward.

It feels so nice to have your extra cards both acknowledged, and appreciated by the app. (All CPU cores too for that matter.)

Total System Availability for the app = Max Performance that your system can give… Sweet.

Now how can I benchmark this puppy?

Tutorials from SC09 have been posted. You can look at page 5 here for some benchmarks:
[url=“http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_luebke_Intro.pdf”]http://gpgpu.org/wp/wp-content/uploads/200...uebke_Intro.pdf[/url]

Well, I thought it would have been easier to find a VMD GPU benchmark script. John says he will share his when he has time:
[url=“http://www.ks.uiuc.edu/Research/vmd/mailing_list/vmd-l/14185.html”]http://www.ks.uiuc.edu/Research/vmd/mailin...md-l/14185.html[/url]

This page discusses all the GPU optimized features in VMD:
[url=“GPU Acceleration of Molecular Modeling Applications”]http://www.ks.uiuc.edu/Research/gpu/[/url]

Thanks… GPU’s RULE! :)

Thanks for the info…

BTW - Just thrilled about reading this from your second link:

“The direct Coulomb summation algorithm implemented in VMD is an exemplary case for multi-GPU acceleration. The scaling efficiency for direct summation across multiple GPUs is nearly perfect – the use of 4 GPUs delivers almost exactly 4X performance increase. A single GPU evaluates up to 39 billion atom potentials per second, performing 290 GFLOPS of floating point arithmetic. With the use of four GPUs, total performance increases to 157 billion atom potentials per second and 1.156 TFLOPS of floating point arithmetic, for a multi-GPU speedup of 3.99 and a scaling efficiency of 99.7%, as recently reported. To match this level of performance using CPUs, hundreds of state-of-the-art CPU cores would be required, along with their attendant cabling, power, and cooling requirements. While only one of the first steps in our exploration of the use of multiple GPUs, this result clearly demonstrates that it is possible to harness multiple GPUs in a single system with high efficiency.”

I would love to see John’s benchmark script too, when he has some time to share… B)

With this program already having automatic GPU recognition for both SLI and PhysX, and a reported scaling efficiency of 99.7% with up to (4) GPUs…

The way I see it, we’re one script and a timer away from having ourselves an outstanding system benchmarking utility, with graphics included!

No Joke!

Don’t get me wrong…

As a “high-performance, cross-platform molecular graphics viewer, used for (among other tasks) displaying static and dynamic structures, viewing sequence information, and for structure generation and dynamic analysis” app, I can already tell it’s the best out there!

I do respect that, and also just love that it’s CUDA.

But it still looks like a killer benchmark app to me.

I don’t know if it’s an indication of good performance yet, but if I fire up FRAPS, the default animation gets 60FPS no matter how big I make the screen…

(Must have Vsync on by default.)

In the Nvidia Control Panel for this application, I went in and set:

Vertical sync to: Force off…

Anisotropic filtering to: 16x

Antialiasing Mode to: Enhance the application

Antialiasing setting to: 4X

Antialiasing Transparency to: Supersampling

Texture filter: On, Clamp, and Quality settings.

Now it’s pegged at 100… :)

Bet she’s gunna be a runner!

Update: Loading in the PDB (Protein Data Bank) file for ‘coordinates of the atoms in myoglobin’…

And sizing it to fill my window, I get anywhere from 10 FPS to the upper 60s, normally in the 30s I’d say.

For an extra wide shot like this, it dropped down to 18FPS.

I just rotate it to a different viewing angle, and I’m back up in the upper 60’s while still using the exact same PDB…

Love it…

It does still feel smooth and responds well even when it is only generating 18FPS.

Once again, glad all my GPUs were allowed to do their part toeing the line.

I don’t think it’s necessary to have the fastest system, only that you get the best performance with what you have.

I would love to see your SLI/PhysX GPUs auto-configure, and the work be efficiently distributed among them, when running Folding@home.

Rebooting and taking your GPUs out of SLI mode when you want to fold, and then having to put them back for gaming, is kind of a drag. I’d much rather not have to.

One instance of protein folding, processing on (3) GPUs, would fly.

Or go for the gold, add the CPU’s cores into the processing pool too, and load them all up in the same app. :)

We would need a GUI with sound effects every time we took another GPU or CPU core on- or offline.

Independent clicking control over each CPU core and GPU would be extremely gratifying.

Engage the 280!

Bring 2 more cores of the Q6600 online!

Check how much that reduced our estimated current WU’s completion time…

Quick, give me a reading on how our system’s total output changed in PPD…

Muhahahhahhaaaa!!

Heck, some would fold a few Work Units because it was fun, and they wanted to see what their systems could generate.

A Work Unit would no longer be a multi-hour commitment.

If possible, we would need a small entertaining graphic to stare at, representing the incredible number crunching that is going on behind the scenes.

It would need low system overhead; we wouldn’t want to slow our folding down.

For me maybe a 295, connected to a 280, connected to a CPU showing (4) cores, and some stats if possible on current system output.

Sorry, I’m rambling. But yep, I’m a user!

All I can say is, Fine job University of Illinois on this gem…

I think you did er right. ;)

If you added a ‘System Benchmark’ button, ran some demanding script, and then gave us a few fun statistics on the calculations…

It could become the hottest Windows 7 CUDA benchmark out there, and one of the first apps to show added performance from running multiple GPUs.

That’s huge in my book!

I loved the easy install, and that all 3 of my CUDA accelerators auto-detected and were successfully added into the CUDA device pool. (SLI and PhysX!)

Most important to me is that all GPUs were used by the app.

Real GPU Revolution Stuff…

I just found this…

http://www.ks.uiuc.edu/Research/vmd/vmd-1.7/devel.html

VMD 1.7 Development Status

VMD 1.7 Released (8/1/2001)

• Updated Windows installer files with new release notes, etc.

• Cranked version numbers to 1.7.

• Fixed missing NULL as last argument to Tcl_AppendResult; error messages in transvec and transvecinv would cause VMD to crash.

• Significant updates to the VMD benchmark scripts.

If VMD has any built-in benchmark scripts, I have yet to find them…

I did find the VMD Script Library:

http://www.ks.uiuc.edu/Research/vmd/script_library/

Still not sure how to load and run a script on VMD. I would love to.

I do have a chess engine running on a single CPU, using many threads.

I could adjust it so that it runs on multiple GPUs :)

I would be thrilled to see your chess engine running on multiple GPUs…

(Especially if it used all 3 of my GPUs!)

The best would be to have an option for CPU or GPU.

I would like to know how my (3) 200 Series GPUs would do time-wise, compared to my Q6600 @ 3.81GHz, processing chess.

CUDA Chess would be a fun app! Could the CPU play the GPU? We might be able to decide the winner of the entire GPU Revolution thing real soon!

Just for the record, I did try to check out HOOMD.

I used the guide…

Installing HOOMD-blue in Windows

1. Download the HOOMD-blue installer from http://codeblue.umich.edu/hoomd-blue/

2. Install Python 2.5.x using the installer at http://www.python.org/download/

3. Install the latest drivers for your GPU from http://www.nvidia.com

4. Uninstall any previous version of HOOMD/HOOMD-blue before continuing

5. Double click on the HOOMD-blue installer and follow the on-screen steps

HOOMD-blue should now show up on your Start menu and the .hoomd file type is registered to execute the script when you double-click on one. The command hoomd can also be run on the command line to start the HOOMD-blue python interpreter.

The start menu also includes links to the standard benchmark scripts.

Check out the Quick Start Tutorial to learn how to use hoomd.

Steps 1-5 went fine, but running from the Start menu gives me a missing DLL error; same with entering hoomd on my command line.

I tried clicking on a hoomd file in the bin directory, but it also doesn’t know I already have Python 2.6.4 installed. (Just for this app.) :)


And how exactly is python 2.6.4 an installation of python 2.5.x, hmm :)

Though I do see now that python no longer offers 2.5.x for download (http://www.python.org/download/). I will post an updated build of hoomd sometime in the next couple of days, one that will work with python 2.6.

Seriously, python has such a fast-paced development cycle and obsoletes old versions waaayyy too fast that it is impossible to keep up. Systems that run a version of RHEL (most of the clusters out there) are still on python 2.4!! So I need to support 2.4, 2.5, and 2.6 in hoomd, and probably need to start thinking about 3.0 sometime soon before they obsolete all the 2.x versions.

Thanks for the info…

I look forward to the update.

I have a chess engine running on CUDA. I could adjust it so that it uses more GPU cards.

Outstanding!!

Yes, please do…

I want both sides of my 295, and my 280, running it.

Thanks for the update.

It was recommended to me on Xtreme Systems that NAMD would be a better program to benchmark than VMD.

I still need a benchmarking script, and info on how to run it for VMD, so I thought I would try out NAMD.

I didn’t have much luck with NAMD running on Windows 7.

I am not sure if NAMD will use all my GPUs like VMD does or not…
Or if NAMD has a benchmarking script I could run.
Is NAMD CUDA?

(I really still want a benchmarking script for VMD. It’s my favorite!) :)
Or the addition of the ‘System Benchmark’ button.