CUDA Cluster Programming – Anyone Experienced?

Hi All,

Have any of you worked with CUDA clusters?

Does NVIDIA sell CUDA clusters? If so, is there a cluster API or something like that, something like MPI? I am eager to know about the topology, software, and libraries that make it work.

If I want to run my app on CUDA clusters, should I link against a different set of libraries?

Can someone tell me about the “Best Programming Practices” on a CUDA cluster?

Would greatly appreciate even the smallest detail on CUDA clusters that you can share.

Thank you

Best Regards,

What do you mean by a CUDA cluster? A distributed memory cluster with GPUs attached to every node?

Stony Brook’s Visual Computing Center has a GPU cluster that uses MPI for communication. I have read a paper from Zhe Fan, Feng Qiu, Arie Kaufman, and Suzanne Yoakum-Stover of Stony Brook on GPU cluster computing.

This is what I see at the top of the paper: “ACM/IEEE Supercomputing Conference 2004, November 06-12, Pittsburgh, PA”

I don’t think I can divulge info on their paper, as I am not sure whether I got it from the company’s IEEE/ACM account or from the net. Google it; you may be able to find it.

They did that in 2004 – I think without CUDA. Also, they did not use PCI-E in those days. The paper says that PCI-E would be available at the end of that year…

BUT now we have everything – CUDA, PCI-E… I was wondering if there are CUDA cluster implementations that allow applications to spawn multiple kernels on multiple nodes of the cluster, etc.

I am sure these kinds of clusters exist. I am wondering what library, drivers, and other cluster software are used for this purpose.

Any ideas?

Best Regards,


Umm, just build a normal distributed memory cluster with all the standard hardware (i.e. InfiniBand) and software (i.e. MPI and a job scheduler). Then put a Tesla C870, D870, or S870 on each node. There really isn’t anything special you have to do CUDA-wise; all the standard stuff will work. Given the raw speed of the GPU, I would get the fastest interconnect that money can buy so that the internode communication isn’t slowing the calculation down too much.

If you have more than one GPU per node, then you’ll have to add something to the job scheduler so that your MPI process knows which device to select with cudaSetDevice(). I think this kind of addition is possible in PBS/Torque, but I’ve never tried anything like it myself.
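A minimal sketch of that rank-to-device mapping, assuming ranks are assigned round-robin across nodes and a made-up GPUS_PER_NODE constant that your scheduler setup would really determine:

```c
/* Sketch: each MPI process on a node claims a different GPU.
 * GPUS_PER_NODE is a hypothetical compile-time assumption; in a real
 * cluster the scheduler or an environment variable would supply it. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define GPUS_PER_NODE 2  /* assumption: e.g. a Tesla D870 (two GPUs) per node */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* With round-robin rank placement, processes sharing a node
     * get distinct device numbers. */
    int device = rank % GPUS_PER_NODE;
    cudaSetDevice(device);
    printf("rank %d using GPU %d\n", rank, device);

    /* ... launch kernels on the chosen device ... */

    MPI_Finalize();
    return 0;
}
```

This breaks down if the scheduler packs ranks onto nodes in a different order, which is why the scheduler-side bookkeeping mentioned above matters.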

Does MPI allow you to spawn GPU kernels on multiple nodes of a cluster?

I am a bit confused about this part, as I don’t know MPI.

I want my app to do this:


And I would assume that A and B could run on different GPUs on different (or the same) cluster nodes.

NOTE: This model would require me to link my CUDA code against a CUDA-MPI adapter library. This would enable the CUDA code to run on clusters seamlessly, without hitches. Just a re-compilation would do. That sounds rosy, doesn’t it?

I have never written an MPI-based cluster app myself. Can you show a bare-bones skeletal model of code that would take advantage of a cluster?

Will a cluster enable an app to run different code on different nodes?
Does a cluster provide cluster services that an application can blindly take advantage of?

Sorry for a confused reply. Hoping to get some clarity…

Best Regards,

I think I once saw something on the NVIDIA website about an NVIDIA cluster with 64 nodes, but I’m not sure about it; it was a very long time ago. Maybe someone from NVIDIA can comment on this?

MPI will not let you do this: there is no “magic” MPI-CUDA adapter that can make your implementation details disappear.


MPI has absolutely no relation to CUDA. It is a way to write code that runs on a distributed-memory machine and communicates efficiently between processes. Since each MPI process runs independently, you can have each process drive a separate GPU.

MPI is similar to a CUDA kernel in that you write one program and that same program is executed by every process. So you would branch on the rank of the current process in the job (like threadIdx.x in CUDA) to decide which kernel to call.
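A sketch of that rank-based branching, using the A/B kernel split from the earlier post (launch_kernelA/launch_kernelB are hypothetical host wrappers around CUDA kernel launches):

```c
/* Sketch: the same binary runs on every node; the MPI rank decides
 * which CUDA kernel this process launches. */
#include <mpi.h>

void launch_kernelA(void);  /* hypothetical wrapper around kernelA<<<...>>>() */
void launch_kernelB(void);  /* hypothetical wrapper around kernelB<<<...>>>() */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        launch_kernelA();   /* one process runs kernel A ... */
    else
        launch_kernelB();   /* ... the rest run kernel B */

    MPI_Finalize();
    return 0;
}
```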

MPI also has a large number of functions for efficient internode communication.

If you want to learn about MPI, the internet has some good tutorials. I’ve never written an MPI program myself, I just know a lot about it.

Understand that you need some means of communicating the state of the kernels between the PCs in the cluster (else how could a cluster-wide cudaThreadSynchronize() work?). CUDA has no such means. PBS and Torque have no such means either: they have no CUDA interface. So you’re going to have to cook that bit up yourself, perhaps at some higher level in the code: the CPU app could manage it, using MPI sync primitives. But MPI doesn’t know anything about CUDA.

Go buy a copy of “Using MPI, Second Edition” by Gropp, Lusk, and Skjellum, and start reading. It is not as simple as your question implies you believe, but it’s also not terribly complex. FWIW, your second case (a cluster providing services an app can blindly take advantage of) is closer to reality. However, you cannot get much benefit out of blindly using a cluster.

Here is a hopefully more fair answer than the first one I posted:

A cluster of computers is generally some number N of nodes - each comprising CPU(s), memory, network interface(s), maybe disk or maybe diskless - where all N nodes are more or less fully functional computers in their own right. Each node runs an operating system, and normally all the nodes are configured to run the same OS configured exactly the same way (mod IP address or infiniband numbers and such).

You have a scheduler that runs jobs for you: you request of the scheduler M number of nodes for your job and you give the scheduler some kind of batch file script that invokes the executable you want run. The scheduler finds M nodes and runs your job by running your batch script on each of the M nodes. Some schedulers do this via ssh and host keys, others do it via a daemon on each node. As far as it goes, that is a cluster (other people may point out that there are MPI versions that include a multi-node launcher called mpirun and which don’t strictly speaking need a scheduler, but this is IMHO a fine point).
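To make the scheduler step concrete, here is a hypothetical PBS/Torque batch script of the kind described above; the node count, queue name, and executable name are all made up:

```shell
#!/bin/sh
# Hypothetical Torque/PBS batch script: request 8 nodes and launch
# one MPI process per node. Queue name and binary are placeholders.
#PBS -l nodes=8
#PBS -q gpu

cd "$PBS_O_WORKDIR"           # run from the directory the job was submitted in
mpirun -np 8 ./my_cuda_app    # scheduler-provided host list tells mpirun where
```

You would submit this with something like `qsub job.sh`; the scheduler finds the nodes and runs the script on your behalf.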

If you want your M processes to communicate with each other, the cluster isn’t generally going to help you: it is just a scheduler launching processes on a bunch of computers. For IPC you need something else that knows how to communicate across Ethernet, InfiniBand, Quadrics, or whatever your cluster is wired with.

Historically, lots of ways to do this were tried. One method that has become very successful and has gained a lot of support is called MPI. MPI is nothing more or less than a spec for an API (fairly language neutral) that lets you move data between your M processes in an M:1, 1:M, or P:Q manner (as long as 0 < P, Q <= M ;-), and also do things like synchronize the execution of all the processes or block/lock read or write access to data among the processes. If you have InfiniBand or the like for your cluster network, then you can get MPI implementations that do the IPC over InfiniBand and skip TCP/IP altogether, making for very fast, low-latency comms between nodes. MPI is perhaps most used via a C language link library and headers.
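For the bare-bones skeleton asked about earlier, a minimal point-to-point MPI program looks something like this (no CUDA involved yet; this is plain MPI C):

```c
/* Minimal MPI sketch: rank 0 sends an array to rank 1.
 * Run with e.g. "mpirun -np 2 ./a.out". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    int data[4] = {1, 2, 3, 4};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* 4 ints, destination rank 1, message tag 0 */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int recv[4];
        MPI_Recv(recv, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 got %d %d %d %d\n", recv[0], recv[1], recv[2], recv[3]);
    }

    MPI_Finalize();
    return 0;
}
```

The same pattern scales to the collective operations (broadcast, scatter/gather, reduce) that real cluster codes lean on.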

So MPI helps you write a parallel program that runs on multiple separate computers and write it in a manner rather reminiscent of pthreads or of fork/exec+UNIX-IPC but with the extension that the IPC and sync primitives work across all the participating physically distinct computers in the job. But just like the OS on the computers, MPI rather assumes that the computers in question are what it is dealing with.

You propose to throw GPUs into the computers and have MPI know what to do with them. This does not exist currently, AFAIK. There is perhaps no reason why some MPI implementation could not be extended to know what to do with GPUs as well as CPUs, but I do not think anyone has done this.

You can take a two-step approach that ought to work: an MPI program also linked against CUDA, where each instance of the program invokes its kernel, waits using cudaThreadSynchronize(), and on return uses one of the MPI sync methods to make all processes wait until all kernels are finished.
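That two-step approach might look roughly like this; my_kernel and the buffer sizes are hypothetical, and cudaThreadSynchronize() is the CUDA 1.x-era call named above:

```c
/* Sketch of the two-step approach: each MPI process runs its own
 * kernel, waits for its GPU, then all processes barrier together
 * before the job as a whole moves on. */
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void my_kernel(float *d_data)
{
    /* ... device work on d_data ... */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    float *d_data;
    cudaMalloc((void **)&d_data, 1024 * sizeof(float));

    my_kernel<<<64, 256>>>(d_data);   /* step 1: launch on this node's GPU */
    cudaThreadSynchronize();          /* wait for the local kernel to finish */

    MPI_Barrier(MPI_COMM_WORLD);      /* step 2: wait for every node's kernel */

    /* ... exchange results over MPI, launch the next phase, etc. ... */

    cudaFree(d_data);
    MPI_Finalize();
    return 0;
}
```

Note the barrier only synchronizes execution; moving results between nodes still means copying device memory back to the host and sending it with MPI.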

Yet there is also no reason why identical binaries in a cluster job need to follow the same execution path: they certainly do not execute in lockstep like threads in a CUDA kernel. They are independent instances of the same program, but where they go may be determined by the data they operate on. So it might make more sense to use the GPU as a coprocessor for a traditional CPU executable in a cluster than to require all kernels to end before the job as a whole moves forward.

But then again, it might not. It kinda depends what you want to do. :thumbup:

Thanks for your time, Mr. Anderson and Emerth. It was really enlightening.

I have to tell you that your explanations answered all my questions and doubts about clusters. They were really useful. Thanks a lot.

Thank you guys,

Best Regards,

I just read through the paper you mentioned in the first post. They describe a standard cluster using gigabit Ethernet where the nodes are x86 machines, each with some graphics-specific add-in hardware: a GPU and a “volume rendering hardware” board. But the program flow is regular x86 programming and the IPC is MPI. The GPUs are not directing the program flow; they are used to speed up parts of the calculation.

Thanks for your note. I think they did not even have CUDA at that point. Must have been very difficult for them.

CUDA Rocks Roll