Multi-GPU: A must in HPC?


In this article (h_in_2012_Jon_Peddie_Research.html), the author states that in 2012 nearly half of all PCs will use a multi-GPU system. I suspect that for computational machines this percentage would be much higher (can anyone corroborate this, and give me a rough estimate of the percentage?). Given this, will most CUDA codes in HPC (an area that I’m interested in) use multiple GPUs in the near future? Also, will it become easier for programmers to write CUDA code for multi-GPU? Thanks!

When I first started CUDA development, I had a feeling that multi-GPU was kind of fringe and rare, not too critical.

But I’ve learned that in practice it’s a great multiplier. IF you design around the idea of multi-GPU from the start, you can get really huge scaling speedups.
So now my default expectation is to use 4 GPUs (two GTX 295 boards) for my apps.

If I spend 3 months getting a big app designed, coded, and debugged, why shouldn’t I spend an extra $300 at the end to double its speed?

Sure, some apps likely won’t scale well with multiple GPUs, but the ones that CAN scale SHOULD be designed to scale.
Just think about it at initial design time… sometimes you can splice multi-GPU support in later, sometimes you can’t.

Multi-GPU issues are overrated, in my opinion. If someone can’t manage opening a CPU/host thread, calling cudaSetDevice on it, keeping the thread alive and kicking as long as needed, and running the kernels on that thread - I’m not sure he can cope with thousands of concurrent threads on the GPU.
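The "one long-lived host thread per GPU" recipe the poster describes can be sketched generically. Below is a minimal, hypothetical Python sketch with no real CUDA calls: each worker thread owns one device index for its whole lifetime, and the comment marks where, with the CUDA runtime, you would call cudaSetDevice once and then issue every kernel launch for that GPU from the same thread. The squaring "kernel" is a stand-in for real device work.

```python
import threading

def worker(dev, chunk, results):
    # With CUDA you would call cudaSetDevice(dev) here, ONCE, and reuse
    # this thread for every kernel launch on that device from now on.
    # Stand-in "kernel": square each element of this device's chunk.
    results[dev] = [x * x for x in chunk]

def run_multi_gpu(data, num_devices):
    # Partition the input so each "device" owns a contiguous chunk.
    n = (len(data) + num_devices - 1) // num_devices
    chunks = [data[i * n:(i + 1) * n] for i in range(num_devices)]
    results = [None] * num_devices
    threads = [threading.Thread(target=worker, args=(d, chunks[d], results))
               for d in range(num_devices)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # keep the threads alive until all device work is done
    return [y for chunk in results for y in chunk]

print(run_multi_gpu([1, 2, 3, 4], 2))  # -> [1, 4, 9, 16]
```

The point of keeping the thread alive is that the CUDA context is bound to it; tearing the thread down between launches would throw the context away.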

As for multi-GPU in HPC - this is of course already the trend. I have 2 S1070s (8 GPUs) connected to one machine in production and am looking at 3 S1070s (12 GPUs) per machine. It cuts the hardware cost: the more GPUs per host machine, the less money you put into the hardware (as long as it’s working fine :) )


I think that article is a bit misleading. There will be a lot of multi-GPU PCs, but not so many with dual high-performance add-in board GPUs. Both major players in the desktop CPU market are rapidly going down the path of integrating a GPU onto the CPU socket or silicon. It makes sense that “hybrid” schemes will become the norm, using the high-performance card for 3D or GPGPU, and the closely coupled GPU for 2D display, video and other stuff which doesn’t need a lot of shaders or can be done efficiently in fixed-function logic. NVIDIA have announced, or are about to announce, “Optimus”, which is basically a way of making the Arrandale/Clarksdale GPU coexist with their add-in boards and discrete mobile GPUs. AMD/ATI already have something similar (I’ve forgotten what it is called, PowerXpress maybe?).

Certainly multi GPU for HPC is and will continue to be important, but I am not sure the desktop version of it will look how you might imagine.

IMO it’s not just about launching a host thread and dealing with the API; it’s more about partitioning the algorithm (the one implemented in kernels). Sometimes it’s trivial, sometimes it’s nasty and won’t scale well (e.g. how would you go about multi-GPUing n-body?)

The article doesn’t really cite any reliable sources, but I’m afraid consumer desktops won’t have multiple GPUs anytime soon. It’s a chicken-and-egg problem: the market is driven by video games, and developers know gamers don’t have multiple GPUs (or at most a fraction of a percent do), so they don’t design their games with multi-GPU in mind, and there’s no incentive to buy more GPUs if games don’t support them.

HPC is a completely different story.

Don’t forget the low end, though. Buy an Apple MacBook and you get both a 9400M and a 9600M GT GPU… giving flexibility, especially in terms of wattage. NVIDIA’s brand-new Optimus technology, announced today, is also bringing multi-GPU down to the lowest end… even mixing vendors’ GPUs.

So multi-GPU has lots of advantages, not just in the high-end HPC market but also at the low end. Other niches have their own as well… NVIDIA is successfully pushing gamers to buy a second GPU just to devote to PhysX, for example.

Finally, look at Fermi’s architectural direction and you start seeing more versatile kernel scheduling… combine that with its machine-wide shared 64-bit memory space and you start heading toward a future where kernels can even migrate from GPU to GPU as needed. That’s not here yet, but it certainly is a powerful path to follow and I expect to see it happen.

Hmm, using multi-GPU is an order of magnitude harder than using just one and optimizing for it. In fact, logically, each GPU is a node in a local cluster, communicating over a bus (PCI Express instead of Ethernet or anything else), with its own compute and local memory resources.

Using multi-GPU is, for me, of the same order of difficulty as programming a cluster, and is thus also interesting because clusterization of a multi-GPU-enabled program is usually a trivial task (when no PINNED SHARED MEMORY is involved).

And this is probably a path OpenCL may explore, because the “kernel” abstraction, where the input, the output and the algorithm representation (whether in C, C++, PTX or anything else) are buffered (or may be) and isolated from the rest of the computer’s resources, is the basis of cluster communication.

PS: I own a MacBook Pro 17" with a GeForce 9400M and a GeForce 9600M GT, different GPUs with different capabilities (PINNED MAPPED MEMORY on the 9400M IGP, MCP79; a 2x to 3x performance gain on the 9600M GT), and it’s incredibly challenging to use them both at once and do dynamic load balancing of kernels to gain up to 30% over the fastest one (the 9600M GT) used as a single GPU :sweat:

will most of the CUDA codes in HPC (an area that I’m interested in) utilize multiple GPUs in the near future? Also, would it become easier for programmers to write CUDA code for multi-GPU?

Yes (it saves money). No (until a common address space across graphics boards becomes available).

For now the programs will be using a hierarchy of languages and libraries, I’m afraid (CUDA + OpenMP or pthreads + MPI, from the low to the high level of the memory architecture).
I looked at the UPC language, which, like other cluster-specific languages and extensions, tries to provide an easy way to operate on non-local data, and I don’t know if it would be easy to use. For one thing, it’s C, not C++, and there are some interesting C++ libraries to be had.
Can anybody comment?

I’m totally not thrilled about OpenCL, so if somebody disagrees, I’d be happy to hear what kind of performance he/she gets in HPC or otherwise, compared with CUDA. ATI cards are of course cheaper and nominally fast, even faster than NVIDIA’s, but in practice (and that means OpenCL) something’s not working right. I can’t find any convincing success story in HPC so far, more like copies of PR emitted by the OpenCL consortium with all sorts of impossible promises. Besides, OpenCL is not solving any HPC problem; possibly a commercial software vendor’s problem of targeting multiple platforms, but not HPC, which should be designed and built from homogeneous elements.

Please put 4 GTX 295s in a box and let me know if your memory and the chips on the motherboard are thermally stable. I know it works OK with 6 GPUs; I’d like to know about 8 GPUs in a home-made machine (half the price of a rack-mounted setup, a bit less RAM).

That’s correct; I was mainly referring to the questions about multi-GPU found in this newsgroup. Many are just problems of creating threads and managing them, not so much “how do I break my algorithm to run on 2 GPUs”…

I haven’t looked at n-body (not my field :) ) - what’s the problem there with regard to multi-GPU?

In my line of business I’ve heard about companies waiting for a 12 GB GPU (12!!! - this is at least 1-2 years away) before they move their code to the GPU - because they think they can’t send the data in chunks (or that it’s a bit harder).

I’ve heard the same argument when moving from 8 instances of the same exe to 1 exe with 8 threads…
Most of the time what I’ve seen is indeed not straightforward (plus the PCI sharing/overhead) but doable…
at least this is my experience :)



The simplified algorithm goes like this:

	loop {
		foreach (Body b) {
			forceVec = (0, 0, 0);
			foreach (Body other) {
				forceVec += computeForce(b.position, other.position);
			}
			b.position += integrate(forceVec);
		}
	}
So, each body needs to access all other bodies’ positions (meaning we have to replicate the entire dataset to all GPUs), and after each iteration all bodies have new positions (which means we need to propagate the changes to all bodies on all GPUs, through PCIe). This boils down to copying the entire dataset back to the host for synchronization and again to all GPUs after each iteration. There’s no way around this that I know of.
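To make that all-to-all synchronization concrete, here is a toy Python sketch (the force function is a hypothetical placeholder, not real physics or CUDA): each "device" updates only its slice of bodies but must read the full, replicated position array, and the host must gather and re-broadcast everything before the next step.

```python
def compute_force(p, q):
    # Placeholder pairwise "force" (not physical): it only illustrates
    # the O(N^2) access pattern where every body reads every other body.
    return q - p

def step_partition(positions, start, end):
    # One "GPU" updates bodies [start, end) but reads EVERY body's position.
    new = []
    for i in range(start, end):
        force = sum(compute_force(positions[i], q) for q in positions)
        new.append(positions[i] + 0.001 * force)  # crude explicit integration
    return new

def multi_gpu_step(positions, num_devices):
    n = len(positions)
    bounds = [(d * n // num_devices, (d + 1) * n // num_devices)
              for d in range(num_devices)]
    # Each device computes its slice from the *replicated* full array...
    partial = [step_partition(positions, s, e) for s, e in bounds]
    # ...then the host gathers every slice and re-broadcasts it next step.
    return [p for part in partial for p in part]

# The partitioned step matches the single-device step exactly:
print(multi_gpu_step([0.0, 1.0, 2.0, 3.0], 2) ==
      multi_gpu_step([0.0, 1.0, 2.0, 3.0], 1))  # True
```

The gather + re-broadcast in `multi_gpu_step` is exactly the per-iteration PCIe traffic the post describes.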

It’s OK but you need N>1000.

N-body is such a compute-intensive task that in this particular case there may be no problem for multi-GPU. Here’s the proof:
imagine having several GPUs that need to communicate via PCIe at an effective speed of X = 2.5 GB/s (PCIe offers a maximum of 6 GB/s, but in this situation it’ll be less). Each GPU will have 1 GB of RAM associated with it, filled with data for N particles. Let’s assume that’s 10 float variables per particle, for a total of 4*10*N = 40*N bytes. To send this data over the bus requires

tx = 40*N/(2.5e9) s = 16*N/1e9 s

plus the latency and overhead, which need to enter the calculation if the amount of data is too small. If you fill all your RAM with data, you need of order tx ~ 1/2.5 = 0.4 s. The question now is: what is tc, the compute time?

Let’s denote the complexity of the algorithm as C [FLOP/particle]. N-body is an N^2 algorithm with not too many operations per particle pair; I’ll assume C ~ 20*N FLOP. N-body can be solved on one GPU with a high efficiency of 600 GFLOP/s (see the SDK example), so

tc = 20*N*N/(600e9) s = (N^2/30e9) s

(notice this gives a surprisingly good estimate of the frames per second in the N-body SDK example: on 1 GPU it does 22 fps on my machine; N^2 = 1e9).
The ratio of communication to computation is

tx:tc = 480/N

Therefore, small-N systems (N < 1000) will not be happy with multi-GPU (also because of the latency of PCIe operations, which we skipped), but large N > 1000 systems will be able to communicate quickly enough, perhaps even to feed the GPUs with data continuously (overlapping computation and transfer). That’s the case with the N = 30000 SDK example.
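The arithmetic above can be checked in a few lines; all constants (2.5 GB/s effective bus, 40 bytes and 20·N FLOP per particle, 600 GFLOP/s per GPU) are the assumptions stated in the post:

```python
# Back-of-envelope n-body transfer vs. compute time, per the post's assumptions.
BUS = 2.5e9           # effective PCIe bandwidth, bytes/s
BYTES_PER_BODY = 40   # 10 floats * 4 bytes
FLOP_PER_PAIR = 20
GPU_FLOPS = 600e9     # sustained GFLOP/s on one GPU

def tx(n):
    return BYTES_PER_BODY * n / BUS          # time to ship all positions, s

def tc(n):
    return FLOP_PER_PAIR * n * n / GPU_FLOPS  # O(N^2) compute time, s

def ratio(n):
    return tx(n) / tc(n)  # simplifies to 480 / n

for n in (100, 480, 1000, 30000):
    print(n, round(ratio(n), 3))  # 4.8, 1.0, 0.48, 0.016
```

The crossover at N ≈ 480 is where transfer and compute take equally long; well above it, transfers can hide behind computation.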

This will of course be the same for other well-coupled algorithms, as long as they do a few dozen flops per cell/particle (they all do).
So it’s not so bad, we can use multi-GPU for those problems!

What’s the situation with CFD? It doesn’t need to transfer all the data, typically just the boundary conditions, which are O(sqrt(N)) bytes. Then we should be OK, maybe… let’s return to this in a second.

But if we want to transfer all the data every time step, for instance to do some kind of combined FFT or something, then it’s much worse.
Multi-GPU may or may not make sense then. If we assume a complexity of C = 100-1000 FLOP per cell in the above calculation
(hopefully covering the range from low-order to high-order schemes), and take into account that there are up to 20 rather than 10 variables per cell in CFD, then
tx:tc ~ 200…20
totally comm-bound :-(
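Plugging the assumed numbers in confirms that 20…200 range (constants as assumed above: 20 variables of 4 bytes per cell, 2.5 GB/s effective bus, 600 GFLOP/s):

```python
# Full-dataset transfer every step: ratio of transfer to compute time per cell.
BUS = 2.5e9              # effective PCIe bandwidth, bytes/s
GPU_FLOPS = 600e9        # sustained GFLOP/s
BYTES_PER_CELL = 20 * 4  # 20 float variables per CFD cell

def ratio_full_transfer(flop_per_cell):
    tx_per_cell = BYTES_PER_CELL / BUS
    tc_per_cell = flop_per_cell / GPU_FLOPS
    return tx_per_cell / tc_per_cell  # N cancels: per-cell ratio

print(ratio_full_transfer(100))   # low-order scheme: 192.0 -> hopelessly comm-bound
print(ratio_full_transfer(1000))  # high-order scheme: 19.2 -> still comm-bound
```

Note that N cancels out entirely: when you ship every byte every step, no problem size can save you.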

Will multi-GPU work (and hopefully an even wider network in a cluster as well) if we transfer only a fraction q of the memory out of each GPU’s RAM
after a calculation? Only if q < 1/(20…200), right? The problem is that it’s a bit more than that. Let’s consider a typical setup.

In CFD on a square grid n x n = N, we typically need to transfer/receive (4…8) times the thickness of the boundary zone in cells (assume 2 only)
times n, all in all (16…32)*n cells, times up to 20 variables per cell, so in bytes it’s (0.64…1.28)*sqrt(N) kB.
tx:tc = (0.05…1)*sqrt(1e7/N)
That is, with the whole 1 GB of RAM filled with data, we are able to have N = 10M cells, and tx:tc = (0.05…1) (good!)

But if we do a smaller array, like N = 1k x 1k only, then tx:tc = 0.5…10, which is not so great… (the high estimate comes from arithmetic-light algorithms
which transfer data in 4 different directions; sometimes you can get away with just 2 and have boundary conditions on the other 2).
So, multi-GPU should be doable, with care. Even a cluster may work, but it must be at least 20 Gbit/s InfiniBand, I believe.
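The halo-exchange case can be checked the same way, using the author’s own estimate of (0.64…1.28) kB of boundary traffic per sqrt(N) and C = 100…1000 FLOP per cell (all numbers are the post’s assumptions):

```python
import math

# Halo exchange on a 2-D n x n = N grid: boundary traffic scales as sqrt(N),
# compute scales as N, so bigger grids are relatively cheaper to couple.
BUS = 2.5e9        # effective PCIe bandwidth, bytes/s
GPU_FLOPS = 600e9  # sustained GFLOP/s

def ratio_halo(n_cells, kb_per_sqrt_n, flop_per_cell):
    tx = kb_per_sqrt_n * 1e3 * math.sqrt(n_cells) / BUS  # boundary transfer, s
    tc = flop_per_cell * n_cells / GPU_FLOPS             # bulk compute, s
    return tx / tc

# Whole 1 GB filled, N = 10 M cells:
print(round(ratio_halo(1e7, 0.64, 1000), 3))  # best case:  0.049
print(round(ratio_halo(1e7, 1.28, 100), 2))   # worst case: 0.97
```

This reproduces the tx:tc = (0.05…1) range for N = 10 M, and shrinking N by 10x inflates the ratio by sqrt(10), which is why the 1k x 1k grid looks so much worse.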

An additional problem is the dimensionality of the data. In 3-D, a boundary two cells deep constitutes a lot (an order of magnitude) more of the
total array than in 2-D!