It’s OK but you need N>1000.

N-body is such a compute-intensive task that in this particular case there may be no problem for multi-GPU. Here’s the proof:

Imagine having several GPUs that need to communicate via PCIe at an effective speed of X = 2.5 GB/s (PCIe offers a maximum of 6 GB/s, but in this situation it’ll be less). Each GPU has 1 GB of RAM associated with it, filled with data for N particles. Let’s assume that’s 10 float variables per particle, for a total of 4*10*N bytes. Sending this data over the bus takes

tx = 4*10*N/(2.5e9) s = 16*N/1e9 s

plus latency and overhead, which must be included in the calculation when the amount of data is small. If you fill all your RAM with data, tx is of order 1 GB / (2.5 GB/s) = 0.4 s. The question now is: what is tc, the compute time?

Let’s denote the complexity of the algorithm as C [FLOP/particle]. N-body is an N^2 algorithm with not too many operations per pair interaction; I’ll assume about 20, so C ~ 20*N FLOP. N-body can run on one GPU at a high efficiency of about 600 GFLOP/s (see the SDK example), so

tc = 20*N*N/(600e9) s = (N^2/30e9) s

(Notice this gives a surprisingly good estimate of the frame rate in the N-body SDK example: with N^2 = 1e9 it predicts tc = 1/30 s, i.e. 30 fps, and on 1 GPU it does 22 fps on my machine.)

The ratio of communication to computation is

tx:tc = 480/N

Therefore, small-N systems (N<1000) will *not* be happy with multi-GPU, also because of the PCIe latency that we skipped. Large systems (N>1000), however, will be able to communicate quickly enough, perhaps even to feed the GPUs with data continuously (overlapping compute and transfer). That’s the case with the N=30000 SDK example.
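To play with these numbers, here is a minimal Python sketch of the estimate above. It is only the back-of-envelope model, not a benchmark: the constants are the assumptions made in this post (10 floats per particle, 2.5 GB/s effective bus speed, 600 GFLOP/s, 20 FLOP per interaction), the function names are mine, and latency is ignored.

```python
# Back-of-envelope N-body model; every constant is an assumption from the text.
BYTES_PER_FLOAT = 4
FLOATS_PER_PARTICLE = 10      # assumed state per particle
BUS_BYTES_PER_S = 2.5e9       # assumed effective PCIe bandwidth
GPU_FLOPS = 600e9             # assumed sustained N-body throughput
FLOP_PER_INTERACTION = 20     # assumed cost of one pair interaction


def transfer_time(n):
    """Seconds to send all particle data over the bus (latency ignored)."""
    return BYTES_PER_FLOAT * FLOATS_PER_PARTICLE * n / BUS_BYTES_PER_S


def compute_time(n):
    """Seconds for one all-pairs step on a single GPU."""
    return FLOP_PER_INTERACTION * n * n / GPU_FLOPS


def comm_to_compute(n):
    """The ratio tx:tc, which reduces to 480/N with the constants above."""
    return transfer_time(n) / compute_time(n)


for n in (100, 1000, 30000):
    print(f"N={n:6d}  tx={transfer_time(n):.2e} s  "
          f"tc={compute_time(n):.2e} s  tx:tc={comm_to_compute(n):.3f}")
```

In this model the ratio crosses 1 at N = 480, so N of a few thousand and up leaves plenty of room for the latency that the model leaves out.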

This will of course be the same for other strongly coupled (all-pairs) algorithms, as long as they do a few dozen FLOP per pair of cells/particles (they all do).

So it’s not so bad, we can use multi-GPU for those problems!

What’s the situation with CFD? It doesn’t need to transfer all the data, typically just the boundary conditions, which amount to O(1)*sqrt(N) bytes. Then we should be OK, maybe… let’s return to this in a second.

But if we want to transfer all the data every time step, for instance to do some kind of combined FFT or something, then it’s much worse.

Multi-GPU may or may not make sense then. If we assume a complexity of C = 100-1000 FLOP per cell in the above calculation (hopefully covering the range from low-order to high-order schemes), and take into account that there are up to 20 rather than 10 variables per cell in CFD, then

tx:tc ~ 200…20

totally comm-bound :-(
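The same estimate for the "transfer everything every step" case, as a small Python sketch. Again the constants are just the assumptions from this post (20 float variables per cell, 2.5 GB/s, 600 GFLOP/s), and the function name is made up for illustration. Note that N cancels out, because both tx and tc are proportional to N here.

```python
def full_transfer_ratio(flop_per_cell, vars_per_cell=20,
                        bus_bytes_per_s=2.5e9, gpu_flops=600e9):
    """tx:tc when every cell is sent each time step (latency ignored).

    All defaults are assumptions taken from the text, not measurements.
    """
    bytes_per_cell = 4 * vars_per_cell            # 4-byte floats
    tx_per_cell = bytes_per_cell / bus_bytes_per_s
    tc_per_cell = flop_per_cell / gpu_flops
    return tx_per_cell / tc_per_cell


for c in (100, 1000):
    print(f"C={c:5d} FLOP/cell  tx:tc={full_transfer_ratio(c):.1f}")
```

With these inputs the ratio is 19200/C, which is where the 200…20 range comes from.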

Will multi-GPU work, and hopefully an even wider network in a cluster as well, if we transfer only a fraction q of the memory out of each GPU’s RAM after a calculation? Only if q < 1/(20…200), right? The problem is that it’s a bit more. Let’s consider a typical setup.

In CFD on a square grid n x n = N, we typically need to transfer/receive (4…8) times the thickness of the boundary zone in cells (assume just 2) times n, all in all (8…16)n cells, times up to 20 variables per cell at 4 bytes each, so in bytes it’s (0.64…1.28)*sqrt(N) kB.

tx:tc = (0.05…1) *sqrt(1e7/N)

that is, with the whole 1 GB of RAM filled with data, we are able to have N=10 M cells, and tx:tc = (0.05…1) (good!)

But if we do a smaller array, like only N = 1k x 1k, then tx:tc = 0.15…3, which is not so great. (The high estimate comes from arithmetic-light algorithms which transfer data in 4 different directions; sometimes you can get away with just 2 and have boundary conditions on the other 2.)
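Here is the boundary-exchange estimate as a Python sketch, in case you want to plug in your own grid size. It assumes a square n x n grid, a boundary zone 2 cells thick, 20 float variables per cell, and the same 2.5 GB/s and 600 GFLOP/s as before. The parameter `strips` is my name for the (4…8) factor, i.e. how many boundary strips are sent plus received per step.

```python
import math


def halo_ratio(n_cells, flop_per_cell, strips, halo_depth=2,
               vars_per_cell=20, bus_bytes_per_s=2.5e9, gpu_flops=600e9):
    """tx:tc when only boundary strips of a square grid are exchanged.

    Defaults are the assumptions from the text; latency is ignored.
    """
    side = math.sqrt(n_cells)                     # n, cells along one edge
    halo_cells = strips * halo_depth * side       # cells moved per step
    tx = 4 * vars_per_cell * halo_cells / bus_bytes_per_s
    tc = flop_per_cell * n_cells / gpu_flops
    return tx / tc


for n_cells in (1e7, 1e6):
    best = halo_ratio(n_cells, flop_per_cell=1000, strips=4)
    worst = halo_ratio(n_cells, flop_per_cell=100, strips=8)
    print(f"N={n_cells:.0e}  tx:tc = {best:.3f} … {worst:.3f}")
```

Since tx grows like sqrt(N) and tc like N, the ratio falls off as 1/sqrt(N), so bigger grids per GPU are always better in this model.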

So multi-GPU should be doable, with care. Even a cluster may work, but I believe it needs at least 20 Gbit/s InfiniBand.

An additional problem is the dimensionality of the data. In 3-D, a boundary two cells deep constitutes a lot more (an order of magnitude more) of the total array than in 2-D!
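A quick way to check the 2-D versus 3-D claim is to count the cells lying within two cells of any face of a square or cubic array. This is a rough Python sketch under that simplifying assumption (the whole array sits on one GPU and every face counts as boundary); the function name is made up.

```python
def boundary_fraction(n_cells, dim, depth=2):
    """Fraction of cells within `depth` cells of a face of a square/cubic array."""
    side = n_cells ** (1.0 / dim)                 # cells along one edge
    interior = max(side - 2 * depth, 0.0) ** dim  # cells not in the boundary
    return 1.0 - interior / n_cells


n_cells = 1e7   # roughly what fits in 1 GB at 20 floats per cell
f2 = boundary_fraction(n_cells, dim=2)
f3 = boundary_fraction(n_cells, dim=3)
print(f"2-D: {100 * f2:.2f} %   3-D: {100 * f3:.2f} %   ratio: {f3 / f2:.1f}")
```

For N = 1e7 the 3-D fraction comes out more than ten times the 2-D one, in line with the order-of-magnitude statement.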