We currently have CUDA software running on a single Tesla GPU. However we plan to subdivide the problem and run it on multiple GPUs.
Since considerable data transfer between the GPUs might be needed, we want to make use of CUDA 4.0 capabilities for multi GPU programming.
Apart from speed, also the software maintanance and transparent programming of multiple GPUs is crucial.
For a PC with 2 GPUs we don not see a problem.
However we have now been offered the possibility to configure/buy a system supporting upto 8 Tesla GPUs.
As far as I understand (I am by no means a hardware expert) the system basicall consists of multiple mainboards which are intelligently coupled and I know that such multi-processor systems are around.
Does anyone have experience with such systems, possibly with using multiple GPUs on them ?
Can I expect/hope that CUDA 4.0 will support more than 2 GPUs on such a system transparently (GPU-GPU data transfer, global address space etc.) ?
Or are there pitfalls/limitations to be expected in principle ?
I’m running a system with three Fermi GPUs. Those are regular GeForce GPUs, not Teslas.
If I were you, I would ask myself a question: are the CPU and other devices (memory, disk, etc) in this multi-GPU rig going to be fast enough to keep so many GPUs busy? In other words, given your particular application, what do you expect to be the bottleneck in this multi-GPU monster, the GPUs or some other parts of the system? This question is not really of great importance, if you’re utilizing low-cost GPU hardware, but given that you’re using expensive GPUs, it might be cheaper to invest in an extra box and keep the Teslas running full time, rather than saving money on the box and observing the GPUs sitting and waiting for the CPU, memory, or something else. For example, when running my application, my quad-core, hyper-threaded Intel CPU is 40% busy on the average serving the tasks to only three GPUs, and I consider my code to be heavy on the GPU, light on the CPU.
Reliability might be another important consideration. If you chose to build two 8-GPU machines, but only load 4-GPUs into each of them, then your system is redundant: if one of your motherboard burns, then you quickly switch all the GPUs onto the remaining motherboard and start working on fixing that broken box, while keeping all your Teslas in service.
This question becomes only harder, if you’re a development shop. In that case your future application might have unexpected, CPU-related bottlenecks.