Our university is planning to buy some C1060s. As we will have many users, there is no single kind of application that will be executed; more likely there will be several applications from different domains. I assume that the CUBLAS library will be fairly popular. Can you give us some recommendations on useful configurations?
Given a certain budget, would you rather buy fewer compute nodes with four C1060s each, or more nodes with one C1060 each (or something in between)?
In the case of one C1060 per node, we might use MPI for larger problems. Is it possible to use four C1060s per node with pthreads, or does this require MPI as well? Or can four C1060s be controlled sensibly even from a single-threaded application?
Would you spend money on fast CPUs or rather buy more C1060s? Does it make sense to buy two CPUs per mainboard?
How much memory would you recommend per C1060? Do you use the entire 4 GB per C1060 in your applications? Or would you recommend even more? How important do you consider it to maintain 1333 MHz on the FSB?
As you can see, we are interested in some general hints on how to set up a cluster that will be used by an entire university.
I should preface my remarks by saying that my only experience with GPGPU computing on a distributed memory cluster is on a “workgroup” scale cluster with eight compute nodes, built out of consumer parts in house. And we are at the beginning of the learning curve in moving many of our applications across to CUDA or CUDA/MPI hybrids. Most of our codes are domain decomposition based, and we are offloading serial or multithreaded CPU subdomain level calculations onto the GPU to improve performance.
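In case it helps, the rough shape of that CUDA/MPI hybrid pattern is sketched below. This is only an illustration of the idea, not code from our applications: the kernel, problem size and rank-to-GPU mapping are placeholders, error checking is omitted, and it assumes MPI ranks are packed per node so that `rank % device_count` gives a sensible GPU binding.

```
// Sketch only: one MPI rank per GPU, each rank offloads its own subdomain.
// subdomain_kernel and the sizes are placeholders for illustration.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void subdomain_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // stand-in for the real subdomain update
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);

    // Bind this rank to one GPU on its node (assumes ranks are packed per node).
    cudaSetDevice(rank % ndev);

    const int n = 1 << 20;                       // subdomain size (placeholder)
    float *h = (float *)malloc(n * sizeof(float));
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));

    // ... fill h with this rank's subdomain, exchange halos via MPI ...

    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    subdomain_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d);
    free(h);
    MPI_Finalize();
    return 0;
}
```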
Personally, I would favour a 1:1 GPU:CPU ratio, for two reasons: firstly because multi-GPU programming isn’t easy and on a general purpose cluster I doubt it would constitute enough of a fraction of your job mix to justify it, and secondly because scheduling GPGPU jobs on the cluster isn’t simple, and multi-GPU jobs (or multiple single-GPU jobs on a single node) are even harder.
Although I don’t have any experience with it myself, others have certainly reported good results using pthreads for multi-GPU programs. You certainly need one host CPU thread or MPI process per active GPU; I don’t see how it could feasibly work from a single-threaded process.
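For what it’s worth, the basic shape of a pthreads multi-GPU program is something like the sketch below. Again this is just an illustration (worker_kernel and the sizes are placeholders): the key point is that each thread must call cudaSetDevice before doing any other CUDA work, so that it gets its own context on its own GPU.

```
// Sketch only: one pthread per GPU. Each thread selects its device first,
// then all CUDA calls it makes run in that device's context.
#include <pthread.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void worker_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = (float)i;          // placeholder work
}

static void *gpu_worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);           // must happen in this thread, before any other CUDA call

    const int n = 1 << 20;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    worker_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaThreadSynchronize();      // wait for the kernel on this thread's GPU
    cudaFree(d);
    return NULL;
}

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    pthread_t threads[8];
    int ids[8];
    for (int i = 0; i < ndev && i < 8; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, gpu_worker, &ids[i]);
    }
    for (int i = 0; i < ndev && i < 8; ++i)
        pthread_join(threads[i], NULL);

    printf("ran on %d GPU(s)\n", ndev);
    return 0;
}
```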
For a general purpose cluster, I would still be looking for the best per-core host CPU performance you can afford. There will be plenty of applications that still benefit from fast CPUs, or that can’t exploit the GPU. A two-socket motherboard certainly increases density significantly, which can be an important factor on a very large cluster. For us it wasn’t justifiable, but for large multithreaded applications it might be.
You really should have at least as much free RAM on the unloaded node as each GPU has memory. In my experience, nodes run well with problems that require at most roughly 80% of the node memory; go higher than that and you will start to see the node performance degrade. In practice, that probably means the host should have something like 1.2 times the total GPU RAM, although that is pure speculation on my part. We are only using a single consumer GPU per node, and our 3 GB nodes run well with applications up to about 2.5 GB footprints. For larger GPUs and applications it might vary.
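Just to put numbers on that rule of thumb (back-of-the-envelope only, not a tested configuration): a node with four C1060s has 4 × 4 GB = 16 GB of GPU memory, and 1.2 × 16 GB ≈ 19.2 GB, so you would be looking at something like 24 GB of host RAM as the nearest practical configuration. A single-C1060 node works out to about 5 GB, i.e. 6 or 8 GB of host RAM in practice.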