How many C1060s per node?


Our university is planning to buy some C1060s. Since we will have many users, no single kind of application will dominate; we expect several applications from different domains. I assume that the CUBLAS library will be fairly popular. Can you give us some recommendations regarding useful configurations?

  1. Given a certain budget, would you rather buy fewer compute nodes with 4 C1060s each, or more nodes with one C1060 each? (or something in between)

  2. In the case of one C1060 per node, we might use MPI for larger problems. Is it possible to use four C1060s per node with pthreads, or does this require MPI as well? Or can four C1060s be controlled sensibly even from a single-threaded application?

  3. Would you spend money on fast CPUs or rather buy more C1060s? Does it make sense to buy two CPUs per mainboard?

  4. How much memory would you recommend per C1060? Do you use the entire 4 GB per C1060 in your applications? Or would you recommend even more? How important do you consider it to maintain 1333 MHz on the FSB?

As you can see, we are interested in general hints on how to set up a cluster that will be used by an entire university.

Thanks a lot. Your advice is highly appreciated.

Best regards,


I should preface my remarks by saying that my only experience with GPGPU computing on a distributed memory cluster is on a “workgroup” scale cluster with eight compute nodes, built out of consumer parts in house. And we are at the beginning of the learning curve on moving many of our applications across to CUDA or CUDA/MPI hybrids. Most of our codes are domain decomposition based, and we are offloading serial or multithreaded CPU subdomain level calculations onto the GPU to improve performance.

Personally, I would favour a 1:1 GPU:CPU ratio, for two reasons: firstly because multi-GPU programming isn’t easy, and on a general purpose cluster I doubt it would constitute enough of a fraction of your job mix to justify it; and secondly because scheduling GPGPU jobs on a cluster isn’t simple, and multi-GPU jobs (or multiple single-GPU jobs on a single node) are even harder.

Although I don’t have any experience with it, others have certainly reported good results using pthreads for multi-GPU programs. You certainly need one host CPU thread or MPI process per active GPU; I don’t see how it could feasibly work with a single-threaded process.

For a general purpose cluster, I would still be looking for the best per-core host CPU performance you can afford. There will be plenty of applications that still benefit from fast CPUs, or can’t exploit the GPU. A dual-socket node motherboard certainly increases density significantly, which can be an important factor in a very large cluster. For us it wasn’t justifiable, but for large multithreaded applications it might be.

You really should have at least as much free RAM on an unloaded node as the total GPU memory. In my experience, nodes run well with problems that require at most roughly 80% of the node memory; go higher than that and you will start to see node performance degrade. In practice, that probably means the host should have something like 1.2 times the sum of the GPU RAM, although that is pure speculation on my part. We are only using a single consumer GPU per node, and our 3 GB nodes run well with apps up to about 2.5 GB footprints. For larger GPUs and applications it might vary.


My company will probably choose the GTX295 for a couple of reasons:

a. The C1060 is roughly twice the price of a GTX295 (with the 50% off program it’s probably about the same), but you get two GPUs in the GTX295, so price-wise it’s either two or four times cheaper.

b. The performance of one half of the dual should be about the same as one C1060.

c. If the RAM on the GPU is not an issue for you (you can squeeze your data into the ~800 MB of RAM), then the 4 GB of the C1060 is not a deciding factor.

d. If you’re not a real production environment, the 295 should be safe enough.

e. You should take into account, though, that the 295 has no support from NVIDIA, no warranty, and so on.

Obviously NVIDIA recommends the C1060, for obvious reasons. You can search the forums here, where one user has 32 GPUs, all GTX295s, running 24x7.

Look in the forums for multi-GPU and you’ll find more than enough information (specifically the GPUWorker class from MrAnderson42).

I use pthreads to drive the GPUs, and it works perfectly :)

MachineA: GTX285 and C1060 - two threads

MachineB: two Intel quad-core CPUs, 3x GTX295 - 6 pthreads

MachineC: one AMD quad-core CPU, 4x GTX295 - 8 threads.

Basically, I think it is better to have one CPU core per GPU (per half of the GTX295 in my case).

However, I’m not sure it’s really a must; it probably depends on your application. You should try and test it.

The simplest answer: as much as possible :) But it really depends on what your app does.

As for the RAM on the GPU, again it is application-dependent. I am able to break my dataset into small portions, so on a GTX295 with only ~800 MB I run a small portion of the dataset at a time, and on the C1060 I dynamically (in code) give a bigger dataset to the GPU to work on.

If you can’t break your work into small datasets, you might be forced to use the C1060 with its 4 GB.

If you can get your hands on a test environment with different configurations, that would be best, in my experience.

Maybe your local nVidia branch can assist with that.

I’d also vote for the GTX 295 unless your applications will really need 4 GB of RAM.
I would not recommend trying to get 4x C1060 to work in one workstation. It is possible, but there are problems:

  1. PSU and cooling
  2. Some BIOSes will require you to have as much system RAM as total video RAM
  3. It’s preferable to have a quad-core CPU to drive 4 C1060s
  4. You’ll need to find an extra PCIe slot for a compatible display adapter.

I’d probably choose workstations with 2x C1060, 8+ GB RAM and quad-core CPU.

That’s true, and it’s why I moved to Linux. Our code already ran with regular CPU threads on both Windows and Linux.

However, if you use Linux with no X running, you can use all the cards, and you don’t have to worry about the dreadful watchdog ;)

Also, our tests showed that the GTX280 ran ~30% faster than the C1060, so if per-card performance matters to you, it’s something else to consider.