CUDA on 2x GTX 260 — what hardware is best for me?


I am about to order hardware for heavy CFD simulations. The main part of the simulation will run on the GPU, so I would like to have two GTX 260 GPUs. Additional requirements: a really large amount of main memory and at least 8 CPU cores (I have a mixed CPU/GPU solver).

So, I found one configuration, but I do not know whether it works with CUDA (I mean, with the 2 GPUs together or not).

The configuration has the following items:

  1. Tyan Thunder n3600M, nForce Pro 3600 (dual Sockel-F, dual PC2-5300R reg ECC DDR2) (S2932G2NR) 2x PCIe x16
  2. 2x AMD Opteron 2350 B3, Sockel-F boxed, 4x 2.00GHz, 4x 512kB Cache, 2MB shared L3-Cache (OS2350WAL4BGHWOF)
  3. 16x Aeneon DIMM 4GB PC2-5300U CL5 (DDR2-667)
  4. 2x Colorful GeForce GTX 260, 896MB GDDR3, 2x DVI, TV-out, PCIe 2.0

Please advise me: will I see 48 multiprocessors and about 1.7GB of GPU memory on this configuration, or should I consider other hardware?

EDIT: I forgot to say that this computer will run under Linux.

Thank you!



That motherboard should work, but note it has only PCIe 1.0 x8 slots. Ideally, if you’re transfer limited at all, you want PCIe 2.0 x16 slots.

If you’re going to invest the money in such a power box, why did you pick the GTX 260 and not the 280 or a Quadro?
Extra speed is good, but GPU memory can be even more limiting.

It all depends on your application’s bottlenecks of course.

Hi, SPWorley,

thank you for your kind answer!

You are right about the 280: it could be faster, but it has only about 10% more GPU memory. My application is bounded by GPU memory, not by GPU performance :( It sums large vectors, and since GPU memory is 50-100 times faster than the CPU’s main memory, my application also runs about 50 times faster. I did not consider a Quadro or Tesla because they are very expensive… ($3000 for a Quadro versus only $300 for a 260, am I right?) I want to build this computer on a budget of about $3-4K.

I picked PCIe x8 only because I did not find a motherboard with 16 DIMM slots and 2 PCIe x16 slots.

What I am planning to do is development and small computations of supersonic (Mach 20) Boltzmann simulations with a new deterministic method. It requires a huge amount of memory, and the computations are organized in such a way that:

  1. 80% of the time is running the GPU kernel with maximally large data;
  2. 10% is data transfer between main memory and the GPU (this could also be made fully asynchronous with point 1, but that is not yet implemented); in the case of PCIe x8/x16 this is still not a big problem for me, since it can be asynchronous;
  3. 10% is double-precision computations that are impossible to port to the GPU because of very complicated mathematical algorithms.

Hence, I need:

  1. as much main memory as possible (64GB in my configuration);
  2. a GPU with CUDA compute capability 1.3 (otherwise my kernels do not work);
  3. the maximum possible amount of GPU memory (the difference between the 260 and 280 is only 10%);
  4. the 8 CPU cores, used only for faster memory access in my 10% of double-precision CPU computations.
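The asynchronous transfer mentioned in point 2 above could be overlapped with kernel execution using CUDA streams. A minimal double-buffering sketch follows; the kernel name, chunk size, and buffer names are placeholders, not from my actual solver. Note that asynchronous copies require page-locked host memory allocated with cudaMallocHost:

```cuda
#include <cuda_runtime.h>

// Placeholder for the real CFD kernel.
__global__ void myKernel(float *d_data, int n) { /* ... */ }

int main(void)
{
    const int chunk = 1 << 20;   // elements per chunk (placeholder value)
    const int numChunks = 8;     // placeholder value
    float *h_buf[2], *d_buf[2];
    cudaStream_t stream[2];

    for (int i = 0; i < 2; ++i) {
        // Pinned host memory is required for cudaMemcpyAsync to be truly async.
        cudaMallocHost((void **)&h_buf[i], chunk * sizeof(float));
        cudaMalloc((void **)&d_buf[i], chunk * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    // Double-buffer: while one stream computes on its chunk,
    // the other stream uploads the next chunk.
    for (int c = 0; c < numChunks; ++c) {
        int s = c & 1;
        cudaMemcpyAsync(d_buf[s], h_buf[s], chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        myKernel<<<256, 256, 0, stream[s]>>>(d_buf[s], chunk);
        cudaMemcpyAsync(h_buf[s], d_buf[s], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaThreadSynchronize();   // wait for all streams to finish
    return 0;
}
```

On a GTX 260 there is a single copy engine, so one transfer can overlap with one kernel at a time, which is exactly the 10%/80% overlap described above.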

Finally, I hope I have explained why such strange hardware is being considered :)

Now, what I need to know is:

  1. what exactly the motherboard must have to support 2 GPUs like the 260/280,
  2. how these GPUs will be visible in CUDA,
  3. whether this is true for my configuration.



I wonder, do you really need the GPU memory? It sounds like you have already worked out how to block your computation, transferring parts of the data at a time. (This is the hard part; not everyone does this.) If this is indeed what you’ve done, then more GPU memory will only decrease the fraction of time spent on device-host transfers, which is already low. But if more GPU memory really will help you, then the Tesla C1060 has 4GB. I think the C1060 is under $1500.

Also, your 2 cards will be seen as two separate devices, and you have to manually divide the computation between them. So you can process 896MB (a little less) per kernel call.
The C1060, as said, has 4GB of memory with 102GB/s bandwidth, and looks to be an ideal option for you as far as I can see.
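To see how the two cards show up, you can enumerate them with the runtime API; each GTX 260 appears as its own device with 24 multiprocessors and 896MB. A small sketch (the printed format is illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);   // two GTX 260s appear as devices 0 and 1

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s, %d multiprocessors, %lu MB\n",
               d, prop.name, prop.multiProcessorCount,
               (unsigned long)(prop.totalGlobalMem >> 20));
    }

    // To use both cards, each host thread (or process) binds to one device
    // and then allocates memory and launches kernels independently, e.g.:
    //   cudaSetDevice(0);  /* ... first half of the grid ...  */
    //   cudaSetDevice(1);  /* ... second half of the grid ... */
    return 0;
}
```

There is no automatic work splitting: the division of the grid between device 0 and device 1 is entirely up to the application.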


Yes, thank you! It seems that one C1060 will be enough and not so expensive. In this case I can play with different motherboards, since I need only one PCIe x16 slot. The only thing left to do is to find where to buy it in Germany.



Yes, thank you for your kind advice about the Tesla C1060. It seems that it is the solution to my problem.

Yes, the algorithm has a very nice property: it takes a small amount of data (ca. 5-100KB) and does a lot of work on it (ca. 10^6 FLOPs). This procedure must be done for each finite element in my unstructured grid. Hence, I only need a good data-transfer procedure between main memory and the GPU, and since I have been developing out-of-core linear system solvers since 1993, I have some ideas about how to do it :)

To better pipe data to the GPU and back, I need more memory. To do most of the calculations on the GPU, I need CUDA compute capability 1.3 because of my mixed float/double solver.
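Whether a given card supports the double-precision path can be checked at runtime; compute capability 1.3 or higher (GTX 260/280, Tesla C1060) is required for doubles. A small sketch of such a check, with the float/double dispatch only indicated in comments:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Double precision requires compute capability >= 1.3.
    if (prop.major > 1 || (prop.major == 1 && prop.minor >= 3))
        printf("sm_%d%d: double precision supported\n", prop.major, prop.minor);
    else
        printf("sm_%d%d: float-only device\n", prop.major, prop.minor);

    // A mixed solver would select the double kernel in the first branch
    // and fall back to a float kernel in the second.
    return 0;
}
```

Inside device code, the double-precision parts can also be guarded at compile time with `#if __CUDA_ARCH__ >= 130`, so the same source compiles for older cards.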

If you are really interested in my algorithm, I would ask you to wait a little bit; I am about to submit it to the CUDA portal.