New System Question

I’m designing a system to perform matrix multiplication, sorting and 2D convolution on a large number of grid cells and/or nodes. I’d like to run the system on a Linux-based operating system (preferably Debian-based). Is the 8600 GTS an appropriate choice for developing the system, and is it properly supported by the CUDA software? Which Linux distribution appears to have the fewest issues with this software?

The 8600 GTS will work fine with CUDA, but the card is quite slow. Don’t expect any spectacular speedups with it. The 8800 GTX (or Tesla if you need the 1.5GB of memory) is really the only route to take if you are looking at making a big compute cluster. If you absolutely need to keep costs down, the best performance/price would be the 8800 GT.

If you plan on running jobs in parallel across many nodes, I would also suggest doing a feasibility study of how the scaling is affected by communication overhead before buying. In my application, molecular dynamics, the GPU is so fast (roughly the speed of a 32-core CPU cluster) that the communication required by adding a second node over 10Gb InfiniBand would result in an overall slower calculation. I’m going to explore 2- and 4-GPU workstations first, to keep the communication between GPUs only.

Depending on your application’s computation/communication ratio, your results may differ from mine.
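A feasibility estimate like the one suggested above can start as a back-of-the-envelope model. The sketch below is only an illustration: it assumes per-step time is compute time divided evenly across nodes plus a fixed communication cost, and every number in it is a made-up placeholder, not a measurement from this thread. The names `step_time` and `speedup` are hypothetical helpers.

```python
# Toy strong-scaling model. All timings are made-up placeholders,
# not measurements from the thread.

def step_time(compute_s, comm_s, nodes):
    """Estimated wall time per step when the work is split across `nodes`.

    compute_s: time for one node to do the whole step alone
    comm_s:    fixed per-step communication cost once more than one node is used
    """
    if nodes == 1:
        return compute_s
    return compute_s / nodes + comm_s


def speedup(compute_s, comm_s, nodes):
    """Speedup relative to running the same step on a single node."""
    return compute_s / step_time(compute_s, comm_s, nodes)


# Hypothetical CPU node: 32 ms of compute per step, 1 ms of communication.
cpu_speedup = speedup(0.032, 0.001, 2)
# Hypothetical GPU node: 32x faster compute (1 ms), same 1 ms of communication.
gpu_speedup = speedup(0.001, 0.001, 2)

print(f"2 CPU nodes: {cpu_speedup:.2f}x")
print(f"2 GPU nodes: {gpu_speedup:.2f}x")
```

With these placeholder numbers the second CPU node still helps, while the second GPU node makes the step slower than a single GPU, because the fixed communication cost is now as large as the compute itself. Real numbers for your own application would have to come from profiling.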

Thanks, is the 8800 GTS supported in the current release of CUDA?

The 8800 GT has been supported since 1.1b.

(btw, it’s 8800 GT, not GTS… really confusing, isn’t it… haha~)

There are 4: 8800 GT, 8800 GTS, 8800 GTX, 8800 Ultra

The first of these is G92-based, the last 3 are G80-based.

CUDA is supported on all of these as well as all other GeForce 8x00 GPUs, the Quadro FX 4600 and 5600, and all Tesla boards. Probably some other Quadros I’m forgetting too.

Mark

If you have 2 GPUs then you should be able to run 2 such applications. Splitting one application across a cluster just because you have a cluster to use may NOT really work. As you add more cluster nodes, I think you just schedule more and more applications and collate the results at the end using a single front-end. btw, I am NO cluster expert.

Right, the cluster use case you describe is “embarrassingly parallel”: running completely independent jobs on different nodes and then collecting the results. I often run many independent jobs like this on a cluster, except that each of my jobs takes up 32 to 128 processors and still requires 24 to 96 hours to complete! So you see why I wouldn’t want to just run each of my jobs on a single processor :) I don’t have months to wait for results.
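For readers new to the term, the pattern looks roughly like the sketch below. It is only an illustration of the scheduling idea: `run_job` and `run_all` are hypothetical names, the "job" is a trivial stand-in, and a thread pool stands in for whatever batch scheduler a real cluster would use.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(job_id):
    """Hypothetical stand-in for one completely independent job."""
    return job_id, sum(i * i for i in range(job_id * 1000))


def run_all(job_ids, workers=2):
    """Run independent jobs on a pool of workers and collate the results."""
    # The jobs never talk to each other, so the order they run in is irrelevant.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_job, job_ids))


results = run_all([1, 2, 3, 4])
print(sorted(results))
```

The only coordination is at the end, when the front-end gathers the results. That is why this pattern scales so easily, and also why it does nothing to shorten the time a single job takes.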

You are correct that not all applications can be broken up across nodes, but I assure you Molecular Dynamics can, even up to tens of thousands of processors on a cluster.

Of course, the first multi-GPU mode I add to my software will be to run multiple instances of my application on separate GPUs. But I also want to push the performance envelope as far as I can by applying all GPUs to the same simulation, hopefully achieving near 128-processor performance with a single workstation, so that I don’t have to wait 4x longer for a single job to finish on one GPU than if I had started it on the cluster.
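The 4x figure follows from the numbers quoted earlier in the thread: one GPU is roughly a 32-core cluster, and the largest jobs use 128 processors. A quick sketch of that arithmetic is below. It assumes ideal linear scaling on the cluster, `relative_wait` is a hypothetical helper, and the 80% multi-GPU efficiency is an invented placeholder.

```python
GPU_EQUIV_CORES = 32   # from the thread: one GPU ~ a 32-core CPU cluster
CLUSTER_PROCS = 128    # upper end of the job sizes mentioned in the thread


def relative_wait(cluster_procs, gpu_equiv_cores, n_gpus=1, efficiency=1.0):
    """How many times longer the GPU run takes than the cluster run.

    Assumes the cluster scales linearly with processor count; `efficiency`
    is the fraction of ideal multi-GPU scaling actually achieved.
    """
    gpu_throughput = gpu_equiv_cores * n_gpus * efficiency
    return cluster_procs / gpu_throughput


one_gpu = relative_wait(CLUSTER_PROCS, GPU_EQUIV_CORES)
four_gpus = relative_wait(CLUSTER_PROCS, GPU_EQUIV_CORES, n_gpus=4)
# Placeholder: 80% scaling efficiency across 4 GPUs (not a measured value).
four_gpus_lossy = relative_wait(CLUSTER_PROCS, GPU_EQUIV_CORES, n_gpus=4,
                                efficiency=0.8)

print(one_gpu, four_gpus, four_gpus_lossy)
```

Under these assumptions a single GPU waits 4x longer than the 128-processor run, and four GPUs would only match it if they scaled perfectly, which is why the communication between GPUs matters so much.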