I’m not sure whether this question has been answered before; if it has, I’m sorry for repeating it. There are a number of similar topics on the forums, but they don’t exactly answer my questions.
Even though I’m not new to programming, I’m very new to GPU programming. I’m involved in research on Large Eddy Simulations (LES) and Direct Numerical Simulations (DNS). These types of analyses are very similar to Computational Fluid Dynamics (CFD) calculations, except that I’m dealing with massive grids (well over 100 million cells). The problem is that these calculations also require massive computational power; standard computer clusters are usually not enough to complete so many calculations. This is where massively parallel computing, i.e. CUDA and GPU computing, comes in. The code is based around Fortran, with some visualization in Matlab, and it does require double precision.
My question is: For these types of analyses, would I be better off with a consumer graphics card with less memory but faster speed and bandwidth (GTX 280), or with a slower HPC card with more memory (Tesla C1060)? There is some visualization involved, but the main concern is doing the calculations.
At this point I’m involved in creating a proof of concept code that will showcase the power of GPU computing, since my adviser and the people who we do research for (NASA, Boeing, and DoD) are very skeptical of this. If they like what they see, there will be a number of grants coming my way to build a small scale cluster to do even more research.
This brings me to question number two: I know that I can build a computer with four video cards to generate over 4 TFlops of processing power, and link such machines using gigabit networking. Would I be better off building a few of those PCs or getting a couple of the S1070 1U servers? I looked through the S1070 spec sheet, but I didn’t find any information on how I can use more than one. Can I daisy-chain them, or do I need a separate PC for every S1070? If I do need a separate PC for each one, what specs should I be looking for?
My last question is about the compiler: Since all of the code is in Fortran, it gets too time consuming to convert it from Fortran to C. I’ve read rumors on the web that nvidia is planning to release a Fortran compiler for CUDA to ease porting existing code to CUDA. When can we expect this compiler to be out on the market? (I’ve read that it will be released by the end of the year.)
You have to find out how much memory your algorithm will actually require. (These things vary by orders of magnitude, and chances are it will either completely fit in both the 280 and Tesla or that it won’t fit in either.) It is often possible to write code that is flexible about memory availability, and can do work in pieces.
(This isn’t like Vista, where we all kind of know 2GB is what a person needs. Since you write the code yourself and the algorithm could be anything, there’s absolutely no way to set a baseline.)
Using an S1070, you can connect 8 GPUs to a PC instead of 4. (Each S1070 connects via two PCIe slots.) The only other advantage is form-factor-related: it packs 4 GPUs into 1U and supports PCIe x8. You still need multiple PCs to scale up further. (In any case, please keep in mind that the CPU still does its share of work in most CUDA apps, and host-device bandwidth is a bottleneck for many, so attaching 8 or even 4 GPUs to a single CPU can be inappropriate. But this, again, depends completely on your algorithm.)
As well they should be ;) More seriously, it’s easy to port code to CUDA, harder to do it well. It’s not so much lots-of-work difficult, or learning-a-lot-and-practicing difficult, but being-clever difficult. So it’s doable. Just study the Guide really well.
Do you know how many S1070s I can attach to one PC? I assume that the controller card for the S1070 goes into a PCIe x16 slot. Also, what are the recommended specs for a PC to run the S1070 effectively?
The S1070 is designed for server rooms; you don’t want one sitting next to you.
I would start with a C1060 for the initial porting and optimization, with the final goal of deploying on a cluster with S1070s.
What is the numerical method you are using for your DNS/LES?
The S1070 needs two host connectors. It can use either two PCIe x16 slots or two x8 slots (which is handy for servers). So, at maximum, you can attach two.
EDIT: According to Engadget, it looks like there’s a more expensive S1070 that has one connector. (In keeping perfect form with this being an over-priced niche product, NVIDIA keeps most information about it off its public webpage. Good job, NVIDIA.)
The numerical method is based on the Reynolds-averaged Navier–Stokes (RANS) equations, except that it’s optimized for DNS/LES analysis. This is done by “ignoring” the turbulence models that RANS analysis was designed around and instead resolving the turbulent eddies directly on massive 3D grids. To do that, the cells are extremely small, resulting in grids that are sometimes over 250 million cells. Because it will need to do a LOT of calculations, I wasn’t sure whether I needed more memory or more bandwidth and speed.