Advice on first CUDA system

I have a lot of experience in parallel computing and some in massively parallel computing, but I have not tried CUDA yet.

I have several applications that are candidates for massively parallel computing, and I’d like to try CUDA. I think I’d like to start with Matlab+Jacket; I currently use Matlab on 64-bit Vista or Windows 7.

I already have a Dell Precision M4400 - Quad Core 2.53GHz, 4GB RAM, nVidia Quadro FX 1700M (512MB) - which already runs 64-bit Windows and Matlab. I am under the (possibly naive) impression that all I have to do is add some sort of CUDA software and a trial copy of Jacket, and then I should be able to see what CUDA is like, although not at the highest performance.

So my first question is whether this sounds like a feasible plan. Would I be better off with different software? I’m fluent in all sorts of things (C, Fortran, Python, R), so if there is an easier first try, I’m willing to listen.

My second question is whether it would make sense for me to have a straightforward workstation built with one or two Tesla C1060 cards. I would rather not go for the usual workstation builds because they are LOUD, and I have previously had good luck with a company (not named to avoid spam) that builds very quiet gaming systems, DAWs, video boxes, etc., so this would probably be simple for them.

In case people are interested, my anticipated areas of application are:

Very large scale portfolio selection.
High complexity signal processing.
Combinatorial optimization.
Number theory.
Solid state physics.

The point of that list is that whatever strengths and weaknesses I find in CUDA, I expect at least one of these to work well on the platform, so I’m willing to get the system before I have a lot of experience with CUDA on the laptop. Is that a mistake?

CUDA’s strengths are most evident when you can apply the same instruction sequence to many independent data elements in parallel, and threads can access memory in long, contiguous blocks. It is also important to be able to amortize the time required to copy large data blocks from the CPU to the GPU over the PCI-Express bus, which is by far the slowest link in the chain. If you can copy a large block of data to the card at the beginning, then reuse it in many kernel calls with relatively small transfers of additional values to/from GPU memory, you will see a great benefit with CUDA.
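For example, the skeleton of that pattern looks something like this (a minimal sketch - the kernel and sizes are made up for illustration):

#include <stdlib.h>
#include <cuda_runtime.h>

// Stand-in kernel: scales each element by a parameter.
__global__ void scale_kernel(const float *data, float *out, int n, float param)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = data[i] * param;
}

int main(void)
{
    const int N = 1 << 20;              // 1M elements (4 MB), arbitrary
    size_t bytes = N * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < N; i++)
        h_data[i] = (float)i;

    float *d_data, *d_out;
    cudaMalloc((void **)&d_data, bytes);
    cudaMalloc((void **)&d_out, bytes);

    // One big transfer over the PCI-Express bus...
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // ...amortized over many kernel calls that only change a scalar.
    for (int step = 0; step < 1000; step++)
        scale_kernel<<<(N + 255) / 256, 256>>>(d_data, d_out, N, 0.5f * step);

    cudaMemcpy(h_data, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(d_out);
    free(h_data);
    return 0;
}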

As for building a CUDA workstation, if quiet is important, then your plan sounds good. I don’t know what your budget is, but if you are just evaluating CUDA, you might want to start with a GTX 285 rather than a Tesla card. The GTX 285 is slightly faster, and vastly cheaper. The only drawback is 1 GB of memory instead of 4 GB (although you can find GTX 285s with 2 GB of memory now). If you know you need more than 2 GB of device memory, then Tesla is the only way to go.

Since you are very experienced in parallel programming, my only other advice is a warning: CUDA is not pthreads/MPI/Cell/etc. Despite terms like “thread” and “processor” in the manual, CUDA is very different from the normal multithreaded programming you are accustomed to on multicore processors or large grids. The mismatch between MPI-like expectations and the actual design of CUDA can cause a lot of frustration.

For example: many people read the specs on the GTX 285 (or Tesla) and assume that you have a 240 core chip, where each core has a single precision floating point unit, and they try to program it that way. In fact, you really have a 30 core (“multiprocessor”) chip, where each core has an 8-wide SIMD single-precision floating point unit, a 2-wide special function unit, and a 1-wide double precision unit. A “thread” is just a lightweight register context from which SIMD elements are drawn, and threads are packaged into groups of 32 (the “warp”), with one instruction pointer for the group. 32-wide SIMD instructions are folded up and pipelined into the 8-wide SIMD unit on each multiprocessor.

Since thread resources are statically allocated at the start of kernel execution, there is no thread switching overhead. Just like oversubscribing a multicore processor with threads can make sense if you are limited by disk or network I/O latency, in CUDA it is essential to oversubscribe the SIMD units by factors of 16 or more to hide the latency of the 1 (or 2 or 4) GB of off-chip memory. This is usually what boggles people: a chip with 240 “processors” operates at peak efficiency when 10,000 “threads” are active.
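To make the numbers concrete, here’s a hypothetical launch configuration (the kernel itself is just a stand-in; the sizes are illustrative):

#include <cuda_runtime.h>

// Stand-in kernel; the point here is the launch configuration, not the math.
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launch one thread per element.  For n = 1M elements that's ~4096 blocks
// of 256 threads -- tens of thousands of threads in flight on a 240-"core"
// chip, which is exactly what the hardware wants: while some warps stall
// for hundreds of cycles on off-chip memory, others are ready to issue.
void launch_saxpy(float a, float *d_x, float *d_y, int n)  // d_x, d_y already on the GPU
{
    saxpy<<<(n + 255) / 256, 256>>>(a, d_x, d_y, n);
}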

You might have already figured all this out from the manual, but I wanted to emphasize the non-MPI aspects of CUDA since they can be very frustrating for new users.

Yeah, everything is memory bandwidth bound. It’s funny that back in 1990 we knew that would happen, but I’m OK with that since it’s nothing new. I would try to pick the motherboard with the fastest bus in this case. I’m looking at the Asus P6T Deluxe V2 at the moment.

That’s an interesting thought. The problem at the top of my stack is a classic case for MPP - “embarrassingly parallel” with no interprocess communication. At the moment the requirement is about 1MB per FPU, so it would fit on the GTX 285.

Can one mix and match? Say, some GTX 285 cards and one Tesla in the same system? Physically it seems clearly possible; do they coexist well?

When you say that the GTX 285 is faster than Tesla, I don’t understand that from the specs. The GTX 285 has a core clocked at 648 MHz and the Tesla C1060 clocks at 1.3 GHz, so I would have guessed that the Tesla is more than twice the performance of the GTX 285.

Well, that’s OK, I’ve never liked MPI. There is one really nice parallel model I like (symmetric multiprocessing with shared memory), but that’s not very widely available.

Most of the time I will try to just “coast” with Jacket - so it will do all the mapping from MATLAB to CUDA - although to predict performance I will have to know (typically in some cartoon sense) what is “really going on”. From time to time I might end up writing my own CUDA code, but I don’t expect that to be most of the time (although the first project would probably be an exception). Since I am usually working in array languages, operations are already vectorized pretty uniformly, and it’s almost enough if those operations are parallelized in some reasonable manner. That limits the explicit parallelization I have to do “by hand” to a small fraction of the time.

The Quadro FX 1700M is CUDA-enabled, so you could start playing around with CUDA right now if you wanted to (though the performance won’t be even close to what the desktop cards offer). See this page:

http://www.nvidia.com/object/product_quadr…_1700_m_us.html

Unless you know that you’re going to need to process massive datasets on CUDA, don’t worry about getting a Tesla right off the bat. I’d do what seibert said and just get a GTX275 or GTX285 for now, so you can evaluate the real speed of the cards. Then, when you’ve impressed everyone at the office, you can get some higher-end stuff ;)

If you build something though, make sure it has a lot of cooling (several large fans, or even water cooling, if you want to spend that kind of money), and also make sure that it’s got at least 2 or 3 PCI Express slots (so if you want to add another card down the road, you’re not stuck with a single-slot board). Make sure that at least two of the slots run at x16 speed (some 3-slot boards are x16/x16/x8, in which case you could have two compute-only cards and a lower-end card for your display).
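And if you do go with a display card plus compute cards, a little enumeration sketch like this (runtime API; picking by multiprocessor count is just my heuristic) lets you make sure your code lands on the big card:

#include <stdio.h>
#include <cuda_runtime.h>

// Enumerate CUDA devices and pick the one with the most multiprocessors
// (i.e., the GTX 285 rather than a low-end display card).
int main(void)
{
    int count = 0, best = 0, bestMP = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s (%d multiprocessors)\n",
               i, prop.name, prop.multiProcessorCount);
        if (prop.multiProcessorCount > bestMP) {
            bestMP = prop.multiProcessorCount;
            best = i;
        }
    }
    cudaSetDevice(best);   // all subsequent CUDA calls in this thread use this card
    return 0;
}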

If you are pretty familiar with MATLAB, then Jacket might be a good place to start. Eventually, I would think that you will want to start writing and tuning your own kernels in pure C/CUDA though for best performance (once you are familiar with CUDA’s strengths and weaknesses).

Final note… make sure to read the boards as often as possible. There’s a lot to learn from the ‘pros’ and nVidia employees here!

Aren’t the processor cores of the GTX 285 clocked at 1476 MHz?

N.

Once you get your data onto the GPU, this is where it really shines. The on-board memory bandwidth on one of the GTX 200-series (and Tesla) cards is easily more than 5x that of the CPU system memory. It’s fantastic for cases where your working set won’t fit into the L3 cache on the CPU.
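If you want to measure that yourself, a rough check is to time a big device-to-device copy (a sketch; the size is arbitrary):

#include <stdio.h>
#include <cuda_runtime.h>

// Rough device-memory bandwidth check: time a device-to-device copy.
// The copy reads and writes every byte, so effective bandwidth is
// 2 * bytes / time.
int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;   // 64 MB, arbitrary
    float *d_src, *d_dst;
    cudaMalloc((void **)&d_src, bytes);
    cudaMalloc((void **)&d_dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Device-to-device bandwidth: %.1f GB/s\n", 2.0 * bytes / (ms * 1e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}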

Yup, people do it quite frequently. Just make sure you have enough power for all the cards.

You’re comparing different clocks here. NVIDIA GPUs have 3 clocks: core, shader, and memory. The shader clock is the one to pay attention to, because that’s what drives the floating point units. (The core clock seems to control other parts of the chip, but doesn’t directly factor into the FLOPS calculation.) The GTX 285 has a 1476 MHz shader clock, and the Tesla has a 1300 MHz shader clock. This table has a great comparison of the different GPUs:

http://en.wikipedia.org/wiki/Comparison_of…orce_200_Series
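To put rough numbers on it: peak single-precision throughput is approximately 240 SPs × shader clock × 3 flops per cycle (a MAD plus the dual-issued MUL), which works out to about 240 × 1.476 GHz × 3 ≈ 1063 GFLOPS for the GTX 285 versus 240 × 1.3 GHz × 3 ≈ 936 GFLOPS for the C1060. (Those are marketing peaks; real kernels see far less.)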

(If at this point you are wondering why someone would buy the C1060, all I can say is “more RAM, more testing, more enterprisey”. :) )

And just for completeness in this thread: the memory clock on the 285 is 1242 MHz, and on the Tesla it is a whopping 800 MHz. This is the biggest reason that the GTX 285 is so much faster, at least for device-memory-bandwidth-bound steps in apps (a majority, in my experience).
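In numbers: both cards have a 512-bit GDDR3 bus, so peak bandwidth is memory clock × 2 (DDR) × 64 bytes per transfer, i.e. roughly 1242 MHz × 2 × 64 B ≈ 159 GB/s for the GTX 285 versus 800 MHz × 2 × 64 B ≈ 102 GB/s for the C1060.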

I have the following provisional configuration:

Mobo: Asus P6T Deluxe V2

CPU: Core i7 920

RAM: 12GB system (6x2GB DDR3)

Video card: XFX GeForce 9400GT 512MB “Silent”

PSU: Corsair HX 1000 Watt

The quiet system builders will make this system, and I could also have them stick a GTX 285 in as well, which I am seriously considering. They don’t have access to the Tesla card deal.

The motherboard has these slots:

Slot 1 PCI Express x4
Slot 2 PCI Express 2.0 x16
Slot 3 PCI
Slot 4 PCI
Slot 5 PCI Express 2.0 x16
Slot 6 PCI Express 2.0 x16
Slot 7 Not available on this board

So we think this will hold two cards for CUDA - either a pair of GTX 285s, or Teslas, or one of each. I think that will do for my personal computing needs for the foreseeable future.

The idea behind the 9400GT was that it’s better if all the video cards are nVidia, while having a display card that is very quiet and uses little power. Is that a good idea, or should I opt for yet another CUDA-capable card?

Anyone see any obvious blunders?

This setup sounds fine, especially if your target OS is Windows, where mixing manufacturers only works for some OS versions. (I can’t keep track of which among XP, Vista and Win7 support mixed drivers.) I would say try things out with the 9400GT and the GTX 285. If it works well, then think about another 285, assuming your problem can be partitioned across multiple cards easily. Just-in-time purchasing is usually a good idea in this field. :)

I agree with the JIT purchasing for the cards, but I’d like to set the case, motherboard, power supply and cooling in stone. I really don’t mind swapping cards but that’s about the extent of what I want to do with the hardware.

I have come across an alternative motherboard mentioned in another thread on this site, the ASUS P6T7 WS Supercomputer LGA 1366 Intel X58 CEB Intel Motherboard, which seems to have more expandability.

I don’t understand what the trade-offs are with running lots of PCI Express 2.0 x16 cards on these motherboards. In simple terms, I’d like to know: if I felt like it, could I add three CUDA cards to the system with this motherboard?

I think the motherboard we have currently configured will only allow two; and although I intend to start with one, being able to add more could amortize the cost of the host system.

You might also want to check out the GTX 295. From my experience its performance (due to the parameters mentioned in previous posts) is a bit higher than the C1060; however, since you can squeeze two GPUs into one slot (the GTX 295 is a dual-GPU card), overall performance is higher.

It will also show you whether you’re PCIe-bound and help you better understand how your code scales as you throw more GPUs into one machine.
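For instance, a quick sketch to measure what the bus actually delivers (pinned memory gives the best case; the size is arbitrary):

#include <stdio.h>
#include <cuda_runtime.h>

// Measure PCI-Express host-to-device bandwidth with event timers.
// Compare the result against your kernel's data appetite to see
// whether the bus is the bottleneck.
int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;   // 64 MB, arbitrary
    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, bytes);  // pinned memory: best-case PCIe rate
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host-to-device bandwidth: %.2f GB/s\n", bytes / (ms * 1e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}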

eyal

I certainly will check it out. At the moment I am scratching my head over the apparent existence of both 1GB and 2GB versions of the GTX 285. Why would I not want one of the 2GB versions? For example:

MSI N285GTX SuperPipe 2G OC GeForce GTX 285 2GB 512-bit GDDR3 PCI Express 2.0 x16 HDCP Ready SLI Supported Video Card

The GTX 295 does have twice the cores but only 1.8GB memory for 480 cores.

I am tilting toward the idea of not committing to one specific type of card. My understanding is that I can mix and match - so I could start out with a 2GB GTX 285, and then get a 295 if I want more cores, or a Tesla if I want more memory. Then I could tailor the acceleration to the particular problem.

Speaking of particular problems, my test case has been running for 59 hours on a quad 2.53 GHz Core 2. Looks like plenty of room for CUDA to do better.

I was referring to the GTX 295 as something to look into, not because of how much RAM it has but because it is a dual card.

That means that on a board with 2 PCIe slots, you can either have 2 GPUs (2 GTX 285s or 2 C1060s) or 4 GPUs (2 GTX 295s).

Performance-wise you’ll probably be better off with the 295, simply because more is better :)

As for the memory, the GTX 295 has ~900MB per GPU as far as I remember. I guess that if you can fit your algorithm on the GTX 285 with 1GB of RAM, you’ll manage to squeeze it into the GTX 295.

I guess this is not a solution that nVidia likes too much, for obvious reasons :)

That said, nVidia and maybe even end-users would like to see production environments with Teslas and not the GTX line.

eyal

I see versions of the GTX 295 that say they have 1.8GB, and GTX 285s with 2GB.

But the problem is that because the GTX 295 has twice as many cores, it has only half as much memory per core. Total memory is not typically the issue; it’s the memory available to each core (or thread, or whatever your unit of parallelism granularity is).
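Using the figures quoted above, that works out to roughly 1.8GB / 480 cores ≈ 3.8MB per core on the GTX 295, versus 2GB / 240 cores ≈ 8.5MB per core on a 2GB GTX 285. Either way, a problem like the one mentioned earlier that needs about 1MB per FPU fits comfortably.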