A Killer CUDA box

I need to build a machine specifically for running CUDA apps. The app will actually be doing quite a bit of CPU processing as well and requires a LOT of RAM.

My thinking has been to get this board:

Supermicro H8QM3-2 (Super Micro Computer, Inc. - Aplus Products | Motherboards)

And then populate it with 2-4 quad-core Opterons and eventually, I hope, the full 256GB of RAM (though I’m going to wait a year or so until the price of 8GB RAM modules drops significantly).

Currently I’m planning to use two GTX 280s, though this purchase is several months off and I’m wondering if there’s any truth to the rumors of the 350s being released in Q4…

So, here are my questions:

1> Each card is going to peg a core, correct? And it only pegs one core per card? So if I have 2 quad-cores in the machine, I should still have 6 cores available while CUDA calls are being made, right?

2> Is there any way to estimate how quickly this board will transfer memory between host and device? If not, can anyone recommend a board that holds at least 64GB of RAM, has at least two PCIe x16 slots, and is known to perform quick transfers?

3> When using multiple CUDA devices, does anyone have any experience regarding optimizing memory transfers? For example, would it be best to have 2 cores copying data to both cards simultaneously, or would it be better to do some sort of interleaving like this:

Thread 1> Data Copy CPU to GPU#1
Thread 1> Launch kernel — Thread 2> Data Copy CPU to GPU#2
Thread 2> Launch kernel — Thread 1> Data Copy GPU #1 to CPU
Thread 1> Data Copy GPU #2 to CPU
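
In code, what I’m imagining is something like the sketch below. Just a rough sketch of the one-host-thread-per-GPU pattern as I understand it (CUDA wants a separate host thread per device context); the 64MB chunk size and the commented-out kernel are made up:

#include <pthread.h>
#include <cuda_runtime.h>

#define CHUNK_BYTES (64 * 1024 * 1024) /* made-up 64MB chunk size */

/* One host thread per GPU: each thread binds to its own device and runs
   copy -> kernel -> copy. Overlap across the two GPUs comes from the two
   threads simply running concurrently. */
static void *worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev); /* bind this host thread to one GPU */

    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, CHUNK_BYTES); /* pinned host memory */
    cudaMalloc((void **)&d_buf, CHUNK_BYTES);

    cudaMemcpy(d_buf, h_buf, CHUNK_BYTES, cudaMemcpyHostToDevice);
    /* my_kernel<<<grid, block>>>(d_buf);  placeholder kernel launch */
    cudaMemcpy(h_buf, d_buf, CHUNK_BYTES, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}

int main(void)
{
    pthread_t threads[2];
    int ids[2] = { 0, 1 };
    for (int i = 0; i < 2; i++)
        pthread_create(&threads[i], 0, worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(threads[i], 0);
    return 0;
}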

Thanks for any help anyone can provide…

Pete

  1. Not necessarily. You can do async processing and poll the GPU queues. This avoids CPU busy loops.

  2. This board, the Tyan Tempest i5400PW (S5397), has two x16 slots and integrated video. It also takes a truckload of RAM.

http://www.tyan.com/product_board_detail.aspx?pid=560

It’s a Xeon board though.

Jesus Christ, that motherboard is a monster. Note that the form factor is “proprietary.”

For quick transfers, you want a PCIe 2.0 board.
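
If you want to know what a given board actually delivers, the bandwidthTest sample in the SDK measures it, or you can time a big pinned-memory copy yourself. A minimal sketch (the 256MB buffer size is arbitrary):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t bytes = 256 * 1024 * 1024; /* arbitrary 256MB test buffer */
    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, bytes); /* pinned, so DMA runs at full PCIe speed */
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host->device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}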

Yes, NVIDIA will have new cards soon. It’s got to, to compete with ATI. I haven’t heard that they’ll be much better than the current ones. I think they’ll be a die shrink, which is cooler and cheaper to produce. Maybe a tad higher-clocked. After those come out, NVIDIA will use them to make an x2 board. Dunno when, seeing as NVIDIA is fighting tooth and nail with ATI, but I can’t imagine it’d be this year. Then again, ATI already has an x2 board out… In any case, you’ll see other 55nm boards come out first.

Btw, a “350” would be slower than a “280.” It’d probably be a single-slot, 55nm, cooler, cheaper 260.

kristleifur,

Thanks for the info. That Tyan board looks pretty sweet. I’d be okay with going with a Xeon. It’s not like the Opteron 8000 series are exactly cheap.

Does the Async stuff actually work? How do you poll the GPU queues? I missed that…

I did realize that if I don’t call cudaThreadSynchronize(), my app keeps running, though I’ve never checked to see what the CPU usage is in that case.

Alex,

Why would the 350 be slower? I read somewhere that it’ll have 2GB of RAM. If it does, and its speed is even close to the 280’s, then that might be enough to win me over, since my app uses LOTS of RAM and the more threads I can get going concurrently, the better. Cooler and lower power would be nice as well.

I keep wondering if I should go with ATI instead, since they seem to have the speed advantage right now. I’ve started getting comfortable with CUDA, and I hate to change, though.

I’d really like to see a comparison of ATI and nVidia cards in terms of FLOPS per dollar and FLOPS per watt, since that kind of comparison would make it much easier to find the best price point for the $$.

Pete

I think that realistically almost nobody is bound by GFLOPS performance. I think most people are bound by bandwidth to memory.

The GTX 280 gets you 141 GB/s. Okay, it has only 1 GB, but then a C1060 will get you 4 GB at 104 GB/s. As far as I know, the ATIs don’t reach these numbers.

Okay, then ATI won’t interest me. My stuff is definitely bandwidth bound.

The 4870 does 115 GB/s. It only uses a 256-bit bus, but it clocks it much higher than NVIDIA (thanks to GDDR5 RAM).
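
For what it’s worth, the arithmetic is just (bus width / 8) x effective memory data rate; the clock figures below are from memory, so treat them as approximate:

GTX 280: (512 bits / 8) x 2214 MT/s (1107MHz GDDR3, double data rate) = ~141.7 GB/s
HD 4870: (256 bits / 8) x 3600 MT/s (900MHz GDDR5, effectively quad-pumped) = ~115.2 GB/s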

Now, how you would program that thing… is another question. Anyone here try their stuff??

If it does have 2GB of RAM (wow, does ANY game need that??) then yeah, that’d be a clear advantage.

But overall I’m guessing it’d be slower, given many years of GPU model numbering. The “5” is much lower than an “8” (we’re talking mid-range vs high-end), and the half-generation step between the 3xx and 2xx won’t make the mid-range card faster than the old high-end. I’m guessing it’ll have the same performance as a 260. (Plus all that stuff about 55nm, single-slot, a cheaper-to-make version of the G200, etc., fits into this theory. It really shouldn’t be called a 3xx, but they’ve been incrementing that first number willy-nilly these days. Since the 2xx is getting whooped by ATI, the better competitor obviously has to be 3xx.)

Btw, it sounds like you haven’t even started writing your CUDA app. Dude… don’t get all excited and splurge on a “killer rig” before you even start writing the thing that’s gonna run on it.

You use stream objects, AFAIK. Haven’t done this yet, but you create these stream objects, which are sort of an event queue. Then you stack async ops in a stream to define what order they must be executed in if they are dependent. Independent ops go in different streams. Check the async sample project in the SDK for some stream magic.

Edit - AFAIK you just call a function on the stream - Hey, stream, are you up to here yet? If the stream says Dude, yes, you do something with the data. If not, you can sleep the CPU thread and return to processing whatever else threads n’ processes you’ve got cooking.
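
So roughly like this, I think. cudaStreamQuery is the “are you up to here yet” call; the commented-out kernel is a placeholder and the buffers are assumed to be allocated already (h_buf pinned via cudaMallocHost):

#include <unistd.h>
#include <cuda_runtime.h>

/* minimal polling sketch: enqueue async work on a stream, then poll
   instead of blocking, so the CPU core stays free for other processing */
void run_async(float *h_buf, float *d_buf, size_t bytes)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* everything below is queued on the stream and returns immediately */
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    /* my_kernel<<<grid, block, 0, stream>>>(d_buf);  placeholder */
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

    /* "hey, stream, are you up to here yet?" -- cudaSuccess means done,
       cudaErrorNotReady means it's still going */
    while (cudaStreamQuery(stream) == cudaErrorNotReady)
        usleep(100); /* sleep instead of busy-waiting; do other work here */

    cudaStreamDestroy(stream);
}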

It’s not an application so much as a group of related applications, really, and they’re about a month from completion. The CUDA part is actually only a very small piece, code-wise. It just happens to be where my app will be spending 99% of its time, and the CUDA stuff is actually already written. It’s the rest of the app that’s taking time. The CUDA code itself is pretty trivial; it just needs to process mountains of data.

That’s why I’m still at a point where I’m willing to entertain using ATI cards. Two GTX 280s won’t give me quite the performance I want, but they’ll give me the most performance I can afford right now. For the performance I want, I’d need 10 or more of those machines.

Thanks for the information. I haven’t played with the stream stuff yet. I’ll look into it.

Alex, you said the 350 would be slower than the 280. From the reports I’ve seen, that won’t be the case.

The specs posted on Wikipedia (though their source is questionable, at best) claim 2GB of GDDR5 with 216 GB/s bandwidth, an 830MHz core clock, and a 2075MHz shader clock.

If the real numbers are anywhere near that, it’ll dust the 280.

216 GB/s bandwidth would mean I would be buying a few. When should this 350 be coming???

It’s all speculation at this point, I think. My understanding is that an Australian vendor posted pre-order information for the 350, but from what I can tell, though the item is still listed, the details page is missing.

The speculation is that it would be Q4 this year, so sometime between now and the end of the year. The price listed, I think, was about $800 Australian, which in real money :D is ~$650 US.

There’s also something regarding 270 and 290 models appearing in Q1 and Q2 of 2009, respectively, though no specs.

Oh, well, if it’s on Wikipedia… :P

Actually, Wikipedia lists references… and they point to a dinky Australian online store. (see here: http://www.tweaktown.com/news/10187/e_tail…u_information/)

This card sounds like an x2. 480 shaders = 240 x 2, 2GB = 1GB x 2, etc. Next to it in the same e-shop is the “280+”, which is just listed as the 55nm version of the 280.

It’s exactly what I was saying in post #3. First comes the 55nm 280, then soon after, the x2. I still doubt it’s going to be available (i.e. not paper-launched) in the next three months, and especially that it’ll be called an x5x. I also don’t believe the clocks are gonna be higher than on a 280 (the heat is already killing the 280, and 55nm is supposed to reduce that twofold?)

P.S. It looks like NVIDIA might cut the bus width down to 256 bits per card and use GDDR5, just like ATI. It’d be a smart move, money-wise.

Well, first of all, note that I did say, “(though their source is questionable, at best)”. That said, Austin Computers isn’t really rinky-dink. They’re pretty big. Someone from Australia might be able to correct me, but I think they’re more like the NewEgg of Australia.

I don’t think the 350 is even based on the 200 chips. I think it’s supposed to be a different chip.

There’s also the supposedly forthcoming 270 and 290. The 270 looks like Austin’s 280+.

The 290 (on Wiki) says 40nm…

I’m going to hold out a bit and see what happens with the 350. Hopefully, at the very least, some real info might come out of nVidia about it before too long. If those specs are even close and the price point is around $650, then that’s definitely worth waiting for.

Trust me. It’s going to be the 55nm 280/260, then a 55nm 280/260 x2. If you believe Wikipedia that the 55nm 280/260 (the 270 or whatnot) comes out in Q1 '09, then there’s nothing, nothing coming out Q4 '08. Certainly not a new architecture. Don’t be silly.

Austin Computers isn’t really rinky-dink, I agree. But they aren’t that big; they’re more like a NewEgg for Western Australia only.

I’ve bought a few pieces from them before and picked them up from their store.