I’m looking at building two CUDA GTX 280 nodes for testing and wanted to know the maximum number of 280s you can fit into a box before you start losing efficiency due to overhead. I’ve seen 3 of these cards in a single box before, but I think you end up losing half of your bandwidth on two of the cards because of the 16/8/8 bus layout. I could be wrong, though; that’s what I’m wondering.
Also, is it more cost effective to build single-280 nodes than nodes with more than one card in them, given the additional cost of the MB and PSU?
They’re not double-wide slots, if I remember correctly, and aren’t you limited to PCIe Gen1?
Now, if you’re at NVISION, stop by the NV booth–we’re demoing a four-GT200 (one GTX 280 and three Tesla C1060) system running NAMD and VMD. I’ve done a rather obscene amount of research into building systems like this (I actually built this one over the past month or so), so I’d be more than happy to talk to you about how it works and what it uses. If you’re not at NVISION, you’re more than welcome to email or PM me about this.
If there is a large amount of data to get to the cards (i.e. you’re I/O bound), you’re realistically looking at 2. The 3-way SLI boards have the 3rd slot off of the south bridge, so you’ll bottleneck there. Check out nVidia’s diagrams and docs for the 790i Ultra SLI, and Intel’s for the P45 and X48. Forget Skulltrail. One thing all the people who drool over it fail to recognize is that the PCIe slots have 1.1a lanes, meaning half the per-lane transfer rate (and throughput) of 2.0 lanes. Further, it uses DDR2 800 MHz. Two bottlenecks right there. Get a nice motherboard with two 16-lane PCIe 2.0 slots off of the north bridge and DDR3 1800+, end of story.
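For anyone who wants to put numbers on the Gen1 vs Gen2 and x16 vs x8 comparison, here is a rough sketch of the theoretical per-direction link bandwidth. It assumes only the published signalling rates (2.5 GT/s per lane for Gen1, 5 GT/s for Gen2) and 8b/10b encoding; the function name is just for illustration, and real sustained transfer rates will come in below these peaks.

```python
# Theoretical one-way PCIe bandwidth, before packet/protocol overhead.
GT_PER_S = {1: 2.5, 2: 5.0}   # raw per-lane transfer rate (GT/s) for Gen1 and Gen2
ENCODING = 8 / 10             # 8b/10b line encoding used by Gen1 and Gen2

def pcie_bandwidth_gb_s(gen, lanes):
    """Peak one-way bandwidth in GB/s (10^9 bytes/s) for a Gen1 or Gen2 link."""
    bits_per_s = GT_PER_S[gen] * 1e9 * ENCODING * lanes
    return bits_per_s / 8 / 1e9

if __name__ == "__main__":
    for gen, lanes in [(1, 16), (2, 8), (2, 16)]:
        print(f"Gen{gen} x{lanes}: {pcie_bandwidth_gb_s(gen, lanes):.1f} GB/s")
```

On paper, then, a Gen1 x16 slot and a Gen2 x8 slot have the same peak, and both are half of a Gen2 x16 slot.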
Raw crunching with little I/O? Anything that mechanically fits a GTX 280 and is wired for any number of lanes. There are a few motherboards that do 16x, 16x, 8x, 8x (note the last two are off the southbridge).
I have an implementation of Block Wiedemann that requires matrix-vector products to be computed on blocks of vectors that are ~4 GB combined and a matrix that is ~16.5 GB. So yeah, it’s I/O bound and I have to do the DMA memory shuffle. But it also runs ~20x faster than doing the same on all four cores of the Q9450 in the system.
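To get a feel for why a workload like this is I/O bound, here is a back-of-the-envelope sketch of how long it takes just to push the matrix across PCIe once. It is not the actual implementation: it assumes the matrix is split evenly across the GPUs, that each GPU sits on its own independent link, and that the link runs at its theoretical peak (4 GB/s for Gen1 x16 or Gen2 x8, 8 GB/s for Gen2 x16). The function name and the even split are illustrative assumptions.

```python
def stream_time_s(data_gb, link_gb_s, n_gpus=1):
    """Seconds to move data_gb over PCIe, split evenly across n_gpus independent links.

    Assumes every link sustains link_gb_s (theoretical peak), which real
    transfers will not quite reach.
    """
    return data_gb / (link_gb_s * n_gpus)

MATRIX_GB = 16.5   # matrix size quoted in the post above

if __name__ == "__main__":
    for label, link in [("Gen1 x16 / Gen2 x8", 4.0), ("Gen2 x16", 8.0)]:
        for n in (1, 2):
            t = stream_time_s(MATRIX_GB, link, n)
            print(f"{label}, {n} GPU(s): {t:.2f} s per pass")
```

Under those assumptions a single card on a Gen2 x16 slot needs roughly 2 seconds per pass over the matrix for transfers alone, and halving the link speed doubles that, which is why the slot wiring matters so much for this kind of job.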
I’m not sure this is a fair evaluation of Skulltrail. It has four 16x PCIe 1.1 slots and will take double-width cards in three of them. Also, the i5400 northbridge it uses does provide two 16x PCIe 2.0 interfaces, so it’s possible that the total bandwidth is still there but divided between four slots instead of two (I’m not sure about that, though; I think the switches may only support PCIe 1.1 on the upstream side as well as the downstream side). Finally, I don’t think the point about memory bandwidth is valid, since Skulltrail uses quad-channel memory.
Having said that my Skulltrail board is still in its box and I’m using regular i5400 boards instead.