4 Teslas on a system: Any known performance issues?

We are planning to get a system with four C1060s through a vendor recommended by NVIDIA ( http://www.nvidia.com/object/tesla_supercomputer_wtb.html ). Some of the specs: NVIDIA 780a chipset AM2+/AM2/Phenom motherboard with an AMD Phenom II X4 920 2.8GHz processor and 16 GB RAM.

Are there any known performance issues, in terms of speed or otherwise?

On paper it should perform better than an S1070, due to the fact that each device gets a dedicated PCI-Express slot (maybe the chipset has 4 ways?)

Gaetano

Hi,

I’ve recently posted our findings (on a system with GTX295s, not the Teslas, but I guess with the Teslas it would be easier, mostly because of PCI issues).

Take a look at: http://forums.nvidia.com/index.php?showtopic=98522

good luck :)

eyal

Actually, probably not any better. I’m not aware of a motherboard that has enough PCI-Express lanes to supply 16 lanes to 4 separate cards. All motherboards just automatically downgrade the slots when more than 2 devices are present. The only 780a motherboard with 4 slots I could find quickly was the Foxconn Destroyer, which when populated with 4 cards drops all slots to 8 lanes. In fact, the S1070 could be faster in some cases because each link is x16, so if only one of the two cards sharing the link is transferring data, it gets the full x16 bandwidth.
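To put rough numbers on the comparison above, here is a minimal Python sketch. It assumes PCI-e 2.0 links at the theoretical peak of 500 MB/s per lane per direction (5 GT/s with 8b/10b encoding) and an idealised even split when two cards share a link. The function names are made up for illustration, and real transfers will come in below these figures.

```python
# Theoretical PCIe 2.0 peak: 5 GT/s per lane with 8b/10b encoding,
# i.e. 500 MB/s of payload per lane in each direction.
PCIE2_MB_PER_S_PER_LANE = 500


def link_bandwidth_mb_s(lanes):
    """Peak one-direction bandwidth of a PCIe 2.0 link with `lanes` lanes."""
    return lanes * PCIE2_MB_PER_S_PER_LANE


def per_card_bandwidth_mb_s(lanes, cards_on_link, cards_transferring):
    """Peak bandwidth each transferring card sees on a (possibly shared) link.

    Assumes the link is split evenly among the cards that are
    transferring at the same moment, which is an idealisation.
    """
    if not 1 <= cards_transferring <= cards_on_link:
        raise ValueError("cards_transferring must be between 1 and cards_on_link")
    return link_bandwidth_mb_s(lanes) / cards_transferring


if __name__ == "__main__":
    # Four cards, each on its own x8 slot (the 4-slot 780a board fully populated).
    print("dedicated x8:         ", per_card_bandwidth_mb_s(8, 1, 1), "MB/s")
    # Two cards sharing one x16 link (S1070 style), both transferring.
    print("shared x16, both busy:", per_card_bandwidth_mb_s(16, 2, 2), "MB/s")
    # Same shared link, only one of the two cards transferring.
    print("shared x16, one busy: ", per_card_bandwidth_mb_s(16, 2, 1), "MB/s")
```

Under these assumptions the two layouts tie when every card is transferring at once, and the shared x16 link pulls ahead only when one of its two cards is idle.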

Intel Skulltrail has 4 16x slots, though I think there’s only spacing for 3 double-wide cards.

I saw you had 16 GB RAM. Do you know if it is possible to install more, say 32 GB?

Huh, interesting. Reading that diagram, it looks like they did it by putting bridge chips on two x16 links to give you 4 slots. So in that case, accessing the 4 cards, aside from the physical spacing problem, should have the same performance characteristics as the S1070. (But not the bandwidth of four dedicated x16 links. Not that many memory subsystems could sustain that much data transfer between host and device.)

Yeah, effectively Skulltrail used exactly the same 16 lane PCI-e 2.0 switching arrangement as the S1070 and the GTX295. It just has the switching logic on the motherboard, rather than on an outboard card.

Right now the “best” chipsets for total PCI-e 2.0 lane count are the AMD 790FX, which gives 32 lanes for GPUs (plus an additional 6 for non-GPU slots), and the Intel X58, which has 36 total lanes. Both will configure to 4 PCI-e x8 for quad GPU support.
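The lane arithmetic behind that x8 figure can be sketched in a few lines of Python. This is only an illustration: the list of slot widths and the function name are assumptions, and what a given board actually does depends on how the vendor wired its slots, not just on the chipset's lane count.

```python
# Slot widths considered by this sketch (an assumption; widest first).
SLOT_WIDTHS = (16, 8, 4, 2, 1)


def widest_equal_split(total_lanes, num_gpus):
    """Return the widest slot width that lets num_gpus slots fit in total_lanes."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    for width in SLOT_WIDTHS:
        if width * num_gpus <= total_lanes:
            return width
    raise ValueError("not enough lanes for even x1 per GPU")


if __name__ == "__main__":
    # 32 GPU lanes (790FX figure quoted above) and 36 lanes (X58 figure).
    for lanes in (32, 36):
        for gpus in (2, 4):
            print(f"{lanes} lanes, {gpus} GPUs -> x{widest_equal_split(lanes, gpus)} each")
```

With four GPUs both lane counts land on x8 per card, since four x16 links would need 64 lanes.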

No, I don’t think there was room for more. However, the GPU code ran on the GTX295 with <1GB RAM. What I’m trying to say is that if your algorithm can run even on a Tesla (with 4GB), obviously you don’t need more than that on the CPU (or twice that for double buffering the data)…

eyal
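As a back-of-the-envelope version of that sizing rule, here is a small Python sketch. It assumes the host only needs to mirror each card's memory (twice over when double buffering), scales the rule to all four cards, and ignores the OS and the rest of the application. The function name is hypothetical.

```python
def host_ram_needed_gb(num_gpus, gpu_mem_gb, double_buffered=False):
    """Host RAM needed to stage data for every GPU, per the rule of thumb above.

    One host-side copy of each card's memory, or two copies per card
    when the data is double buffered. OS and application overhead are
    deliberately left out.
    """
    copies_per_gpu = 2 if double_buffered else 1
    return num_gpus * gpu_mem_gb * copies_per_gpu


if __name__ == "__main__":
    # Four Teslas with 4 GB each, as in the system discussed in this thread.
    print("single buffered:", host_ram_needed_gb(4, 4), "GB")
    print("double buffered:", host_ram_needed_gb(4, 4, double_buffered=True), "GB")
```

By this estimate 16 GB covers four 4 GB cards with single buffering, and 32 GB would only matter if every card's full memory is double buffered on the host.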