how to achieve equal bandwidth to 3 GPUs in CUDA? (searching for recent motherboards)

CUDA users wanting maximum performance from multi-GPU systems worry about efficient PCIe transfers to/from the CPU.
In that respect, there may be a big problem with the new X58/i7 systems: there is no FSB, the memory controller moves onto the CPU, and PCIe hangs off the X58 IOH over QPI, which sounds great (one less thing to cool?). Alas…

It is revealing to compare the EVGA 790i nForce SLI FTW that I have in my system (described in the link below) with the X58 motherboards from ASUS (P6T X58 Deluxe) and Gigabyte (GA-EX58-Extreme or UD5). The latter have 3 physical x16 slots just like the EVGA but, unlike it, do not provide enough bandwidth to use them simultaneously: you can only get full (PCIe 2.0) x16 throughput on the first two slots, and if you want a 3rd card, both the SECOND and THIRD slots must run slower, at x8. Thus, according to their documentation, they are WORSE than the EVGA 790i system. That would be very sad.
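As a back-of-the-envelope check on those slot configurations, the theoretical peak payload rates per slot (after 8b/10b encoding; real pinned-memory transfers land noticeably lower) can be sketched like this (the helper name is just for illustration):

```python
# Theoretical PCIe payload bandwidth after 8b/10b encoding:
# Gen1 = 250 MB/s per lane per direction, Gen2 = 500 MB/s per lane.
PER_LANE_MB_S = {1: 250, 2: 500}

def slot_bandwidth_gb_s(gen, lanes):
    """Peak one-directional bandwidth of a PCIe slot in GB/s."""
    return PER_LANE_MB_S[gen] * lanes / 1000.0

# The slot layouts discussed above:
print(slot_bandwidth_gb_s(2, 16))  # x16 Gen2 -> 8.0 GB/s peak
print(slot_bandwidth_gb_s(2, 8))   # x8  Gen2 -> 4.0 GB/s peak
print(slot_bandwidth_gb_s(1, 16))  # x16 Gen1 -> 4.0 GB/s peak
```

So a Gen2 x8 slot and a Gen1 x16 slot have the same theoretical peak, which is why the 16/8/8 and 16/16/x16-Gen1 layouts come out so close on paper.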

The ASUS Striker II Extreme is a Socket 775 board similar to my EVGA, and from the documentation it seems to have very similar PCIe capabilities: two slots at full (2.0) x16 speed, one middle slot at (1.0) x16 speed. I like its cooling scheme for the chipset.

The X58 manufacturers are apparently under no pressure to widen PCIe, since SLI/CrossFire takes over from the PCI Express bus the duty of gluing the cards together. But why would they go back and reduce the PCIe throughput compared to the 790i? No idea, unless this is a limitation of the QPI section on the CPU.

Do you know of any recent motherboards with a large number of PCIe lanes that can (ideally) drive 3 CUDA (non-SLI) cards at the same speed of about 6 GB/s each?

Related to the PCIe issue is the prospect of computing with the GTX 295 dual-GPU card. I wonder if/how the lower bandwidth per GPU will affect actual CUDA applications. If it doesn't (e.g., for applications sending only short bursts of data), then maybe I worry about the bandwidth too much…
Those of you who get this new card, please report the results.

I think that you are looking into bandwidth too much.

For example, on the S1070 there are 2 GPUs per daughter card, and the daughter cards can be had in x8 or x16 flavor (x4 or x8 per GPU), and the performance is still very good.

I have a hard time believing that today's GPUs are even coming close to saturating the PCIe bus, as we haven't even seen the advantage of PCIe 2.0 come into play yet.

However, there are i7 boards that use the NVIDIA NF200 chip to give 3 x16 lanes to the GPUs. I believe the ASUS P6T6 is the only board so far to use this chip, though I could be wrong.

Although not compute-related, here is a good link to a review of an NF200 X58 vs. a bare X58: …W50aHVzaWFzdA==

If you do some searching, you will also find that gaming SLI performance went up on the X58 chipset compared to the 790i.

This is untrue. If you need 3.5 GB on the GPU for any given computation (e.g., streaming is impossible), you will certainly see a performance increase by going to PCIe 2.0 x16 compared to 2.0 x8 or 1.0 x16. I forget exactly where the bus saturates once you take signaling overhead into account, but we're not at all far from filling it.
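As rough arithmetic (using theoretical peak rates; achieved bandwidth with pinned memory is lower), staging that 3.5 GB over the different link configurations mentioned above looks like this:

```python
def transfer_time_s(size_gb, gen, lanes):
    """Time to move size_gb over a PCIe link at its theoretical peak.
    Gen1 = 0.25 GB/s per lane, Gen2 = 0.5 GB/s per lane (8b/10b encoding)."""
    per_lane_gb_s = {1: 0.25, 2: 0.5}[gen]
    return size_gb / (per_lane_gb_s * lanes)

size = 3.5  # GB to stage onto the GPU before the computation
print(transfer_time_s(size, 2, 16))  # Gen2 x16: 3.5/8.0 ~= 0.44 s
print(transfer_time_s(size, 2, 8))   # Gen2 x8:  3.5/4.0 ~= 0.88 s, twice as long
print(transfer_time_s(size, 1, 16))  # Gen1 x16: same peak as Gen2 x8
```

The half-second difference per staging pass is exactly the kind of gain the post above describes.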

pawel, the effects of bandwidth vary with the application. mfatica has been testing an application on a four-GPU platform of mine today, and doubling the PCIe bandwidth gave him a 30% performance increase. There are motherboards with more PCIe lanes coming; just be patient. Keep in mind that I think you're not able to saturate three x16 links on the 790i because of the FSB, though, so in reality X58 with x16/x16/x8 is probably no worse for concurrent bandwidth.

(did I ever post my concurrent bandwidth test? I probably should…)

This board: …&modelmenu=1 is advertised to have 3 true x16 Gen2 slots (besides the other 3 x16→x8 Gen2 slots)… I don't know if they can really get that performance, but I would be very interested to know. ;-)
Especially a test with 6 single-slot water-cooled cards (are there single-slot water blocks available for the GTX 280/260?) would be awesome. *g*

As far as I know, X58 has 36 PCIe lanes, so you'll never see 3 full x16 PCIe slots with Bloomfield.

hmm, then why are they emphasizing this thing everywhere on their page?

Because of the NF200 chip that they are putting on the board.

This is the same chip that NVIDIA used to make the 780i a “true 3 x16” board.

I am a little in the fog about how it works, but my assumption is that it just acts as another southbridge for more PCIe lanes, or does some sort of compression. Maybe Tmurray could expand a little on this.

The NF200 chip is a so-called multiplexer, or switch if you want: it takes 16 PCIe lanes and allows two devices to be connected to it using 16 lanes each. Effectively it doubles the number of lanes on one side. Naturally this doesn't change the number of lanes connected to the chipset, and as can be seen in an NF200-related benchmark such as the one posted today on the HardOCP website, the NF200 chip introduces enough latency that in a 3-way SLI setup the NF200 board with triple x16 slots is always a few frames behind the plain X58 board with x16/x8/x8 slots.
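A toy model of that sharing behavior (hypothetical helper, with 8 GB/s standing in for a Gen2 x16 link's theoretical peak) makes the tradeoff concrete: one active GPU sees the full upstream link, but concurrent traffic splits it.

```python
def concurrent_bw_per_gpu(n_active, upstream_gb_s=8.0, downstream_gb_s=8.0):
    """Per-GPU bandwidth behind an NF200-style switch.
    A lone GPU can use the whole upstream x16 link; several active
    GPUs must share it, however wide their own slots are."""
    return min(downstream_gb_s, upstream_gb_s / n_active)

print(concurrent_bw_per_gpu(1))  # one GPU talking: full 8.0 GB/s
print(concurrent_bw_per_gpu(2))  # both talking at once: 4.0 GB/s each
```

So the "true x16" label holds for any single transfer, while aggregate bandwidth is still bounded by the chipset-side lanes.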

NF200 doesn’t introduce any performance penalty for CUDA. It certainly doesn’t affect bandwidth: with a Quadro Plex D2 (so dual GT200s behind an NF200), my concurrent bandwidth test shows identical total bandwidth when testing either GPU individually or both GPUs simultaneously, and peak bandwidth to a single card was identical to a C1060 in the same slot.

But it doesn’t sound like you tested latency (i.e., small transfer sizes).

You are worrying about it too much. Not because no application needs as much bandwidth as it can get, but because you have no idea whether yours does.

There are three independent factors to consider:

  1. Bandwidth to any one card (i.e., large-transfer performance)

  2. Latency to any one card (i.e., small-transfer performance)

  3. Bandwidth to all cards at once.

#1 is concerned with x8 vs x16. Here, NF200 helps a lot.

Yet #3 is usually more important than #1. This is because if you're running an algorithm where #1 really matters, where you're doing a lot of transfers during computation, you're doing them to all of your cards. Moreover, #3 is bottlenecked on many systems, because the system RAM is barely sufficient for one PCIe 2.0 x16 link. X58, given its triple-channel DDR3, has a clear lead in this situation (despite QPI itself being somewhat of a bottleneck). In addition, with regard to #3, the NF200 helps less.
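To put rough numbers on the system-RAM point (an illustrative sketch with theoretical peaks, assuming DDR3-1066; actual board speeds vary), every PCIe transfer also touches host memory, so DRAM bandwidth caps the concurrent total:

```python
def ddr3_bw_gb_s(mt_per_s, channels):
    """Theoretical DRAM bandwidth: DDR3 moves 8 bytes per channel
    per transfer, at mt_per_s mega-transfers per second."""
    return mt_per_s * 8 * channels / 1000.0

pcie2_x16 = 8.0  # GB/s theoretical peak per direction

print(ddr3_bw_gb_s(1066, 2))  # dual-channel DDR3-1066:   ~17.1 GB/s
print(ddr3_bw_gb_s(1066, 3))  # X58 triple channel:       ~25.6 GB/s
print(ddr3_bw_gb_s(1066, 3) / pcie2_x16)  # headroom for ~3 busy x16 links
```

This is only headroom on paper; real memory controllers deliver well under the peak, which is why #3 gets bottlenecked in practice.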

But for a good number of algorithms it's #2 that's the problem. I don't know specifically how it's affected, but both the NF200 and QPI add latency. The question is whether either of those is comparable to the overhead of the CUDA software stack itself.

Of course, the most efficient algorithms that show the biggest boosts don’t care about any of these. They spend their time processing data, not shuffling it across the bus.
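A quick way to estimate whether your own application falls into that category is a ratio sketch like this (all numbers hypothetical, for illustration only):

```python
def transfer_fraction(bytes_moved, bw_gb_s, compute_s):
    """Fraction of total runtime spent on the PCIe bus,
    assuming transfers and compute do not overlap."""
    t_xfer = bytes_moved / (bw_gb_s * 1e9)
    return t_xfer / (t_xfer + compute_s)

# e.g. 100 MB staged per 1 s of kernel work, at ~5 GB/s effective:
f = transfer_fraction(100e6, 5.0, 1.0)
print(f)  # ~0.02: only ~2% of runtime on the bus
```

If that fraction is a couple of percent, halving the link speed is nearly invisible; if it's tens of percent, the x8-vs-x16 question suddenly matters a lot.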

ah, this clears things up, thanks! :-)

So 2 slots share a combined x16, but if only 1 communicates, it gets the full bandwidth… the third one is x16 directly off the X58, I assume.

That's still a very nice setup, better than the usual x16/x8/x8.

Now the only remaining question is: can you actually reach ~11 GB/s of concurrent bandwidth with Nehalem?

Haven’t had a chance to put enough cards into a Nehalem to find out yet.

Thanks, guys!! Almost all my prayers seem to be answered with the P6T6 WS Revolution!
It should be ideal for the production-run machine I have in mind. I love its stability at an ambient T = 60 °C during testing: board components heated up to 77 °C, some to 80+ °C, and still worked (and the chipset is passively cooled).

It’s true that before porting my hydrocodes to CUDA, I don’t even know whether bandwidth is the bottleneck. But now I see no reason to hesitate… even disregarding the multiplexer, this board would be a clear choice, especially for an air-cooled system with 295s.