Low Aggregate PCIe Bandwidth for 9800 GX2

Hello all,

I have a 9800 GX2 that I’m trying to exercise at full PCIe v2 bandwidth. I have verified that the motherboard (MSI P7N Diamond) supports PCIe v2, and the card itself is PCIe v2. Running the bandwidth test with device=all yields 6+ GB/s, but that test runs on each GPU serially rather than in parallel. When I run my own test with both GPUs transferring at the same time, I get 3 GB/s total, which is the same as transferring to a single GPU, and also the same as a PCIe v1 card would get.

Is this a problem with my motherboard? My tests? My assumptions? I am assuming that both GPUs should be able to upload at the same time and reach 6 GB/s aggregate, and the same for download.

Matthew

Well, yeah, that assumption is wrong. A 9800 GX2 is two G92 chips sharing a single slot. PCIe 2.0 gets you about 6 GB/s per slot. Ergo, when streaming data to both GPUs simultaneously, you get 3 GB/s per GPU.
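For reference, the lane arithmetic behind that figure (assuming the usual ~20–25% packet/protocol overhead for large transfers):

$$16\ \text{lanes} \times 500\ \mathrm{MB/s} = 8\ \mathrm{GB/s\ raw} \;\Rightarrow\; \approx 6\ \mathrm{GB/s\ effective},\qquad 6\ \mathrm{GB/s} \div 2\ \mathrm{GPUs} = 3\ \mathrm{GB/s\ per\ GPU}.$$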

Well, my assumption was 6 GB/s aggregate, not per GPU. As it is, I get 3 GB/s aggregate across both GPUs simultaneously. That is, when both are going at the same time, I get 1.5 GB/s per GPU.

Okay, let me make sure I understand what you’re saying…

Running bandwidthTest on each chip independently = 6 GB/s per chip.

Running your own bandwidth test on both chips simultaneously = 1.5 GB/s per chip?

Haha, sorry, I’m suffering from imprecise language. :) The bottom line is: I never get more than 3 GB/s over the bus at any given moment.

Running bandwidthTest on one chip: 3 GB/s

Running bandwidthTest on two chips (which does this serially for each chip): 6 GB/s (3 GB/s per chip, not simultaneously)

Running a test where transfers go to both chips simultaneously: 3 GB/s (1.5 GB/s per chip simultaneously)

Double check that PCIe 2.0 is enabled in the BIOS, then. I’ve seen some boards where for whatever reason it’s not enabled by default.

I’ve heard this from a few people, but unfortunately I cannot find any option pertaining to PCIe v2 in the BIOS. I’ve combed through it several times.

Would someone be willing to run the benchmark I’ve concocted on their GX2? I have attached the source, renamed from a .cu to a .txt extension. Just compile with -lpthread.
bandwidthTest.txt (1.56 KB)
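In case the attachment doesn’t open, the gist of it is one pthread per GPU, each pushing pinned host memory to its device while the other does the same. Here is a minimal sketch of that approach (constants, names, and the device list are illustrative, not the exact attached code):

```c
/* Sketch of a concurrent host-to-device bandwidth test: one pthread per
 * GPU, pinned host buffers, both GPUs copying at the same time.
 * Build (hypothetical file name): nvcc -O2 -o bwtest bwtest.cu -lpthread */
#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>
#include <cuda_runtime.h>

#define SIZE (32 * 1024 * 1024)   /* 32 MiB per copy, like bandwidthTest */
#define REPS 64                   /* copies per GPU in the timed loop */

static pthread_barrier_t ready;   /* so setup cost isn't timed */

static void *worker(void *arg)
{
    int dev = *(int *)arg;
    void *h, *d;

    cudaSetDevice(dev);                /* bind this thread to one GPU */
    cudaMallocHost(&h, SIZE);          /* pinned host memory */
    cudaMalloc(&d, SIZE);
    cudaMemcpy(d, h, SIZE, cudaMemcpyHostToDevice);   /* warm-up */

    pthread_barrier_wait(&ready);      /* start copying in lockstep */
    for (int i = 0; i < REPS; ++i)
        cudaMemcpy(d, h, SIZE, cudaMemcpyHostToDevice);

    cudaFree(d);
    cudaFreeHost(h);
    return NULL;
}

int main(void)
{
    int ids[2] = {0, 1};               /* the two GX2 chips */
    pthread_t t[2];
    struct timeval t0, t1;

    pthread_barrier_init(&ready, NULL, 3);   /* 2 workers + main */
    printf("Testing %d and %d\n", ids[0], ids[1]);
    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, worker, &ids[i]);

    pthread_barrier_wait(&ready);      /* both GPUs set up: start clock */
    gettimeofday(&t0, NULL);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    double mb   = 2.0 * REPS * SIZE / (1024.0 * 1024.0);  /* both GPUs */
    printf("%f MB/s\n", mb / secs);    /* aggregate bus bandwidth */
    return 0;
}
```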

Here is the result on my 9800 GX2 system (780i MB):

Testing 0 and 1

3953.766717 MB/s

This is right in line with the single GPU performance as expected:

Quick Mode

Host to Device Bandwidth for Pinned memory

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432               3952.1

Quick Mode

Device to Host Bandwidth for Pinned memory

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432               3874.4

Just for fun, I also modified your bandwidth test file to run on GPUs 1 and 2, which are the 2nd GPU of the 9800 GX2 and an 8800 GTX (G92) in the 2nd PCI-e x16 slot. Thus, this test removes the effect of the shared PCI-e slot for the 9800 GX2 and benchmarks how fast 2 GPUs can be fed with the full resources available.
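In terms of the sketch above, that was just a change to the device list (a hypothetical one-liner, since the actual attachment may be structured differently):

```c
int ids[2] = {1, 2};   /* 2nd chip of the GX2 + the GPU in the other slot */
```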

Not surprisingly, the results are:

Testing 1 and 2

4010.857409 MB/s

Why am I not surprised that this isn’t faster? 1) The 780i MB runs a single PCIe link to the NF200 chip, which drives both x16 PCIe v2 slots, so it cannot possibly feed both cards at full bandwidth. 2) System memory speed also starts to become a limiting factor at these rates.

I’ve seen bandwidth tests that show high bandwidth to a single-chip PCIe v2 card, like this: http://forums.nvidia.com/index.php?act=Pri…er&f=64&t=71468
On average, 5.2 GB/s to a single GTX 280.

Also, here is a test done on a 9800 GTX:
http://forums.nvidia.com/lofiversion/index.php?t70042.html

It seems as though each chip can only use eight lanes of PCIe, and only one chip or the other can transfer at a time… meaning that full PCIe v2 speeds aren’t achievable.

I found another machine to plug this card into, and the results are much the same.

More thoughts?


That doesn’t make any sense. The benchmarks I ran got nearly 4 GiB/s to a single GPU, which is the theoretical peak of an x8 PCIe v2 connection. There is a packet overhead of ~15% on each transfer, so I could not possibly have gotten that performance over an x8 connection to a single GPU.
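Spelling that argument out (x8 PCIe v2 is 8 lanes at 500 MB/s each, and using the ~3950 MB/s single-GPU figure from above):

$$8 \times 500\ \mathrm{MB/s} = 4\ \mathrm{GB/s\ raw};\qquad 4\ \mathrm{GB/s} \times (1 - 0.15) \approx 3.4\ \mathrm{GB/s\ usable} < 3.95\ \mathrm{GB/s\ measured},$$

so the link feeding each GPU has to be wider than x8.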

http://www.bit-tech.net/hardware/2008/03/1…graphics_card/2 details that the switch connecting the two GPUs in the 9800 GX2 delivers x16 bandwidth to each GPU.

As I outlined in my post above, the chipset I’m running is to blame for the ~4 GiB/s throughput. Every report of 5–6 GiB/s I’ve seen on these forums was running on an Intel PCIe v2 chipset.

On such a board, I bet it would also sustain ~6 GiB/s to a single GPU, or 6 GiB/s total spread evenly across the 2 GPUs. I just don’t have such a system to test on.

I figured an update is in order, as I have confirmed that the chipset was the culprit: I bought an Intel board and the problems went away.

The bandwidth test I posted reports 5384 MB/s.

NVIDIA’s bandwidth test reports:

Quick Mode

Host to Device Bandwidth for Pinned memory

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432               9486.3

Quick Mode

Device to Host Bandwidth for Pinned memory

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432               8840.0

Those are the fastest bandwidth numbers I’ve seen on the forum! What board is it? I’m assuming that it runs DDR3 memory, too, given the perf you are getting.

Well, it is running DDR3, but this exposes my gripe about NVIDIA’s bandwidth test: those figures are the sum of the bandwidths to each chip measured serially, not in parallel. The parallel test gives a little more than half those figures, and running the test on one chip gives almost exactly half.
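In numbers, taking host-to-device as the example (the per-chip figure is inferred from the “almost exactly half” observation):

$$\underbrace{\approx\!4743 + \approx\!4743}_{\text{one chip at a time, summed}} = 9486\ \mathrm{MB/s\ reported},\qquad \text{vs.}\ 5384\ \mathrm{MB/s}\ \text{with both chips at once}.$$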

Ah, that explains it.

And it gives another example of a DDR3 board not getting above ~6 GiB/s over the bus at once. Intel’s PCIe v2 implementation must have a bottleneck in there somewhere too, just a higher one than the NF200’s.