Where has all the bandwidth gone? Bandwidth loss with concurrent sends on "independent" PCIe

I have two GPUs in an Intel system using the 5520 chipset. Each GPU is connected to its own (reportedly) independent x16 PCIe port on the I/O hub (IOH).

Using the tool “bandwidthTest” from the CUDA 4.0 SDK, the available bandwidth to a single GPU is 5.8 GB/s when the other GPU is left idle. However, if I test both GPUs simultaneously, the measured effective bandwidth drops to about 4.3 GB/s to each GPU. (BTW, I instruct bandwidthTest to use pinned memory and transfer sizes upwards of 512 MB.)
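For reference, here is a minimal hand-rolled sketch of the same measurement using the CUDA runtime API (this is not the SDK's bandwidthTest source, just an illustration of the setup): pinned host memory via cudaHostAlloc, a 512 MB host-to-device copy repeated a few times, and events to time it. Error checking is omitted for brevity. Running one instance per GPU (passing device IDs 0 and 1 to two simultaneous processes) reproduces the concurrent case.

```
// Hand-rolled host-to-device bandwidth check with pinned memory (illustrative
// only; not the SDK's bandwidthTest source). Pass the device ID as argv[1].
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const int device = (argc > 1) ? atoi(argv[1]) : 0;
    const size_t bytes = 512u << 20;          // 512 MB per copy
    const int reps = 10;

    cudaSetDevice(device);

    void *h_buf = 0, *d_buf = 0;
    cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);   // pinned host memory
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Device %d: %.2f GB/s host-to-device\n",
           device, (double)bytes * reps / (ms / 1000.0) / 1e9);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```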

I can’t be saturating the QPI bus of my system because it has an available bandwidth (in a single direction) of 12.5 GB/s (and nothing else is going on in the system). The Intel datasheet for the 5520 chipset explicitly states that each independent x16 port of the IOH can operate at 8 GB/s (in a single direction). I understand that I may not be able to achieve the maximum theoretical bandwidth in practice. However, I do not understand why concurrent data transfers on supposedly independent ports decrease effective bandwidth. Can anyone shed some light on this? What explains my loss of bandwidth?

Things rarely scale perfectly.

Indeed. I would still like to understand the “why” of it all though.

Ah, I figured it out. The DDR3 memory in my system has a maximum throughput of 10666 MB/s per DIMM (DDR3-1333). That isn’t fast enough to feed two GPUs at 5.8 GB/s each, which together demand 11.6 GB/s. My system has three memory channels, but there will still be occasions when both GPUs need data from the same DIMM, hence the observed slowdown. The good news is that the Tylersburg IOH works as Intel says it should: each x16 port is independent.
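For anyone who wants to reproduce the concurrent measurement in a single process rather than two copies of bandwidthTest, here is a rough sketch that drives both GPUs from one host thread (relying on the CUDA 4.0 runtime's multi-device support), with one pinned buffer, device buffer, and stream per GPU. The numbers line up with the explanation above: two GPUs pulling 5.8 GB/s each demand 11.6 GB/s from host memory, more than the ~10.7 GB/s a single DDR3-1333 DIMM can supply.

```
// Concurrent host-to-device copies to two GPUs from one host thread
// (requires the CUDA 4.0 runtime's multi-device support). Illustrative only;
// error checking and cleanup omitted.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 512u << 20;   // 512 MB per copy
    const int reps = 10;
    const int nDev = 2;

    void *h[nDev], *d[nDev];
    cudaStream_t s[nDev];
    cudaEvent_t beg[nDev], end[nDev];

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaHostAlloc(&h[i], bytes, cudaHostAllocDefault);  // pinned buffers
        cudaMalloc(&d[i], bytes);
        cudaStreamCreate(&s[i]);
        cudaEventCreate(&beg[i]);
        cudaEventCreate(&end[i]);
    }

    // Queue every copy before synchronizing so both GPUs transfer at once.
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaEventRecord(beg[i], s[i]);
        for (int r = 0; r < reps; ++r)
            cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
        cudaEventRecord(end[i], s[i]);
    }

    // Per-GPU bandwidth while the other GPU was also transferring.
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaEventSynchronize(end[i]);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, beg[i], end[i]);
        printf("GPU %d: %.2f GB/s host-to-device\n",
               i, (double)bytes * reps / (ms / 1000.0) / 1e9);
    }
    return 0;
}
```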