I have two GPUs in an Intel system based on the 5520 chipset. Each GPU is connected to one of the (reportedly) independent x16 PCIe ports of the I/O hub (IOH).
Using the "bandwidthTest" tool from the CUDA 4.0 SDK, the measured bandwidth to a single GPU is 5.8 GB/s when the other GPU is left idle. However, if I run the test on both GPUs simultaneously, the measured effective bandwidth drops to about 4.3 GB/s per GPU. (BTW, I instruct bandwidthTest to use pinned memory with transfer sizes upwards of 512 MB.)
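For reference, here is roughly how I reproduce the simultaneous measurement. This is a minimal sketch of my own (not the SDK's bandwidthTest source): one host thread per GPU, each timing repeated host-to-device copies out of a pinned buffer. Error checking is trimmed for brevity, and device numbering (0 and 1) is assumed.

```cuda
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

const size_t SIZE = 512ull << 20;  // 512 MB per transfer, as in my bandwidthTest runs
const int REPS = 10;               // average over several copies

void measure(int dev, double* gbps) {
    cudaSetDevice(dev);
    void *h, *d;
    cudaMallocHost(&h, SIZE);      // pinned host buffer
    cudaMalloc(&d, SIZE);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int i = 0; i < REPS; ++i)
        cudaMemcpy(d, h, SIZE, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    *gbps = (double)SIZE * REPS / (ms / 1e3) / 1e9;  // bytes / seconds, in GB/s
    cudaFreeHost(h);
    cudaFree(d);
}

int main() {
    // Launch both measurements concurrently so the two x16 ports are
    // transferring at the same time.
    double b0 = 0, b1 = 0;
    std::thread ta(measure, 0, &b0), tb(measure, 1, &b1);
    ta.join();
    tb.join();
    printf("GPU0: %.2f GB/s  GPU1: %.2f GB/s\n", b0, b1);
    return 0;
}
```

Running `measure` on one device alone (with the other idle) gives the single-GPU figure; the threaded version gives the concurrent one.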
I can’t be saturating the system’s QPI link, since it has an available bandwidth of 12.5 GB/s in a single direction (and nothing else is going on in the system). The Intel datasheet for the 5520 chipset explicitly states that each independent x16 port of the IOH can operate at 8 GB/s (in a single direction). I understand that I may not achieve the theoretical maximum in practice. However, I do not understand why concurrent data transfers on supposedly independent ports decrease effective bandwidth. Can anyone shed some light on this? What explains my loss of bandwidth?