very low PCIe bandwidth

Hi
This is on a machine with two GTX 280s and a GT 8600 in an EVGA 790i SLI board (the two GTX 280s sit in the outer x16 slots, which should both have 16 lanes). Any idea what the reason could be? By the way, device 1, the other GTX 280, shows the same bandwidth.

Running on…
device 0:GeForce GTX 280
Range Mode
Host to Device Bandwidth for Pageable memory

Transfer Size (Bytes) Bandwidth(MB/s)
1000000 1832.9
4000000 2370.7
7000000 2156.0
10000000 2010.8
13000000 1990.6
16000000 2010.8
19000000 2028.8

Range Mode
Device to Host Bandwidth for Pageable memory

Transfer Size (Bytes) Bandwidth(MB/s)
1000000 1446.7
4000000 1903.7
7000000 1976.6
10000000 2042.9
13000000 2062.2
16000000 2070.3
19000000 2017.7

Range Mode
Device to Device Bandwidth

Transfer Size (Bytes) Bandwidth(MB/s)
1000000 84282.6
4000000 108254.0
7000000 111001.5
10000000 112057.8
13000000 113705.9
16000000 113477.3
19000000 114670.7

&&&& Test PASSED

Press ENTER to exit…

Best regards
Ceearem

Can you run the test again with --memory=pinned ?

Also run bandwidthTest --memory=pinned with only the first GTX 280 installed, and then with just the two GTX 280s.

This is using pinned memory:

Running on…
device 0:GeForce GTX 280
Range Mode
Host to Device Bandwidth for Pinned memory

Transfer Size (Bytes) Bandwidth(MB/s)
1000000 3028.5
4000000 3077.3
7000000 3086.8
10000000 3088.3
13000000 3089.9
16000000 3091.4
19000000 3091.5

Range Mode
Device to Host Bandwidth for Pinned memory

Transfer Size (Bytes) Bandwidth(MB/s)
1000000 3035.0
4000000 3087.0
7000000 3087.4
10000000 3091.3
13000000 3117.9
16000000 3125.3
19000000 3094.8

Range Mode
Device to Device Bandwidth

Transfer Size (Bytes) Bandwidth(MB/s)
1000000 83316.5
4000000 108190.1
7000000 111025.2
10000000 112030.4
13000000 113153.0
16000000 113238.8
19000000 114582.6

&&&& Test PASSED

Press ENTER to exit…

I might test taking out GPUs, but that is not as easy as here at home, since the PC is part of our cluster at the university. I'll try to do that next week. Our admin is interested in the reason for the low transfer rates as well.

Best regards
Ceearem
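
The two runs posted so far can be put side by side. The sketch below only does the arithmetic on the rates at the largest transfer size (19000000 bytes); the variable and function names are made up for illustration. Pageable copies are normally staged through a driver-side buffer, which is the usual explanation for the gap, but the numbers here show only the ratio, not the cause.

```python
# Rates in MB/s at the 19000000-byte transfer size, copied from the
# pageable and pinned bandwidthTest outputs posted above.
pageable_h2d = 2028.8
pinned_h2d = 3091.5
pageable_d2h = 2017.7
pinned_d2h = 3094.8


def speedup(pinned, pageable):
    """Ratio of pinned to pageable transfer rate (hypothetical helper)."""
    return pinned / pageable


print(f"Host to device: pinned is {speedup(pinned_h2d, pageable_h2d):.2f}x pageable")
print(f"Device to host: pinned is {speedup(pinned_d2h, pageable_d2h):.2f}x pageable")
```

Pinned memory is roughly one and a half times faster in both directions on this machine, yet it still tops out near 3.1 GB/s, so the pageable staging cost is not what limits the transfer rate.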

Those bandwidth numbers look like the slots are running at x8 instead of x16. (75% of 4GB/sec ?)

Yeah, that is very consistent with PCI-E 1.0 x16 or PCI-E 2.0 x8.
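
The reasoning in the two replies above can be checked with a little arithmetic. This is a rough sketch, not a measurement tool: it uses the per-lane signalling rates of PCIe 1.x (2.5 GT/s) and PCIe 2.0 (5 GT/s) with 8b/10b encoding, and it assumes bandwidthTest's "MB" means 10^6 bytes. If the tool actually reports 2^20-byte units, the percentages shift by about 5%. The function name is hypothetical.

```python
# Per-lane signalling rate in GT/s. PCIe 1.x and 2.0 both use 8b/10b
# encoding, so only 8 of every 10 bits on the wire carry payload.
GT_PER_S = {"1.x": 2.5, "2.0": 5.0}


def theoretical_mb_per_s(gen, lanes):
    """Peak one-direction rate in MB/s (10**6 bytes), before packet overhead."""
    payload_bits_per_s = GT_PER_S[gen] * 1e9 * 8 / 10
    return payload_bits_per_s / 8 / 1e6 * lanes


# Pinned host-to-device rate at 19000000 bytes, from the output above.
measured = 3091.5

for gen, lanes in [("1.x", 16), ("2.0", 8), ("2.0", 16)]:
    peak = theoretical_mb_per_s(gen, lanes)
    print(f"PCIe {gen} x{lanes}: peak {peak:.0f} MB/s, "
          f"measured is {measured / peak:.0%} of peak")
```

PCIe 1.x x16 and PCIe 2.0 x8 have the same 4000 MB/s peak, so bandwidth alone cannot tell the two apart; it only shows that the link is not running as PCIe 2.0 x16.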

Since you are accessing the machine remotely, just pull the GT 8600. This should restore 16 PCIe lanes to your Teslas for a 790i chipset motherboard. If your admin needs console video, install a PCI video card. (Edit: with an NF200 bridge, the 790i chipset SHOULD allow 3x16 PCIe 2.0 lanes. Check to see what other PCIe devices you have installed.)

The 8600 is there to run the X server on that machine, so that kernels are not subject to a time limit before the X server complains; we would like to keep it. And I thought, as you said, that the 780i and 790i chipsets should support at least two x16 PCIe 2.0 cards plus a third card. I think both have a total of 62 lanes (32 for the main slots, handled by the northbridge, and the rest via the southbridge), so even if there are more PCI devices (such as the integrated sound card and network card on the mainboard, I guess), there should be enough lanes to run the GPUs at full speed, right?

Maybe I should try a BIOS update or something similar. I'll see.

Thanks for all comments here.

Ceearem

PCI and PCI Express are quite different despite the similar name. A PCI video card (or the IGP video that your motherboard lacks) works fine for X windows. Not high-resolution or anything, but the GT 8600 is not exactly high performance either. The main point is that PCI devices are on their own PCI bus and are not using PCIe lanes.

I also have an EVGA 790i motherboard (EVGA 132-YW-E179-A1 nForce 790i SLI FTW). Here is the lspci output that may help you to determine if you have any other PCIe devices besides the GPUs. The GTX 280 is the only PCIe component in this system at the moment:

[root@chicadee ~]# dmidecode | grep -A 1 EVGA
Manufacturer: EVGA
Product Name: 132-YW-E179-FTW

[root@chicadee ~]# lspci
00:00.0 Host bridge: nVidia Corporation Unknown device 0802 (rev b1)
00:00.1 RAM memory: nVidia Corporation Unknown device 0808 (rev a1)
00:00.2 RAM memory: nVidia Corporation Unknown device 0809 (rev a1)
00:00.3 RAM memory: nVidia Corporation Unknown device 080a (rev a1)
00:00.4 RAM memory: nVidia Corporation Unknown device 080b (rev a1)
00:00.5 RAM memory: nVidia Corporation Unknown device 080c (rev b1)
00:00.6 RAM memory: nVidia Corporation Unknown device 080d (rev a1)
00:00.7 RAM memory: nVidia Corporation Unknown device 080e (rev a1)
00:01.0 RAM memory: nVidia Corporation Unknown device 080f (rev a1)
00:01.1 RAM memory: nVidia Corporation Unknown device 0810 (rev a1)
00:01.2 RAM memory: nVidia Corporation Unknown device 0811 (rev a1)
00:01.3 RAM memory: nVidia Corporation Unknown device 0812 (rev a1)
00:01.4 RAM memory: nVidia Corporation Unknown device 0813 (rev a1)
00:01.5 RAM memory: nVidia Corporation Unknown device 0814 (rev a1)
00:01.6 RAM memory: nVidia Corporation Unknown device 081a (rev a1)
00:01.7 RAM memory: nVidia Corporation Unknown device 080e (rev a1)
00:02.0 PCI bridge: nVidia Corporation Unknown device 0815 (rev a1)
00:04.0 PCI bridge: nVidia Corporation Unknown device 0817 (rev a1)
00:09.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:0a.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a3)
00:0a.1 SMBus: nVidia Corporation MCP55 SMBus (rev a3)
00:0b.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:0b.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:0d.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:0e.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:0e.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:0f.1 Audio device: nVidia Corporation MCP55 High Definition Audio (rev a2)
00:11.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
00:12.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
00:14.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:15.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
01:00.0 VGA compatible controller: nVidia Corporation GT200 [GeForce GTX 280] (rev a1)
03:07.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
04:00.0 SATA controller: JMicron Technology Corp. 20360/20363 Serial ATA Controller (rev 03)
04:00.1 IDE interface: JMicron Technology Corp. 20360/20363 Serial ATA Controller (rev 03)
05:00.0 SATA controller: JMicron Technology Corp. 20360/20363 Serial ATA Controller (rev 03)
05:00.1 IDE interface: JMicron Technology Corp. 20360/20363 Serial ATA Controller (rev 03)
[root@chicadee ~]#

Here is the bandwidthTest and deviceQuery output:

[root@chicadee ~]# /usr/local/cuda_sdk/C/bin/linux/release/bandwidthTest --memory=pinned
Running on…
device 0:GeForce GTX 280
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5680.5

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5507.1

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 120952.6

&&&& Test PASSED

Press ENTER to exit…

[root@chicadee ~]# /usr/local/cuda_sdk/C/bin/linux/release/deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: “GeForce GTX 280”
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 1073020928 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED

Press ENTER to exit…
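
The plain lspci listing above shows which devices are present but not how many lanes each link negotiated. `lspci -vv` (usually run as root so the capability registers are readable) prints `LnkCap:` and `LnkSta:` lines for each PCIe device, and comparing the two shows directly whether a slot came up at x8 instead of x16. The sketch below is one way to do that; the function names are hypothetical, and the format shown in the comment illustrates the layout rather than output from either machine in this thread.

```python
import re
import shutil
import subprocess

# Matches the "LnkCap:" and "LnkSta:" lines of `lspci -vv`, which look like
#   LnkSta: Speed <n>GT/s, Width x<m>, ...
LINK_RE = re.compile(
    r"Lnk(Cap|Sta):.*?Speed\s+([\d.]+)\s*GT/s.*?Width\s+x(\d+)"
)


def parse_links(lspci_vv_text):
    """Map each device header line to its {"Cap": ..., "Sta": ...} entries.

    Each entry is a (speed in GT/s, lane count) tuple.
    """
    links = {}
    device = None
    for line in lspci_vv_text.splitlines():
        # Device header lines start in column 0; detail lines are indented.
        if line and not line[0].isspace():
            device = line.strip()
        match = LINK_RE.search(line)
        if match and device is not None:
            kind = match.group(1)
            speed = float(match.group(2))
            width = int(match.group(3))
            links.setdefault(device, {})[kind] = (speed, width)
    return links


def narrower_than_capable(links):
    """Devices whose negotiated width is below the width they advertise."""
    return [
        dev for dev, entry in links.items()
        if "Cap" in entry and "Sta" in entry and entry["Sta"][1] < entry["Cap"][1]
    ]


if __name__ == "__main__" and shutil.which("lspci"):
    result = subprocess.run(
        ["lspci", "-vv"], capture_output=True, text=True, check=False
    )
    for dev in narrower_than_capable(parse_links(result.stdout)):
        print("Link narrower than capability:", dev)
```

Only the lane count is compared here. The negotiated speed can also differ from the advertised speed, so the `Speed` values are worth reading alongside the width before drawing conclusions.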