I’ve set up a machine for running CUDA code on 4x GTX 680 cards using the Z77 platform (vs. X79, which is not officially supported by NVIDIA at PCIe 3.0 speed). The machine is built around a mobo (Asus P8Z77 WS, latest BIOS 3205) featuring four PCIe 3.0 16x slots managed by a PLX 8747 switch, which splits the 16x PCIe 3.0 lanes from the CPU into four 8x links. The machine has been running rock stable for weeks with a single GTX 680 card plugged into the first PCIe 3.0 slot. It is powered by a 1350W PSU with the extra Molex plugged into the mobo. The integrated Intel GPU is used for display, and no screen is plugged into the GTX 680 card(s).
In short: a PCIe 3.0 card plugged into the 4th slot, and only that one, renders the nvidia driver unstable. It happens neither with PCIe 3.0 cards plugged into slots 1 to 3 nor with PCIe 2.0 cards in slot 4.
Moving to the 4x GTX 680 setup, I started to get these nvidia driver Xid 59 errors within minutes, always reported for the same card. It turns out that all the cards work fine as long as they are plugged into any of the first three slots, but not the 4th. The error happens after a while just by running ‘nvidia-smi -l 1’, or earlier by loading the card(s) with a cuda memcheck test. Interestingly, the problem never occurs within hours of testing with PCIe 2.0 cards (GTX 580, Quadro 4000). So it seems that the stability of the nvidia driver is challenged by running a PCIe 3.0 card in this 4th slot, and only that one, which is the most distant from the PLX switch.
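To check whether the link to the suspect card actually trains at PCIe 3.0 speed (8 GT/s) or has been downshifted, the negotiated speed can be read from the LnkSta line that `lspci -vv` prints. A small sketch (the helper function and the bus address 05:00.0, taken from my dmesg log, are just for illustration):

```shell
#!/bin/sh
# Pull the negotiated link speed (e.g. "8GT/s") out of an lspci LnkSta line.
link_speed() {
    grep -o 'Speed [^,]*' | head -n 1 | cut -d' ' -f2
}

# On the real machine (bus address from the dmesg log above), run as root:
#   lspci -vv -s 05:00.0 | grep LnkSta | link_speed
# Quick self-check on a sample LnkSta line:
echo 'LnkSta: Speed 8GT/s, Width x8, TrErr- Train-' | link_speed
```

Comparing LnkSta against LnkCap for the GPU and for each PLX downstream port would show whether slot 4 is even holding an 8 GT/s link before the Xid 59 hits.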
Typically, for a single card plugged in the 4th slot one gets:
[ 9.386032] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 310.19 Thu Nov 8 00:52:03 PST 2012
[ 54.368022] nvidia 0000:05:00.0: irq 66 for MSI/MSI-X
[ 54.999320] NVRM: GPU at 0000:05:00: GPU-3c52f841-dcb9-73e0-7609-361293e889d3
[ 348.596567] NVRM: Xid (0000:05:00): 59, 0098(209c) 0400c287 12c93713
[ 393.437650] [sched_delayed] sched: RT throttling activated
[ 686.339589] nvidia 0000:05:00.0: irq 66 for MSI/MSI-X
[ 690.486892] NVRM: RmInitAdapter failed! (0x27:0x38:1077)
[ 690.486906] NVRM: rm_init_adapter(0) failed
and then nvidia-smi reports:
NVIDIA: could not open the device file /dev/nvidia0 (Input/output error).
NVIDIA-SMI has failed because it couldn’t communicate with NVIDIA driver. Make sure that latest NVIDIA driver is installed and running.
When the Xid 59 error happens, the system does not necessarily freeze, depending on the kernel and nvidia driver versions; however the whole system clearly becomes unusable for CUDA computing. I’ve been testing various combinations of kernels/distros (only Ubuntu so far: 8.04, 10.04, and here 12.04 with a 3.5 Linux kernel) and underclock settings for the hardware, to no avail. The longest stable test I managed was about 45 minutes with all 4 GTX 680 cards loaded with cuda memtests using Ubuntu 8.04 and the 295.59 drivers, but once frozen the machine could not be recovered except by a hard reboot. Given that, and the fact that PCIe 2.0 cards are able to work for hours, I don’t think this is a hardware failure of the mobo in particular; again, each GTX 680 individually runs fine.
- What does this Xid 59 error really mean? I’ve never seen it reported anywhere before.
- Why only with the 4th PCIe slot? Could it be that the signal gets too weak, or that latency increases too much (especially through the PLX switch), which confuses the nvidia driver?
- As a temporary workaround, so I can use all 4 cards and not only 3: is it possible to force PCIe 2.0 on all slots, or on a given slot, until the nvidia driver fixes the problem?
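Besides looking for a BIOS option to cap the PCIe generation, one approach I’m considering is writing the Target Link Speed field of the Link Control 2 register (CAP_EXP+0x30, low nibble: 1 = 2.5 GT/s, 2 = 5 GT/s) on the PLX downstream port above the GPU with setpci, then setting the Retrain Link bit (Link Control, CAP_EXP+0x10, bit 5). A sketch only — I haven’t verified this on my board, and the bridge address 04:08.0 below is hypothetical (the real one would come from `lspci -t`):

```shell
#!/bin/sh
# Sketch, untested assumption: cap one slot at PCIe 2.0 (5 GT/s) via setpci.
# BRIDGE is the PLX downstream port above the GPU -- 04:08.0 is a placeholder.
BRIDGE=${BRIDGE:-04:08.0}

# Print (not run) the commands so they can be reviewed first.
gen2_cmds() {
    echo "setpci -s $BRIDGE CAP_EXP+30.w=2:f"    # Target Link Speed = 5 GT/s
    echo "setpci -s $BRIDGE CAP_EXP+10.w=20:20"  # set Retrain Link bit
}

gen2_cmds
# To apply, as root:  gen2_cmds | sh
```

If the Xid 59 errors disappear with the link held at 5 GT/s, that would also support the theory that slot 4 only misbehaves at PCIe 3.0 speed.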