Quad GPU setup: GPUs keep dropping off the bus and system crashes

Hi,

I initially posted this with Nvidia customer care, who referred me here. I will paste the details of the conversation up to the point where they realised I was running Linux.

########
Hi, I have had a long-running problem with my system where GPUs fall off the bus or do not work properly when running the AMBER simulation package under CUDA. The system frequently locks up and is unusable.

I have tested under CentOS 6.4 and 6.9. I have also tested under Ubuntu 14.04 using clean installs. Under both OSes the output of dmesg reports problems with the Intel HD audio driver (snd-hda-intel). An example is in the attached bug report.

As I initially suspected a hardware fault, I contacted the equipment manufacturer, who replaced:

  1. Intel processor
  2. 4 x GeForce 780s with 4 x GeForce 970s
  3. Asus motherboard (replaced 3 times)

Despite these replacements, the resulting errors have not changed. Consequently, I now think that it is likely to be a driver issue. If you could provide me with any assistance, I would be very grateful.
###################

In answer to your questions:

    • Did the AMBER simulation package work fine anytime previously?

It worked perfectly. Then there was a hardware issue and I had to replace the motherboard and PSU. Since then I have also replaced the processor and all the GPUs, so I now have four new GeForce 970 cards installed. However, the problem has remained, and the motherboard has now been replaced 3 times. I therefore need to rule out a driver-related issue.

    • Are you getting any error while running this application?

While the application is running, one of the GPUs freezes and then drops off the bus. Shortly afterwards the system freezes and needs to be rebooted. The same problem has occurred with 3 different motherboards, and under both CUDA 6.5 and 7.5.
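For anyone trying to confirm the same failure from the kernel log, here is a minimal Python sketch that picks out NVIDIA Xid messages and "fallen off the bus" lines. The message shape assumed by the regular expression (`NVRM: Xid (PCI:0000:03:00): 79, ...`) is an assumption based on how the driver usually logs these events; the exact format can differ between driver versions, and the sample line is illustrative rather than taken from the attached bug report. The function names are hypothetical.

```python
import re
import subprocess

# Assumed shape of an Xid line: "NVRM: Xid (PCI:0000:03:00): 79, <message>".
# Some driver versions omit the "PCI:" prefix, so it is optional here.
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return (pci_address, xid_code, line) for every Xid line in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2)), line.strip()))
    return events


def find_bus_drops(log_text):
    """Return every log line that mentions a GPU falling off the bus."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if "fallen off the bus" in line.lower()
    ]


def read_kernel_log():
    """Return the current dmesg output (may need root on some systems)."""
    return subprocess.run(["dmesg"], capture_output=True, text=True).stdout


# Illustrative line only, not taken from the attached bug report.
SAMPLE = "[ 1234.567890] NVRM: Xid (PCI:0000:03:00): 79, GPU has fallen off the bus."

for address, code, line in find_xid_events(SAMPLE):
    print(f"GPU at {address} reported Xid {code}: {line}")
```

To run it against the live system, pass `read_kernel_log()` to the two functions instead of `SAMPLE`. Comparing the PCI address across several crashes should show whether it is always the same slot or a different one each time.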

    • Is the issue happening only with this application or any other application also?

I have managed to recreate the error with a GPU stress-testing tool called gpu-burn, which stress tests all the GPUs simultaneously. You can read about it here:

http://wili.cc/blog/gpu-burn.html

I tested under two operating systems (Ubuntu and CentOS). Details follow:

A. CentOS 6.4
Under this OS, the system becomes unstable when all four cards are running under load. During boot there are also several IRQ errors related to the HDMI audio driver for the cards (snd-hda-intel). Blacklisting the driver module for the hardware that causes these errors results in more IRQ errors, this time from the USB system on the motherboard. As I had already run a number of tests using this OS, I wanted to try a different Linux distribution to see whether the same kind of issues occurred.

B. Ubuntu 14.04
Again on boot there are snd-hda-intel IRQ-related errors. After blacklisting the module, no other IRQ errors appeared. See nvidia-bug-report.log.gz.
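To double-check that a blacklist has actually taken effect after a reboot, a small Python sketch like the following can be used. It assumes the audio module is the one the kernel lists as `snd_hda_intel` in /proc/modules (the kernel reports module names with underscores); the function names and the sample text are hypothetical and only there for illustration.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_loaded(module, proc_modules_text):
    """Report whether a module is loaded, accepting hyphen or underscore names."""
    # /proc/modules uses underscores, so normalise "snd-hda-intel" first.
    return module.replace("-", "_") in loaded_modules(proc_modules_text)


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's list of currently loaded modules."""
    with open(path) as handle:
        return handle.read()


# Illustrative /proc/modules content, not copied from the real system.
SAMPLE = (
    "nvidia 10500000 40 - Live 0x0000000000000000\n"
    "snd_hda_intel 40000 0 - Live 0x0000000000000000\n"
)

print("snd-hda-intel loaded:", is_loaded("snd-hda-intel", SAMPLE))
```

On the real machine, `is_loaded("snd-hda-intel", read_proc_modules())` should return False once the blacklist is working. If it still returns True, the module is probably being pulled in from the initramfs or as a dependency of another module.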

For testing I used a program called gpu-burn to simultaneously stress test the GPUs. You can read about it here:

http://wili.cc/blog/gpu-burn.html

i) First I tested with 3 GPUs (see attached Fig-1)
The program continuously performs matrix multiplication on each GPU. You can see from Fig-1 (circled in green) that one of the GPUs freezes at 6279 proc’d, whilst the other 2 GPUs are still running at 4145680 and 2609373. The output of nvidia-smi shows that 1 GPU has an error.

ii) To test whether the error was due to a specific GPU, I replaced the GPU in PCIe slot 2 with a different card and changed the order in which the GPUs were installed (see attached Fig-2).
The exact same thing happens. One card freezes, while the others run OK. If I leave it running any longer, the whole system freezes and I have to press the hardware reset button.

iii) I tested with two GPU cards installed in the blue PCIe slots (see attached Fig-3).
The exact same thing happens. If I leave it running any longer, the whole system freezes and I have to press the hardware reset button.

iv) I tested with two GPU cards installed in the black PCIe slots (no picture).
The exact same thing happens. If I leave it running any longer, the whole system freezes and I have to press the hardware reset button.

    • Have you tried one card at a time in the system to check?

I have just tested all 4 GPUs individually with gpu-burn. All cards show no errors after 1 hour of testing. Note that in multi-GPU configurations the system generally becomes unstable in under 20 minutes.
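Since the multi-GPU runs fail within about 20 minutes, it may help to log per-GPU readings while gpu-burn is running, so the last values before a card disappears are on record. Below is a minimal Python sketch that polls `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` and reports which GPU indices are missing from the output. The chosen fields, the 5-second interval, and the function names are assumptions for illustration. Treat the missing-index check as a heuristic only, because once a GPU is lost nvidia-smi may print an error or exit with a failure instead of listing the remaining cards.

```python
import subprocess
import time

# Query fields passed to nvidia-smi; values are kept as strings because some
# cards or drivers report "[Not Supported]" for power.draw.
FIELDS = ["index", "temperature.gpu", "power.draw", "utilization.gpu"]


def parse_smi_csv(text):
    """Turn nvidia-smi CSV output (no header, no units) into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip error messages or unexpected lines
        rows.append(dict(zip(FIELDS, values)))
    return rows


def missing_gpus(rows, expected_count):
    """Return the GPU indices in range(expected_count) that were not reported."""
    seen = {int(row["index"]) for row in rows}
    return sorted(set(range(expected_count)) - seen)


def poll(expected_count=4, interval_seconds=5):
    """Print readings until a GPU goes missing or nvidia-smi fails."""
    command = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    while True:
        result = subprocess.run(command, capture_output=True, text=True)
        rows = parse_smi_csv(result.stdout)
        print(time.strftime("%H:%M:%S"), rows)
        lost = missing_gpus(rows, expected_count)
        if result.returncode != 0 or lost:
            print("nvidia-smi exit code:", result.returncode, "missing GPUs:", lost)
            break
        time.sleep(interval_seconds)
```

Calling `poll()` in a second terminal while gpu-burn runs, with its output redirected to a file on disk, should leave a record that survives the hard reset.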

nvidia-bug-report.log.gz (44 KB)

Not sure how to attach files on this site. OK, figured it out!

anyone?

What is the wattage rating on the power supply for your setup?