One of two GPUs disappears from nvidia-smi

Hi,

I am running Ubuntu 17.04 with NVIDIA driver 387.12 and two 1080 Ti cards. When I first boot the machine, nvidia-smi sees both GPUs, but after some hours of idling one of the GPUs disappears from nvidia-smi and becomes unusable. After a reboot both GPUs show up again, only for one to vanish after a while. What could be the problem? Below is the output of some common diagnostic commands, in case you can make sense of it.

lspci | grep VGA 
17:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
dmesg | grep -i nvrm | head
[170045.159496] NVRM: rm_init_adapter failed for device bearing minor number 1
[170051.110824] NVRM: RmInitAdapter failed! (0x24:0x65:1076)
[170051.110881] NVRM: rm_init_adapter failed for device bearing minor number 1
[170057.117426] NVRM: RmInitAdapter failed! (0x24:0x65:1076)
[170057.117504] NVRM: rm_init_adapter failed for device bearing minor number 1
[170063.129279] NVRM: RmInitAdapter failed! (0x24:0x65:1076)
[170063.129351] NVRM: rm_init_adapter failed for device bearing minor number 1
[170069.202843] NVRM: RmInitAdapter failed! (0x24:0x65:1076)
[170069.202919] NVRM: rm_init_adapter failed for device bearing minor number 1
[170075.126983] NVRM: RmInitAdapter failed! (0x24:0x65:1076)
...
nvidia-smi 
Wed Dec 20 09:41:41 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.12                 Driver Version: 387.12                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
|  0%   29C    P5    22W / 300W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
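To pin down exactly when the GPU drops (and correlate it with the NVRM errors in dmesg), a simple watchdog loop could timestamp the visible GPUs once a minute. This is only a sketch; the log path and interval are arbitrary choices, not anything from this thread:

```shell
#!/bin/sh
# Sketch: periodically record which GPUs nvidia-smi can see, so the
# moment of disappearance can be matched against dmesg timestamps.
# /var/tmp/gpu-watch.log and the 60 s interval are assumptions.
while true; do
    echo "=== $(date -Is) ===" >> /var/tmp/gpu-watch.log
    # 'nvidia-smi -L' prints one line per visible GPU; stderr is kept
    # so driver errors at the moment of failure are captured too.
    nvidia-smi -L >> /var/tmp/gpu-watch.log 2>&1
    sleep 60
done
```

Running this in the background (or as a cron/systemd unit) would show whether the card vanishes abruptly or after a specific idle interval.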

Overheating and improper power delivery are two common causes, but neither seems likely if the GPU is just idling. Platform instability is also a possibility.

I would recommend trying a later r387 driver such as this one:

http://www.nvidia.com/download/driverResults.aspx/127149/en-us

You might also try updating your system BIOS to the latest one offered by your motherboard manufacturer.

Updating the SBIOS to the latest version, as txbob says, is often a good idea; check your system vendor’s recommendations.

From the description, this sounds like something related to hardware, rather than software. Is it always the same GPU that disappears after a while? If so, swap the two GPUs to see whether the failures stay with the PCIe slot or follow the card. Both slots are PCIe gen3 x16 slots, correct?

In the best case, the GPUs are not seated in their slots properly (make sure brackets are secured with screws, latches, etc., to prevent mechanical issues that may turn into electrical signaling issues) or the connector is dirty. In the worst case, either the slot or the GPU may be damaged (e.g. bent connector fingers, electrostatic discharge damage, hairline cracks in the motherboard).

Myriad failure modes are possible, and it is usually not possible to diagnose such cases remotely. We have tried many times in these forums, with maybe a 20% success rate.

Thank you all for the responses. I’ve upgraded my SBIOS to the latest version (1.4 for the ASROCK x299 SLI/killer) and upgraded the NVIDIA driver to 387.34. I’ve also tried swapping the two cards and determined that it is always the same slot that loses its GPU, so we can rule out a faulty card. Also of note: both cards were used heavily for training (90%+ utilization for 4+ hours, so heat and PSU are likely fine) with no failure, yet one of the cards disappeared from nvidia-smi about 40 minutes after training completed, i.e. while the cards were idle. This pattern has repeated reliably: the disappearance never happens during training, only afterwards, when the cards are idling. It looks very much like the issue here https://devtalk.nvidia.com/default/topic/1011704/nvidia-smi-suddenly-loss-one-of-three-cards/ , and I suspect it’s some power management issue – a suspended card unable to wake up.

Are there logs that show power management actions? Also, I saw the suggestion to add the kernel parameter pcie_port_pm=off – where do I add this? I’m running Ubuntu 17.04.

Thank you again!

Power management of the type you are describing (suspend, resume) is an OS-level function. You should investigate how Ubuntu 17.04 does power management and look for options to disable it, if you want to explore.
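One place to look is the kernel’s PCI runtime power management state in sysfs. As a sketch (the bus addresses 0000:17:00.0 and 0000:65:00.0 are taken from the lspci output earlier in the thread; adjust them for your system):

```shell
#!/bin/sh
# Sketch: inspect runtime power management for both GPUs.
# "auto" means the kernel may runtime-suspend the device when idle;
# "on" keeps it awake. Bus addresses are from the lspci output above.
for dev in 0000:17:00.0 0000:65:00.0; do
    echo "$dev: $(cat /sys/bus/pci/devices/$dev/power/control)"
done

# To stop the kernel from runtime-suspending one of the devices
# (uncomment to apply; requires root):
# echo on | sudo tee /sys/bus/pci/devices/0000:65:00.0/power/control
```

If the disappearing GPU shows "auto" and forcing it to "on" makes the problem go away, that would support the suspend/resume theory. Note this setting does not persist across reboots.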

Adding a kernel parameter is also an OS-specific task. I suggest you look into how that works on Ubuntu 17.04; you should find plenty of help on the internet, or on sites dedicated to Ubuntu, such as askubuntu.com
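For reference, on Ubuntu a kernel parameter such as pcie_port_pm=off is normally added through GRUB. A sketch of the usual procedure (the existing "quiet splash" defaults shown below are an assumption; keep whatever your file already contains):

```shell
# 1. Edit /etc/default/grub and append the parameter to the
#    GRUB_CMDLINE_LINUX_DEFAULT line, e.g.:
#      GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_port_pm=off"

# 2. Regenerate the GRUB configuration and reboot:
sudo update-grub
sudo reboot

# 3. After the reboot, confirm the parameter is active on the
#    running kernel's command line:
grep pcie_port_pm /proc/cmdline
```

If step 3 prints the command line containing pcie_port_pm=off, the parameter took effect.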