One of two 1080Ti GPUs not detected after CUDA failure

I have two 1080Ti GPUs, and both were working fine.
However, recently after about 10 hours of heavy use (deep learning with the Darknet framework), Darknet stopped and reported a CUDA error, and nvidia-smi showed “ERR!” for the GPU Fan percentage and power usage of GPU:1.

I restarted the machine, and ever since then only one GPU is listed by nvidia-smi (the command also takes 3-4 seconds to run, whereas it had been instantaneous before).

Can it be a hardware issue?

Output of dmesg |grep NVRM

$ dmesg |grep NVRM 
[    1.245254] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  384.130  Wed Mar 21 03:37:26 PDT 2018 (using threaded interrupts)
[    3.947596] NVRM: GPU at PCI:0000:02:00: GPU-7f718c05-43f7-bf45-4b40-7e10cb5bb811
[    3.947598] NVRM: GPU Board Serial Number: 
[    3.947600] NVRM: Xid (PCI:0000:02:00): 62, 1d32(3818) 00000000 00000000
[   48.336050] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80042000
[   48.340309] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[   48.340363] NVRM: rm_init_adapter failed for device bearing minor number 1
[   63.852084] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
[   63.856771] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[   63.856799] NVRM: rm_init_adapter failed for device bearing minor number 1
[   68.188218] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
[   68.192759] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[   68.192791] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 1741.841601] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
[ 1741.846237] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[ 1741.846266] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 1770.605891] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
[ 1770.610342] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[ 1770.610356] NVRM: rm_init_adapter failed for device bearing minor number 1
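
For readers who want to summarize a log like the one above, here is a minimal sketch that counts Xid events per PCI address. The function name `count_xids` and the regular expression are my own assumptions, written against the line format shown in this post; other driver versions may log Xid events differently. The meaning of each Xid code should be looked up in NVIDIA's Xid documentation; the script only counts them.

```python
import re
from collections import Counter

# Matches lines in the format shown in this thread, for example:
# [   48.336050] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80042000
# The pattern is an assumption based on that format, not an official grammar.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def count_xids(dmesg_text):
    """Return a Counter keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Three lines copied from the dmesg output above, used as sample input.
SAMPLE = """\
[    3.947600] NVRM: Xid (PCI:0000:02:00): 62, 1d32(3818) 00000000 00000000
[   48.336050] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80042000
[   63.852084] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
"""

for (pci, xid), n in sorted(count_xids(SAMPLE).items()):
    print(f"{pci}  Xid {xid}: {n} event(s)")
```

To run it on a real log, pass the text of `dmesg` to `count_xids` instead of `SAMPLE`.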

Output of lspci | grep NVIDIA

01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)

Output of nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 34%   55C    P8    19W / 250W |    420MiB / 11169MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1097      G   /usr/lib/xorg/Xorg                           257MiB |
|    0      1851      G   compiz                                       160MiB |
+-----------------------------------------------------------------------------+
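
The mismatch between the two outputs above (two cards on the PCI bus, one in nvidia-smi) can also be checked mechanically. This is only a sketch: the helper names are hypothetical, the regular expressions are written against the output formats shown in this post, and it assumes all GPUs sit in a single PCI domain so that the domain prefix printed by nvidia-smi can be ignored.

```python
import re

# lspci line format assumed: "02:00.0 VGA compatible controller: NVIDIA ..."
LSPCI_GPU_RE = re.compile(
    r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s+(?:VGA compatible|3D) controller: NVIDIA",
    re.IGNORECASE,
)
# nvidia-smi Bus-Id format assumed: "00000000:01:00.0" (domain:bus:device.function)
SMI_BUS_RE = re.compile(
    r"[0-9a-f]{8}:([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])", re.IGNORECASE
)


def lspci_gpu_addresses(lspci_text):
    """Bus addresses of NVIDIA display controllers listed by lspci."""
    found = set()
    for line in lspci_text.splitlines():
        match = LSPCI_GPU_RE.match(line.strip())
        if match:
            found.add(match.group(1).lower())
    return found


def smi_gpu_addresses(smi_text):
    """Bus addresses from the Bus-Id column of nvidia-smi, domain prefix dropped."""
    return {addr.lower() for addr in SMI_BUS_RE.findall(smi_text)}


def missing_gpus(lspci_text, smi_text):
    """GPUs visible on the PCI bus that the driver does not list."""
    return sorted(lspci_gpu_addresses(lspci_text) - smi_gpu_addresses(smi_text))
```

With the two outputs from this post as input, `missing_gpus` reports the card at `02:00.0`, the same address that appears in the Xid messages.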

nvidia-smi output upon the error:

https://img42.com/WDccZ+

If you haven’t done this already, power down the machine, let things fully cool off, then power up.

Yes, it could be a HW issue.

Thank you, txbob. I did try that already, but unfortunately it didn’t help. Let me know if there is anything else I can try.

One thing you can try is physically swapping the two GPUs between their respective PCIe slots. Does the problem follow the card or is it correlated with a particular PCIe slot?

While you have the box open, double-check the power supply cables to the GPUs, and watch out for anything obstructing airflow around the GPUs (including excessive amounts of dust). Are the fans turning on both GPUs? I think it’s possible that on Pascal-family parts the fans are turned off when the GPU is idle, so that observation might not help gain additional information.
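
If you want to check fan speed, temperature, and power draw without watching the cards, something along these lines may help. It is a sketch under assumptions: it relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi, the helper names are hypothetical, and it treats any value that does not parse as a number as "unreadable" because I am not sure which exact string the driver prints for a failed sensor.

```python
import csv
import io
import subprocess

FIELDS = ["index", "pci.bus_id", "fan.speed", "temperature.gpu", "power.draw"]
SENSOR_FIELDS = FIELDS[2:]


def query_gpus():
    """Run nvidia-smi and return its CSV output (requires the NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpu_query(csv_text):
    """Parse the CSV into dicts; sensor values that are not numbers become None."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        row = dict(zip(FIELDS, (value.strip() for value in record)))
        for key in SENSOR_FIELDS:
            try:
                row[key] = float(row[key])
            except ValueError:
                row[key] = None
        rows.append(row)
    return rows


def unreadable_sensors(rows):
    """Map GPU index to the list of sensors that could not be read."""
    result = {}
    for row in rows:
        bad = [key for key in SENSOR_FIELDS if row[key] is None]
        if bad:
            result[row["index"]] = bad
    return result
```

Calling `unreadable_sensors(parse_gpu_query(query_gpus()))` periodically during a long training run would show whether a card starts reporting bad fan or power readings before it drops out.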

Is this machine being operated in any sort of harsh environment (e.g. high ambient temperature, high humidity, vibrations, high altitude, “noisy” electrical supply)?

Are these GTX 1080 Ti using NVIDIA reference clocks, or are these vendor overclocked models (may use “superclocked” or some such as part of the name)?

https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-1080-ti/specifications
Graphics Clock (MHz) 1480
Processor Clock (MHz) 1582

Hi njuffa,

I have the machine running in a normal office, nothing extreme.

I swapped the GPUs between the slots, and the error persists on the same GPU. When I start the machine with the screen connected to the working GPU, it boots Linux but does not detect the other GPU. When I connect the screen to the other GPU, I see strange artifacts and Ubuntu fails to boot:

https://img42.com/b-oRr+

It is hard to be sure when making a remote diagnosis based on limited information, but it appears your GPU may have suffered permanent damage of some sort. If you have access to a local expert you can consult, you might want them to take a closer look at your setup.

Generally speaking, consumer graphics cards like the GTX 1080 Ti are not designed for 24/7 operation under full load. Those who choose to operate their GPUs in this fashion anyway for cost reasons should take into account the risk of equipment failure. Some would argue that even considering replacement costs for failed parts, their total cost still compares favorably to deploying NVIDIA’s professional solutions. That is a trade-off everybody has to evaluate for themselves.

When using consumer GPUs in any mission-critical function, I would recommend at least avoiding vendor-overclocked parts (which exploit much of the engineering margin built into the GPUs to offer a faster product), and paying particular attention to adequate power supply and cooling.

Hi kisantal.mate

How long has it been since your machine was first booted?

Two of the four cards in our lab failed within one month of upgrading to CUDA 9 and cuDNN 7.
Before that, those cards had been working 24/7 for about a year.