One of two 1080Ti GPUs not detected after CUDA failure

kisantal.mate · April 23, 2018, 8:12pm

I have two 1080Ti GPUs, and both were working fine.
However, after about 10 hours of heavy use (deep learning with the Darknet framework), Darknet stopped and reported a CUDA error, and nvidia-smi showed “ERR!” for the fan percentage and power usage of GPU:1.

I restarted the machine, and since then nvidia-smi lists only one GPU (the command also takes 3-4 seconds to run, whereas it used to be instantaneous).

Can it be a hardware issue?

Output of dmesg |grep NVRM

$ dmesg |grep NVRM 
[    1.245254] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  384.130  Wed Mar 21 03:37:26 PDT 2018 (using threaded interrupts)
[    3.947596] NVRM: GPU at PCI:0000:02:00: GPU-7f718c05-43f7-bf45-4b40-7e10cb5bb811
[    3.947598] NVRM: GPU Board Serial Number: 
[    3.947600] NVRM: Xid (PCI:0000:02:00): 62, 1d32(3818) 00000000 00000000
[   48.336050] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80042000
[   48.340309] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[   48.340363] NVRM: rm_init_adapter failed for device bearing minor number 1
[   63.852084] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
[   63.856771] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[   63.856799] NVRM: rm_init_adapter failed for device bearing minor number 1
[   68.188218] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
[   68.192759] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[   68.192791] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 1741.841601] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
[ 1741.846237] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[ 1741.846266] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 1770.605891] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
[ 1770.610342] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[ 1770.610356] NVRM: rm_init_adapter failed for device bearing minor number 1
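
A log like the one above is easier to read once the Xid events are counted per device. The following is a minimal Python sketch, not an official tool: the function name `summarize_xids` and the `XID_NOTES` table are made up for illustration, and the short descriptions are taken from memory of NVIDIA's Xid documentation, so check them against the current docs before drawing conclusions.

```python
import re
from collections import Counter

# Matches lines such as:
# [   48.336050] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80042000
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")

# Hypothetical lookup table; descriptions are an assumption to be verified
# against NVIDIA's Xid documentation.
XID_NOTES = {
    32: "invalid or corrupted push buffer stream",
    62: "internal micro-controller halt",
}

def summarize_xids(dmesg_text):
    """Count Xid events per (PCI address, Xid code) in dmesg text."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = XID_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts

# A few lines copied from the dmesg output in this post.
sample = """\
[    3.947600] NVRM: Xid (PCI:0000:02:00): 62, 1d32(3818) 00000000 00000000
[   48.336050] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80042000
[   48.340309] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[   63.852084] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
"""

for (pci, xid), n in sorted(summarize_xids(sample).items()):
    note = XID_NOTES.get(xid, "see NVIDIA Xid documentation")
    print(f"{pci}  Xid {xid} x{n}  ({note})")
```

To run it on a live system, the text could be taken from `dmesg` (for example via `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, which may need root). Here every event points at the same device, PCI 0000:02:00.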

Output of lspci | grep NVIDIA

01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)

Output of nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 34%   55C    P8    19W / 250W |    420MiB / 11169MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1097      G   /usr/lib/xorg/Xorg                           257MiB |
|    0      1851      G   compiz                                       160MiB |
+-----------------------------------------------------------------------------+
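
The two outputs above can be compared mechanically: lspci shows which NVIDIA controllers are on the PCIe bus, while nvidia-smi shows which ones the driver managed to initialize. Below is a rough Python sketch of that comparison. The helper names are hypothetical, and it assumes the bus IDs come from `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader` and that lspci prints addresses without a domain prefix, as it does in this post.

```python
import re

# lspci lines look like:
# 02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
LSPCI_RE = re.compile(
    r"([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) (?:VGA compatible|3D) controller: NVIDIA"
)

def lspci_nvidia_gpus(lspci_text):
    """Return bus addresses (e.g. '01:00.0') of NVIDIA display controllers."""
    return {m.group(1) for line in lspci_text.splitlines()
            if (m := LSPCI_RE.match(line))}

def smi_gpus(bus_ids_text):
    """Normalize nvidia-smi bus IDs ('00000000:01:00.0') to lspci form ('01:00.0')."""
    out = set()
    for line in bus_ids_text.splitlines():
        line = line.strip().lower()
        if line:
            out.add(line.split(":", 1)[1])  # drop the PCI domain prefix
    return out

# Sample data copied from the outputs in this post.
lspci_sample = """\
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
"""
smi_sample = "00000000:01:00.0\n"

missing = lspci_nvidia_gpus(lspci_sample) - smi_gpus(smi_sample)
print("On the bus but not initialized by the driver:", sorted(missing))
```

For the data in this post the missing device is 02:00.0, the same address that appears in the Xid and RmInitAdapter messages.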

kisantal.mate · April 23, 2018, 8:15pm

nvidia-smi output upon the error:

Robert_Crovella · April 23, 2018, 10:08pm

If you haven’t done this already, power down the machine, let things fully cool off, then power up.

Yes, it could be a HW issue.

kisantal.mate · April 23, 2018, 10:15pm

Thank you txbob. I already tried that, but unfortunately it didn’t help. Let me know if there is anything else I can try.

njuffa · April 24, 2018, 5:29am

One thing you can try is physically swapping the two GPUs between their respective PCIe slots. Does the problem follow the card or is it correlated with a particular PCIe slot?

While you have the box open, double-check the power supply cables to the GPUs, and watch out for anything obstructing airflow around the GPUs (including excessive amounts of dust). Are the fans turning on both GPUs? I think it’s possible that on Pascal-family parts the fans are turned off when the GPU is idle, so that observation might not help gain additional information.

Is this machine being operated in any sort of harsh environment (e.g. high ambient temperature, high humidity, vibrations, high altitude, “noisy” electrical supply)?

Are these GTX 1080 Ti using NVIDIA reference clocks, or are these vendor overclocked models (may use “superclocked” or some such as part of the name)?

https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-1080-ti/specifications
Graphics Clock (MHz) 1480
Processor Clock (MHz) 1582

kisantal.mate · April 24, 2018, 6:04pm

Hi njuffa,

I have the machine running in a normal office, nothing extreme.

I swapped the GPUs, and the error persists on the same GPU. When I start the machine with the screen connected to the working GPU, it boots Linux but does not detect the other GPU. When I connect the screen to the other GPU, I see strange artifacts and Ubuntu fails to boot:

njuffa · April 24, 2018, 6:20pm

It is hard to be sure when making a remote diagnosis based on limited information, but it appears your GPU may have suffered permanent damage of some sort. If you have access to a local expert you can consult, you might want them to take a closer look at your setup.

Generally speaking, consumer graphics cards like the GTX 1080 Ti are not designed for 24/7 operation under full load. Those who choose to operate their GPUs in this fashion anyway for cost reasons should take into account the risk of equipment failure. Some would argue that, even considering replacement costs for failed parts, their total cost still compares favorably to deploying NVIDIA’s professional solutions. That is a trade-off everybody has to evaluate for themselves.

When using consumer GPUs in any mission-critical function, I would recommend at least avoiding vendor-overclocked parts (which exploit much of the engineering margin built into the GPUs to offer a faster product), and paying particular attention to adequate power supply and cooling.

dwSun · April 27, 2018, 9:01am

Hi kisantal.mate

How long has it been since your machine was first booted?

2 of the 4 cards in our lab failed within 1 month of upgrading to CUDA 9 and cuDNN 7.
Before that, those cards had been working 24/7 for about 1 year.
