I have two 1080Ti GPUs, and both of them were working fine.
However, after about 10 hours of heavy use recently (deep learning with the Darknet framework), Darknet stopped and reported a CUDA error, and nvidia-smi showed “ERR!” for the fan percentage and power usage of GPU 1.
I restarted the machine, and since then nvidia-smi lists only one GPU (the command also takes 3-4 seconds to run, whereas it was instantaneous before).
Could this be a hardware issue?
Output of dmesg | grep NVRM
$ dmesg | grep NVRM
[ 1.245254] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.130 Wed Mar 21 03:37:26 PDT 2018 (using threaded interrupts)
[ 3.947596] NVRM: GPU at PCI:0000:02:00: GPU-7f718c05-43f7-bf45-4b40-7e10cb5bb811
[ 3.947598] NVRM: GPU Board Serial Number:
[ 3.947600] NVRM: Xid (PCI:0000:02:00): 62, 1d32(3818) 00000000 00000000
[ 48.336050] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80042000
[ 48.340309] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[ 48.340363] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 63.852084] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
[ 63.856771] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[ 63.856799] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 68.188218] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
[ 68.192759] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[ 68.192791] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 1741.841601] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
[ 1741.846237] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[ 1741.846266] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 1770.605891] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
[ 1770.610342] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[ 1770.610356] NVRM: rm_init_adapter failed for device bearing minor number 1
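To make the log above easier to summarize, here is a rough Python sketch that tallies the Xid events per PCI address. The names `XID_RE` and `count_xids` are made up for illustration, and the regular expression only assumes the "NVRM: Xid (PCI:...): NN," line shape seen in the output above; it does not interpret what the codes mean.

```python
import re
from collections import Counter

# Matches lines such as:
#   NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
# The pattern is an assumption based on the dmesg lines shown above.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def count_xids(dmesg_text):
    """Return a Counter keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the dmesg output above.
    sample = """\
[ 3.947600] NVRM: Xid (PCI:0000:02:00): 62, 1d32(3818) 00000000 00000000
[ 48.336050] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80042000
[ 48.340309] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[ 63.852084] NVRM: Xid (PCI:0000:02:00): 32, Channel ID 00000000 intr 80002000
"""
    for (pci, xid), n in sorted(count_xids(sample).items()):
        print(f"PCI {pci}: Xid {xid} seen {n} time(s)")
```

The full output could be fed in by piping it through a file, for example saving `dmesg` to a text file and passing its contents to `count_xids`.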
Output of lspci | grep NVIDIA
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
Output of nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 34%   55C    P8    19W / 250W |    420MiB / 11169MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1097      G   /usr/lib/xorg/Xorg                           257MiB |
|    0      1851      G   compiz                                       160MiB |
+-----------------------------------------------------------------------------+