Hi,
I am running Ubuntu 17.04 with NVIDIA driver 387.12 and two GTX 1080 Ti cards. When I first boot the machine, nvidia-smi sees both GPUs. After the machine has idled for some hours, one of the GPUs disappears from nvidia-smi and becomes unusable. After a reboot I see both GPUs again, only to lose one again after some time. What could be the problem? Below are the commands I have tried and their output, captured after the GPU dropped out, in case anyone can make sense of it.
lspci | grep VGA
17:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
dmesg | grep -i nvrm | head
[170045.159496] NVRM: rm_init_adapter failed for device bearing minor number 1
[170051.110824] NVRM: RmInitAdapter failed! (0x24:0x65:1076)
[170051.110881] NVRM: rm_init_adapter failed for device bearing minor number 1
[170057.117426] NVRM: RmInitAdapter failed! (0x24:0x65:1076)
[170057.117504] NVRM: rm_init_adapter failed for device bearing minor number 1
[170063.129279] NVRM: RmInitAdapter failed! (0x24:0x65:1076)
[170063.129351] NVRM: rm_init_adapter failed for device bearing minor number 1
[170069.202843] NVRM: RmInitAdapter failed! (0x24:0x65:1076)
[170069.202919] NVRM: rm_init_adapter failed for device bearing minor number 1
[170075.126983] NVRM: RmInitAdapter failed! (0x24:0x65:1076)
...
nvidia-smi
Wed Dec 20 09:41:41 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.12                 Driver Version: 387.12                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
|  0%   29C    P5    22W / 300W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
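
To pin down exactly when the second card drops out, something like the following could be left running to log the moment a GPU stops being listed. This is only a rough sketch, not a fix: the function names (parse_bus_ids, missing_gpus, watch) and the 60-second interval are made up for illustration, and it assumes nvidia-smi is on the PATH and that the bus IDs it reports match the ones shown in the table above.

```python
import subprocess
import time
from datetime import datetime

# nvidia-smi query that prints one "index, bus_id" CSV row per visible GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"]


def parse_bus_ids(csv_text):
    """Return the set of PCI bus IDs found in the CSV output.

    Lines that are not of the form "<index>, <bus_id>" (for example error
    messages) are ignored.
    """
    bus_ids = set()
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 2 and parts[0].isdigit():
            bus_ids.add(parts[1])
    return bus_ids


def missing_gpus(expected, csv_text):
    """Return the expected bus IDs that are absent from the CSV output."""
    return sorted(set(expected) - parse_bus_ids(csv_text))


def watch(expected, interval_s=60):
    """Poll nvidia-smi forever and print a timestamped line when a GPU is missing."""
    while True:
        result = subprocess.run(QUERY, capture_output=True, text=True)
        gone = missing_gpus(expected, result.stdout)
        if gone:
            print(f"{datetime.now().isoformat()} missing: {gone}", flush=True)
        time.sleep(interval_s)


# Hypothetical usage, with the two bus IDs taken from this machine:
# watch(["00000000:17:00.0", "00000000:65:00.0"])
```

The timestamp of the first "missing" line can then be compared against the first NVRM error in dmesg to see whether the two events coincide.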