Our server went down recently, and the last thing I see in the logs is an NVRM-related error. Is this the likely culprit? Is there some way of understanding/diagnosing this and ensuring it doesn't happen again?
Ubuntu 20.04 (desktop)
May 03 08:43:06 data kernel: NVRM: GPU at PCI:0000:02:00: GPU-f504ebd8-2f9a-dd0a-da67-0df486b6c42f
May 03 08:43:06 data kernel: NVRM: GPU Board Serial Number: 0322616002793
May 03 08:43:06 data kernel: NVRM: Xid (PCI:0000:02:00): 61, pid=2281, 0a99(17e0) 00000000 00000000
May 03 08:43:19 data kernel: NVRM: Xid (PCI:0000:02:00): 8, pid=2232, Channel 00000001
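These are the only NVRM entries around the time it went down. For reference, they come from the kernel log; something like the following lists every NVRM/Xid message for the current boot (add -b -1 to look at the previous boot if the machine has been rebooted since):

$ journalctl -k | grep -iE 'NVRM|Xid'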
$ nvidia-smi
Mon May  3 10:21:37 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M4000        Off  | 00000000:02:00.0 Off |                  N/A |
| 47%   42C    P8    12W / 120W |     64MiB /  8121MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 00000000:81:00.0 Off |                    0 |
| 23%   42C    P8    23W / 235W |      5MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2246      G   /usr/lib/xorg/Xorg                 51MiB |
|    0   N/A  N/A      2482      G   /usr/bin/gnome-shell                9MiB |
|    1   N/A  N/A      2246      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
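Both cards show up and look idle in nvidia-smi now. The Xid messages reference PCI:0000:02:00, which matches GPU 0 (the Quadro M4000) above, not the Tesla. In case it's relevant, this is roughly what I was planning to run next to dump the affected card's full state and the Tesla's ECC/retired-page counters, though I'm not sure that's the right direction:

$ nvidia-smi -i 0 -q
$ nvidia-smi -i 1 -q -d ECC,PAGE_RETIREMENT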