NVRM crash?

Our server went down recently, and the last thing I see in the logs is an NVRM-related error. Is this the likely culprit? Is there a way to diagnose it and make sure it doesn’t happen again?

Ubuntu 20.04 (desktop)

May 03 08:43:06 data kernel: NVRM: GPU at PCI:0000:02:00: GPU-f504ebd8-2f9a-dd0a-da67-0df486b6c42f
May 03 08:43:06 data kernel: NVRM: GPU Board Serial Number: 0322616002793
May 03 08:43:06 data kernel: NVRM: Xid (PCI:0000:02:00): 61, pid=2281, 0a99(17e0) 00000000 00000000
May 03 08:43:19 data kernel: NVRM: Xid (PCI:0000:02:00): 8, pid=2232, Channel 00000001



$ nvidia-smi 
Mon May  3 10:21:37 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M4000        Off  | 00000000:02:00.0 Off |                  N/A |
| 47%   42C    P8    12W / 120W |     64MiB /  8121MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 00000000:81:00.0 Off |                    0 |
| 23%   42C    P8    23W / 235W |      5MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2246      G   /usr/lib/xorg/Xorg                 51MiB |
|    0   N/A  N/A      2482      G   /usr/bin/gnome-shell                9MiB |
|    1   N/A  N/A      2246      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

The Xid errors are (more or less) documented here:
https://docs.nvidia.com/deploy/xid-errors/index.html#topic_5_2

Unfortunately, I can’t offer more insight than that.
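To look the codes up in that document, it can help to first pull the Xid events out of the kernel log and group them by GPU. The following is only a minimal sketch: the regular expression is modelled on the four log lines quoted in the question, so other driver versions may format these lines differently, and the names `XID_RE` and `parse_xid_lines` are made up for this example.

```python
import re
from collections import defaultdict

# Modelled on lines such as:
#   NVRM: Xid (PCI:0000:02:00): 61, pid=2281, 0a99(17e0) 00000000 00000000
# Assumption: other driver versions may print a different layout.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+), (.*)")


def parse_xid_lines(lines):
    """Group Xid events by PCI address as (xid_code, remaining_text) tuples."""
    events = defaultdict(list)
    for line in lines:
        match = XID_RE.search(line)
        if match:
            pci, code, rest = match.groups()
            events[pci].append((int(code), rest.strip()))
    return dict(events)


# The log lines from the question, used here as sample input.
SAMPLE = [
    "May 03 08:43:06 data kernel: NVRM: GPU at PCI:0000:02:00: GPU-f504ebd8-2f9a-dd0a-da67-0df486b6c42f",
    "May 03 08:43:06 data kernel: NVRM: GPU Board Serial Number: 0322616002793",
    "May 03 08:43:06 data kernel: NVRM: Xid (PCI:0000:02:00): 61, pid=2281, 0a99(17e0) 00000000 00000000",
    "May 03 08:43:19 data kernel: NVRM: Xid (PCI:0000:02:00): 8, pid=2232, Channel 00000001",
]

for pci, xids in parse_xid_lines(SAMPLE).items():
    for code, rest in xids:
        print(f"{pci}: Xid {code} ({rest})")
```

On a live system the input could come from the kernel log of the boot that crashed, for example the output of `journalctl -k -b -1` if the journal is persistent. Both events here are on PCI:0000:02:00, which nvidia-smi lists as the Quadro M4000.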


Xid 61 can have many causes. Given that this server has been running for a long time, it may be a sign that the Quadro is beginning to fail. Alternatively, its fan may be clogged with dust so that the memory overheats under load; the fan is already running at 47% while idle.
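One way to keep an eye on the fan observation is to poll nvidia-smi and flag any GPU whose fan is spinning fast while the GPU is idle. This is a rough sketch, not a diagnosis: the 40% fan and 5% utilization thresholds are arbitrary assumptions chosen for illustration, and the function names are invented for this example. The query fields (`fan.speed`, `temperature.gpu`, `utilization.gpu`) are standard `nvidia-smi --query-gpu` fields, though some boards report them as not available.

```python
import csv
import io
import subprocess

FIELDS = "index,name,fan.speed,temperature.gpu,utilization.gpu"


def _to_int(text):
    """Return an int, or None for values such as '[N/A]'."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output for FIELDS."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 5:
            continue  # skip blank or unexpected lines
        index, name, fan, temp, util = (cell.strip() for cell in row)
        gpus.append({
            "index": _to_int(index),
            "name": name,
            "fan_pct": _to_int(fan),
            "temp_c": _to_int(temp),
            "util_pct": _to_int(util),
        })
    return gpus


def fans_high_while_idle(gpus, fan_pct=40, util_pct=5):
    """Return GPUs at or above fan_pct while at or below util_pct.

    Both thresholds are illustrative assumptions, not NVIDIA guidance.
    """
    return [
        g for g in gpus
        if g["fan_pct"] is not None and g["util_pct"] is not None
        and g["fan_pct"] >= fan_pct and g["util_pct"] <= util_pct
    ]


def query_gpus():
    """Run nvidia-smi and return the parsed rows (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


# Values taken from the nvidia-smi table in the question (both GPUs idle).
SAMPLE = "0, Quadro M4000, 47, 42, 0\n1, Tesla K40c, 23, 42, 0\n"

for gpu in fans_high_while_idle(parse_gpu_csv(SAMPLE)):
    print(f"GPU {gpu['index']} ({gpu['name']}): fan {gpu['fan_pct']}% at {gpu['util_pct']}% load")
```

On the server itself, `fans_high_while_idle(query_gpus())` would use live readings instead of the sample. Logging these values periodically under load would show whether temperatures climb before the next Xid event, which would point towards cooling rather than a failing board.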
