NVRM crash?

lukee2ni6 · May 3, 2021, 9:33am

Our server went down recently, and the last thing I see in the logs is an NVRM-related error. Is this the likely culprit? Is there some way of understanding or diagnosing it and making sure it doesn’t happen again?

Ubuntu 20.04 (desktop)

May 03 08:43:06 data kernel: NVRM: GPU at PCI:0000:02:00: GPU-f504ebd8-2f9a-dd0a-da67-0df486b6c42f
May 03 08:43:06 data kernel: NVRM: GPU Board Serial Number: 0322616002793
May 03 08:43:06 data kernel: NVRM: Xid (PCI:0000:02:00): 61, pid=2281, 0a99(17e0) 00000000 00000000
May 03 08:43:19 data kernel: NVRM: Xid (PCI:0000:02:00): 8, pid=2232, Channel 00000001



$ nvidia-smi 
Mon May  3 10:21:37 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M4000        Off  | 00000000:02:00.0 Off |                  N/A |
| 47%   42C    P8    12W / 120W |     64MiB /  8121MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 00000000:81:00.0 Off |                    0 |
| 23%   42C    P8    23W / 235W |      5MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2246      G   /usr/lib/xorg/Xorg                 51MiB |
|    0   N/A  N/A      2482      G   /usr/bin/gnome-shell                9MiB |
|    1   N/A  N/A      2246      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

Mart · May 3, 2021, 9:45am

The Xid errors are (more or less) documented here:
https://docs.nvidia.com/deploy/xid-errors/index.html#topic_5_2

Unfortunately I’m not able to give you more insight.
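
To look the codes up in that list, it helps to first pull the Xid numbers out of the kernel log. Below is a minimal Python sketch, assuming the log lines have the same shape as the ones posted above (for example as printed by `journalctl -k`). The names `XID_RE`, `parse_xid_lines` and `count_by_xid` are made up for illustration and are not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   NVRM: Xid (PCI:0000:02:00): 61, pid=2281, 0a99(17e0) 00000000 00000000
# The pattern is an assumption based on the lines in this thread only.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)

def parse_xid_lines(log_text):
    """Return one dict per Xid line found in log_text, in log order."""
    events = []
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append({
                "pci": m.group("pci"),
                "xid": int(m.group("xid")),
                "detail": m.group("detail").strip(),
            })
    return events

def count_by_xid(events):
    """Count how often each (PCI address, Xid number) pair occurs."""
    return Counter((e["pci"], e["xid"]) for e in events)

# The four log lines from the original post.
SAMPLE = """\
May 03 08:43:06 data kernel: NVRM: GPU at PCI:0000:02:00: GPU-f504ebd8-2f9a-dd0a-da67-0df486b6c42f
May 03 08:43:06 data kernel: NVRM: GPU Board Serial Number: 0322616002793
May 03 08:43:06 data kernel: NVRM: Xid (PCI:0000:02:00): 61, pid=2281, 0a99(17e0) 00000000 00000000
May 03 08:43:19 data kernel: NVRM: Xid (PCI:0000:02:00): 8, pid=2232, Channel 00000001
"""

if __name__ == "__main__":
    for event in parse_xid_lines(SAMPLE):
        print(event["pci"], event["xid"], event["detail"])
```

Run against a longer log, `count_by_xid` shows whether the same Xid keeps recurring on the same PCI address, which is the kind of pattern worth checking against the documentation linked above.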

generix · May 3, 2021, 10:10am

XID 61 can be caused by a lot of issues. Taking into account that this server has been working for a long time, it might be a hint that the Quadro is beginning to fail. Or its fans are full of dust, so the memory is overheating under load; note that the fan is running at 47% while idle.
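
To keep an eye on the fan and temperature readings over time, something like the following Python sketch could be run periodically. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output. The helper names and the thresholds (40% fan, 50 °C) are arbitrary values chosen for illustration, not NVIDIA recommendations.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in output column order.
QUERY_FIELDS = ["index", "name", "fan.speed", "temperature.gpu"]

def parse_query_output(text):
    """Parse 'nvidia-smi --query-gpu=... --format=csv,noheader,nounits' output."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if record:
            rows.append(dict(zip(QUERY_FIELDS, (f.strip() for f in record))))
    return rows

def query_gpus():
    """Run nvidia-smi and return one dict per GPU (requires the NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_query_output(out)

def high_fan_while_cool(rows, fan_pct=40, temp_c=50):
    """Return GPUs whose fan runs at or above fan_pct while at or below temp_c.

    Both thresholds are assumptions for this sketch; tune them per card.
    """
    flagged = []
    for row in rows:
        try:
            fan = int(row["fan.speed"])
            temp = int(row["temperature.gpu"])
        except ValueError:
            # Some boards report a non-numeric placeholder instead of a value.
            continue
        if fan >= fan_pct and temp <= temp_c:
            flagged.append(row)
    return flagged

# Values taken from the nvidia-smi table in the original post.
SAMPLE_OUTPUT = "0, Quadro M4000, 47, 42\n1, Tesla K40c, 23, 42\n"

if __name__ == "__main__":
    for gpu in high_fan_while_cool(parse_query_output(SAMPLE_OUTPUT)):
        print(gpu["index"], gpu["name"], gpu["fan.speed"], gpu["temperature.gpu"])
```

With the readings from the post, only the Quadro M4000 is flagged. A high idle fan speed is not proof of a fault on its own, but logging these values alongside the Xid events may show whether the errors line up with heat.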
