NVRM XID 79 on Ubuntu 20.04

nvidia-bug-report.log.gz (1.4 MB)

root@frlab05:/home/ubuntu# cat /var/log/kern.log

Nov 21 20:02:00 frlab05-ens5f0 kernel: [17794888.749208] NVRM: GPU at PCI:0000:1b:00: GPU-be958551-d52c-ad0a-d259-ba525bfadcb3
Nov 21 20:02:00 frlab05-ens5f0 kernel: [17794888.749316] NVRM: GPU Board Serial Number:
Nov 21 20:02:00 frlab05-ens5f0 kernel: [17794888.749321] NVRM: Xid (PCI:0000:1b:00): 43, pid=1311386, Ch 00000008
Nov 22 10:55:41 frlab05-ens5f0 kernel: [17848509.423142] NVRM: Xid (PCI:0000:1b:00): 43, pid=2801550, Ch 00000008
Nov 22 13:32:45 frlab05-ens5f0 kernel: [17857932.576056] NVRM: Xid (PCI:0000:1b:00): 43, pid=3056685, Ch 00000008
Nov 23 12:51:02 frlab05-ens5f0 kernel: [17941828.707203] NVRM: Xid (PCI:0000:1b:00): 79, pid=0, GPU has fallen off the bus.
Nov 23 12:51:02 frlab05-ens5f0 kernel: [17941828.707252] NVRM: GPU 0000:1b:00.0: GPU has fallen off the bus.
Nov 23 12:51:02 frlab05-ens5f0 kernel: [17941828.707344] NVRM: GPU 0000:1b:00.0: GPU is on Board ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
Nov 23 12:51:02 frlab05-ens5f0 kernel: [17941828.707356] NVRM: A GPU crash dump has been created. If possible, please run
Nov 23 12:51:02 frlab05-ens5f0 kernel: [17941828.707356] NVRM: nvidia-bug-report.sh as root to collect this data before
Nov 23 12:51:02 frlab05-ens5f0 kernel: [17941828.707356] NVRM: the NVIDIA kernel module is unloaded.


root@frlab05:/home/ubuntu# nvidia-smi

Unable to determine the device handle for GPU 0000:1B:00.0: Unknown Error

SYSTEM - Supermicro 4029GP-TRT2
GPU - 8 x RTX 3090
OS - Ubuntu 20.04 LTS
I’m running 40 servers with the same specification.
On some of them, a GPU error like the one above occurs while the server is in use.
I would like to know what could be causing this.

Thanks.

Possible causes: nvidia-persistenced not running, insufficient power during load peaks, overheating, or a defective GPU.
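
With 40 identical servers, one way to start narrowing these causes down is to tally Xid events per PCI address on each machine: a single slot that fails repeatedly points more toward a defective card, while errors spread across many GPUs point more toward power or cooling. The following is only a minimal sketch, not an official tool. The function name tally_xids and the regular expression are my own assumptions, written against the log format shown in the post, and the /var/log/kern.log path is the one used there.

```python
import re
from collections import Counter

# Matches lines such as:
#   NVRM: Xid (PCI:0000:1b:00): 79, pid=0, GPU has fallen off the bus.
# The pattern is an assumption based on the log excerpt in this thread.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def tally_xids(lines):
    """Count (pci_address, xid_code) pairs in an iterable of kernel log lines."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    # Path taken from the post; adjust if your system logs elsewhere.
    with open("/var/log/kern.log", errors="replace") as log:
        for (pci, xid), n in sorted(tally_xids(log).items()):
            print(f"{pci}  Xid {xid}: {n}")
```

Run against the excerpt in the question, this would count three Xid 43 events and one Xid 79 event on 0000:1b:00. Comparing the output across all servers should show whether the failures cluster on particular slots or hosts.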