A100: GPU freezes after process gets OOMKilled

I’m running two A100 GPUs on a Dell PowerEdge R640 with RHEL 8 and driver version 515.48.07. Whenever a process that is using the GPU gets killed by the Linux OOM killer, the GPU freezes. No other process can use the GPU, and the “nvidia-smi” command hangs. The kernel log shows the GPU logging this message over and over:

NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).

That Xid error appears to be undocumented. The full output of “dmesg -T” is attached: kern.log (11.0 KB)

The only way I’ve been able to recover from this situation is to reboot the whole server. Since this server is running Kubernetes with the NVIDIA gpu-operator, OOM kills are common (containers exceed their RAM limits) and reboots are pretty disruptive. I’ve filed a ticket with Red Hat, but they said to ask here as well. Any ideas?
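
Until there is a real fix, it can help to at least detect the hang automatically instead of finding out when nvidia-smi stalls. Below is a minimal Python sketch that picks Xid 119 "Timeout waiting for RPC from GSP" entries out of kernel log lines. The regular expression is inferred only from the log lines quoted in this thread, so treat the field layout as an assumption; the names `XidEvent`, `parse_xid` and `find_gsp_timeouts` are made up for this example.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Pattern inferred from the NVRM lines quoted in this thread (an assumption,
# not an official format description). pid may be quoted, e.g. pid='<unknown>'.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
    r"\s*pid='?(?P<pid>[^,']+)'?,"
    r"\s*name=(?P<name>[^,]+),"
    r"\s*(?P<message>.*)"
)


class XidEvent(NamedTuple):
    pci: str      # PCI address as printed by the driver
    xid: int      # Xid error number
    pid: str      # kept as text because it can be "<unknown>"
    name: str     # process name, or "<unknown>"
    message: str  # remainder of the log line


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the Xid fields found in one kernel log line, or None."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return XidEvent(m["pci"], int(m["xid"]), m["pid"].strip(),
                    m["name"].strip(), m["message"].strip())


def find_gsp_timeouts(lines: Iterable[str]) -> List[XidEvent]:
    """Keep only Xid 119 events that report a GSP RPC timeout."""
    events = (parse_xid(line) for line in lines)
    return [e for e in events
            if e is not None
            and e.xid == 119
            and "Timeout waiting for RPC from GSP" in e.message]
```

To use it, pass in the lines of the kernel log, for example the contents of a saved “dmesg -T” dump split with `splitlines()`, and alert or cordon the node when the returned list is not empty.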

There is a workaround for this issue mentioned in NVIDIA/dcgm-exporter issue #84 on GitHub, “GPU freezes when dcgm-exporter is SIGKILL'd”.

A800 GPU
CentOS 7
Driver Version: 525.85.12
We have the same problem: the GPU goes into an error state, and these messages appear in dmesg:

Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98)
[196959.163565] NVRM: Xid (PCI:0000:b1:00): 119, pid=3556, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (xa55a7 x)
[196995.164496] NVRM: Xid (PCI:0000:b1:00): 119, pid=3556, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (xa55a0060 0x0).
[197040.166462] NVRM: Xid (PCI:0000:b1:00): 119, pid=3585, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (x20809009 0x8).
[197085.167381] NVRM: Xid (PCI:0000:b1:00): 119, pid=3585, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208a4c 0x4).
[197130.182314] NVRM: Xid (PCI:0000:b1:00): 119, pid=3585, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x280a4c 0x4).

I encountered this issue on an L4 with driver version 535.86.10.

[ 1339.676676] NVRM: Xid (PCI:0007:01:00): 119, pid=34378, name=nvidia-smi, Timeout waiting for RPC from GSP0! Expected function 76 (GSP_RM_CONTROL) (0x0 0x0).

UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
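
The register dump above looks like the PCIe Advanced Error Reporting section that “lspci -vv” prints, where a trailing “+” means the bit is set and “-” means it is clear. Assuming that format, here is a small Python sketch that lists which bits are set on a line, so the interesting ones are easier to spot. The function names `parse_aer_flags` and `set_flags` are made up for this example.

```python
from typing import Dict, List, Tuple


def parse_aer_flags(line: str) -> Tuple[str, Dict[str, bool]]:
    """Split a line like 'UESta: DLP- SDES+ ...' into (register, {flag: is_set})."""
    register, _, rest = line.partition(":")
    flags: Dict[str, bool] = {}
    for token in rest.split():
        # Each token is a flag name followed by '+' (set) or '-' (clear).
        if token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return register.strip(), flags


def set_flags(line: str) -> Tuple[str, List[str]]:
    """Return the register name and only the flags that are set, in line order."""
    register, flags = parse_aer_flags(line)
    return register, [name for name, is_set in flags.items() if is_set]
```

Applied to the lines above, this reports UnsupReq as the only bit set in UESta (uncorrectable error status) and AdvNonFatalErr as the only bit set in CESta (correctable error status).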