A100: GPU freezes after process gets OOMKilled

I’m running two A100 GPUs in a Dell PowerEdge R640 running RHEL 8 with driver version 515.48.07. Whenever a process using one of the GPUs gets killed by the Linux OOM killer, that GPU freezes: no other process can use it, and the “nvidia-smi” command hangs. The kernel log shows the GPU logging this message over and over:

NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).

That Xid error appears to be undocumented. The full output of “dmesg -T” is attached as kern.log (11.0 KB).
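For anyone hitting the same thing: the repeating message is easy to spot by filtering the kernel log, which is all the attached file is (the full, unfiltered output):

    # show only the NVRM Xid lines, with human-readable timestamps
    dmesg -T | grep -i 'NVRM: Xid'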

The only way I’ve found to recover is to reboot the whole server. Since this server runs Kubernetes with the NVIDIA GPU Operator, OOM kills are common (containers exceeding their RAM limits), and reboots are pretty disruptive. I’ve filed a ticket with Red Hat, but they suggested asking here as well. Any ideas?
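In case it helps anyone monitor for this, a rough way to detect the stuck state without waiting on a hung nvidia-smi is to wrap it in a timeout (just a sketch, nothing specific to the GPU Operator):

    # treat the GPU as wedged if nvidia-smi doesn't complete within 15 seconds
    if ! timeout 15 nvidia-smi > /dev/null 2>&1; then
        echo "GPU appears stuck (nvidia-smi timed out or failed)"
    fi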

There is a workaround for this issue mentioned here: GPU freezes when dcgm-exporter is SIGKILL'd · Issue #84 · NVIDIA/dcgm-exporter · GitHub
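In case that link goes stale: assuming the fix described there is the commonly suggested one of disabling GSP firmware offload (Xid 119 is a GSP RPC timeout, which matches the log line above), applying it on RHEL 8 would look roughly like the following. The .conf file name is arbitrary, and the exact steps should be checked against the issue:

    # disable GSP firmware offload for the nvidia kernel module
    echo "options nvidia NVreg_EnableGspRm=0" | sudo tee /etc/modprobe.d/nvidia-gsp.conf

    # rebuild the initramfs so the option is applied at boot, then reboot
    sudo dracut -f
    sudo reboot

    # after the reboot, the GSP firmware version should be reported as N/A
    nvidia-smi -q | grep -i gsp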