Preemption of processes causes GPUs to be unusable

Hi there,

We are currently running a Slurm cluster with preemption enabled. This means that user processes running on the GPUs (P100) can be killed at any time, without the user program getting a chance to perform any cleanup.
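
For context, a handler along the lines of the sketch below never gets a chance to run under our preemption settings. This is a minimal Python sketch, assuming hypothetically that Slurm were configured with a grace period and delivered SIGTERM before SIGKILL:

```python
import signal
import sys

def cleanup_and_exit(signum, frame):
    # Hypothetical cleanup: release the CUDA context, flush checkpoints, etc.
    # Under our preemption settings this never runs, because jobs are
    # killed outright rather than signalled with a grace period.
    print(f"Received signal {signum}, cleaning up before exit", flush=True)
    sys.exit(0)

# Assumes Slurm would send SIGTERM ahead of SIGKILL (grace period > 0),
# which is not the case on our cluster.
signal.signal(signal.SIGTERM, cleanup_and_exit)

# ... normal GPU workload would run here ...
```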

Recently we’ve noticed problems with our P100 GPUs (CUDA version 11.2, driver version 460.32.03, 4 GPUs per node). The symptoms are very similar to the thread "Program runs with status D": many of our processes end up in the D+ (uninterruptible sleep) state, and the issue can be triggered simply by running the deviceQuery sample or by the CUDA initialization calls in frameworks such as TensorFlow or PyTorch. This leaves the affected nodes unusable. Is there any long-term solution other than rebooting the nodes? Rebooting causes significant outages on our cluster.
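
To illustrate what we mean by a CUDA init call, something as small as the following hangs on an affected node (a minimal sketch using PyTorch; deviceQuery from the CUDA samples behaves the same way):

```python
import torch

# On an affected node this call (or any other CUDA context creation,
# e.g. importing TensorFlow and listing GPUs) never returns, and the
# process ends up stuck in the D+ (uninterruptible sleep) state.
torch.cuda.init()
print(torch.cuda.device_count())
```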

We are running:

OS: Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-74-generic x86_64)
GPU: 4 x Tesla P100-PCIE-12GB
Driver Version: 460.32.03, CUDA Version: 11.2
Host: 32 x Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, 180 GB RAM

We believe the GPU driver is stuck in an unusable state: nvidia-smi also reports incorrect information, showing 100% utilization on GPUs that have no processes running on them.
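
This is how we observe the mismatch (a small sketch using the pynvml Python bindings, assuming they are installed; plain nvidia-smi output shows the same thing):

```python
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    # On the affected nodes utilization reads 100% even though the
    # list of compute processes is empty.
    print(f"GPU {i}: util={util.gpu}%  compute procs={len(procs)}")
pynvml.nvmlShutdown()
```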

Thanks!

Best,
Gerald