Preemption of processes leaves GPUs unusable

Hi there,

We are currently running a Slurm cluster with preemption enabled, which means that user processes running on the GPUs (P100) can be killed without the user program getting any chance to clean up.
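On our side we have considered trapping SIGTERM in the jobs themselves, since Slurm can be configured (e.g. with a GraceTime, or `--signal` on the job) to deliver SIGTERM some seconds before SIGKILL. A minimal sketch of such a handler, assuming that grace period exists; the actual cleanup calls (checkpointing, releasing the CUDA context) are placeholders:

```python
import signal
import sys

def handle_sigterm(signum, frame):
    # Placeholder cleanup: in a real job this might flush a checkpoint
    # or tear down the CUDA context before the process exits.
    print("SIGTERM received, cleaning up before preemption kill")
    sys.exit(0)

# Install the handler so preemption triggers cleanup instead of an abrupt kill.
signal.signal(signal.SIGTERM, handle_sigterm)
```

This only helps if the kill is SIGTERM-first; it does nothing against an immediate SIGKILL, which may well be the case that wedges the driver.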

Recently we’ve noticed problems with our P100 GPUs (CUDA version 11.2, driver version 460.32.03, 4 GPUs/node). The symptoms are very similar to Program runs with status D: many of our processes end up in the D+ (uninterruptible sleep) state, which can be triggered simply by running the deviceQuery sample or by CUDA init calls in frameworks like TensorFlow or PyTorch, leaving the nodes unusable. Is there any long-term solution other than rebooting the nodes? Rebooting causes significant outages in our cluster.
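For reference, this is roughly how we spot the stuck processes on a node; `wchan` shows the kernel function each process is blocked in, which for the affected CUDA processes points into the driver:

```shell
# List processes in uninterruptible sleep (STAT starting with "D"),
# keeping the header row; wchan:32 widens the wait-channel column.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
```

Processes in this state cannot be killed, even with SIGKILL, which is why only a reboot has cleared them so far.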

We are running:
Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-74-generic x86_64)
4 x Tesla P100-PCIE-12GB
Driver Version: 460.32.03, CUDA Version: 11.2
32 x Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz

We believe the GPU driver is stuck in an unusable state, as nvidia-smi also reports odd information: utilization is falsely shown as 100% while no process is actually running on the GPUs.
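To illustrate the mismatch, we compare the reported utilization against the list of compute processes (both are standard nvidia-smi query options):

```shell
# Per-GPU utilization as reported by the driver
nvidia-smi --query-gpu=index,utilization.gpu --format=csv

# Compute processes actually holding a context on any GPU;
# on the affected nodes this list is empty while utilization reads 100%
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```

The second query returning nothing while the first reports 100% utilization is what makes us suspect the driver rather than a runaway user process.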