We are seeing a strange hang on our servers running GeForce GTX 1080 Ti cards. Processes that attempt to use the GPUs hang in process state “S” (“interruptible sleep”, waiting for an event to complete), and the load average climbs extremely high. Part of that is self-inflicted: I kick off a process every 5 minutes that queries the management library for GPU stats, and those processes pile up when the GPUs hang.
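To keep the stats collectors from piling up while the hang itself is investigated, each run could be guarded by a lock and a timeout. This is only a minimal sketch, assuming the stats are gathered by calling `nvidia-smi` (the actual collector is not shown in this post); `collect_stats`, the lock path, and the timeout value are hypothetical names and defaults chosen for illustration.

```python
import fcntl
import subprocess

# Hypothetical defaults: the real collector command is not shown in this post.
STATS_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader",
]
LOCK_PATH = "/tmp/gpu-stats.lock"


def collect_stats(cmd=STATS_CMD, lock_path=LOCK_PATH, timeout_s=60):
    """Run cmd once and return its stdout.

    Returns None if another run still holds the lock, or if cmd
    does not finish within timeout_s seconds.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if a previous
            # run is still holding it, so new runs exit instead of queueing.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return None
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child on timeout before raising.
            return None
        return result.stdout
    # The lock is released when lock_file is closed on leaving the with block.


if __name__ == "__main__":
    stats = collect_stats()
    print(stats if stats is not None else "skipped: previous run busy or timed out")
```

The kill on timeout only helps while the stuck process is in an interruptible state such as “S”; a process in uninterruptible sleep would not respond to it, and the lock would then simply cause later runs to skip.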
I’m attaching the output of “nvidia-bug-report.sh” to this post; hopefully it is enough to help pinpoint the problem.
nvidia-bug-report.log.gz (55.1 KB)