Xorg hangs with 100% CPU usage, can't be killed, dmesg spam about idling display engine

We have a recurring issue where, when running OpenGL programs, Xorg hangs, starts using 100% CPU and cannot be killed. At the same time as the Xorg hang, we see a flood of messages from nvidia-modeset in the kernel message buffer. gdb is also unable to attach to the Xorg process, so we can't see where it's spinning.

We’ve experienced this on at least three separate machines with Nvidia cards and the proprietary driver in use. The only resolution we’ve found is to reboot. We’ve experimented a lot with proprietary driver versions recently, so it’s unclear exactly when this started happening, or whether the issue is tied to a particular proprietary driver version. In both of the instances we have logs for, we were using proprietary driver version 384.90.

On two affected machines, the Nvidia cards are 3x Quadro K5000. On the third affected machine, the Nvidia cards are 3x Quadro M4000. Unfortunately, we don’t have nvidia-bug-report.sh logs for the M4000 machine, but we will attach them in a follow-up if we see the same hang there again.

Here are links to nvidia-bug-report.sh output from two incidents on two different machines: