I am experiencing intermittent problems with the Quadro M4000 card under Linux, and I am looking for guidance on how to isolate the problem and determine whether it’s actually the driver or something else. I’ve seen the problem with NVIDIA driver versions 358.16 (installed through RPM Fusion), 361.28, and 361.42 (both installed through the NVIDIA binary installer).
I run software on a cluster that drives a video wall, with one display per computer. My system launches processes that show output on all the screens at once, and it coordinates the computers in the cluster so that the video wall acts as a single display with a total resolution of 9600x3240 or 5760x3240 pixels. I have two new sets of machines: one is a set of rackmount machines made by Dell, and the other is a set of Shuttle form factor machines made by OriginPC. Both sets have NVIDIA Quadro M4000 cards, and both show the problem. I have other sets of machines (Mac minis and Xi3 x7a) with AMD graphics that do not show the problem. All machines are running Fedora 22. I never saw this problem on any previous set of machines, which ran a variety of operating systems and had a variety of graphics cards (NVIDIA or AMD).
The problem manifests itself like this. When a new program starts (the same program starts on all the machines in the cluster), very occasionally the screen on one or more machines will display only one frame every 30 seconds or so. The program appears to be running normally, but the display updates extremely slowly. When the display finally updates, it has caught up to the other machines in the cluster, but it then shows that frame for another 30-60 seconds. Sometimes the display recovers on its own, and sometimes it keeps running this way for a long time. Sometimes switching the application running on the machine fixes the problem, and sometimes the problem persists across application changes. Rebooting seems to fix the problem for a while. I’ve seen the problem occur with different applications. It is particularly frustrating because it doesn’t happen often and is not consistently reproducible. It also seems to clear up, and cannot be reproduced at all, once the machine has been running for a while (anywhere from half an hour to a couple of hours).
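To pin down when a stall begins on a given machine, one option would be to log a timestamp for every frame presented and flag the large gaps, so the stall times can be compared against the system and X server logs. Below is a minimal sketch of that idea. It assumes each application can record a timestamp per frame; the function name `find_stalls` and the 1-second threshold are placeholders for illustration, not part of my actual software.

```python
def find_stalls(timestamps, threshold=1.0):
    """Return (start, end, gap) for each pair of consecutive frame
    timestamps (in seconds) that are more than `threshold` apart."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > threshold:
            stalls.append((prev, cur, gap))
    return stalls


if __name__ == "__main__":
    # Made-up timestamps: three normal frames, then a ~30 s stall.
    frames = [0.0, 0.016, 0.033, 30.5, 30.516]
    for start, end, gap in find_stalls(frames):
        print(f"stall of {gap:.1f}s between t={start:.3f} and t={end:.3f}")
```

If the application's own frame loop shows no gaps while the screen is visibly stuck, that would point below the application (driver, X server, or display path) rather than at the program itself.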
Any guidance on how I might isolate this problem would be greatly appreciated.