quad 780 GTX crashing after 30 hours (X79 board, 325.15 drivers, Ubuntu 12.04)

Hello -

I have two quad GTX 780 systems running on X79 boards for CUDA on Ubuntu 12.04. On the GTX 780s, the driver drops out after about 30 hours of runtime. The system then reports that the Nvidia driver is no longer running. I have read about a 36-hour TDR bug that has been fixed on Windows. Is that related? Which driver do you recommend for Linux? I have no problems with my 6 Titans running on 319.32. None of the 319 drivers see the four 780 cards; 325.08 worked for some time (< 30 hours); 325.15 has crashed once, and I am trying to reproduce it.

I don’t have an nvidia-bug-report.log for this yet because I reinstalled the drivers after the crash.
EDIT: I have now attached the bug report and logfile.


nvidia.tar (120 KB)

Attached the bug report.

These crashes run like clockwork. Exactly every 36 hours the following happens:

  1. nvidia-smi reports errors and wrong fan speeds
  2. After some time (< 10 min) I can’t use X terminals any longer and have to do a hard reboot
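
To pin down when the failure starts, the symptoms above could be timestamped with a small watchdog. This is only a sketch under stated assumptions: the names `poll_once` and `watch` are hypothetical, it assumes `nvidia-smi` is on the PATH, and it assumes that a non-zero exit code or a hang is a usable sign of the dropout (not confirmed for this bug). The elapsed time it records is measured from the start of the script, not from driver load.

```python
import subprocess
import time
from datetime import datetime


def poll_once(cmd=("nvidia-smi",), timeout_s=30):
    """Run the command once and return (ok, combined_output).

    Assumption: a non-zero exit code, a missing binary, or a timeout
    counts as a failure.
    """
    try:
        proc = subprocess.run(
            list(cmd),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, str(exc)
    return proc.returncode == 0, proc.stdout


def watch(logfile, interval_s=60, max_polls=None, poll=poll_once, sleep=time.sleep):
    """Poll until the first failure and append one log line per poll.

    Returns the hours elapsed since this function started when the first
    failure is seen, or None if max_polls is reached without a failure.
    """
    start = time.time()
    polls = 0
    while max_polls is None or polls < max_polls:
        ok, output = poll()
        elapsed_h = (time.time() - start) / 3600.0
        with open(logfile, "a") as fh:
            fh.write("%s elapsed_h=%.2f status=%s\n" % (
                datetime.now().isoformat(), elapsed_h, "OK" if ok else "FAIL"))
            if not ok:
                # Keep the raw output of the failing poll for the bug report.
                fh.write(output + "\n")
        if not ok:
            return elapsed_h
        polls += 1
        sleep(interval_s)
    return None
```

Calling `watch("nvidia-watch.log")` right after boot would leave a log whose last entry shows roughly how many hours passed before the first failed poll, which can then be compared against the 30 to 36 hour pattern.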

Looks like other people have similar problems. Please email me if you have some pointers. Still hopeful but getting very frustrated.

I’m in the same situation. In my case it occurs with 2 GTX 780s running on Ubuntu 12.04. Each graphics card runs in a separate PC as a secondary video card, but when nvidia-smi reports errors and wrong fan speeds, X also stops, even though the cards are not connected to a monitor (they are used only for CUDA computations).

Please provide detailed reproduction steps. What application are you running, and for how long? Does the system crash in an idle state?

Thanks for responding, sandipt, but I’ve resolved the problem by installing driver 319.60. I had previously tried earlier 319 versions without success. In my case, the problem occurred without any program apparently running on the graphics cards. In those PCs the cards were configured as secondary video cards and set up for CUDA computing, but nothing needed to be running on them for the crash to occur. Those PCs were running Ubuntu 12.10, CPU i7-3820, 16GB RAM, etc…