I. Description of issue
I recently took delivery of a new GPU server, which has been crashing periodically since I installed the Nvidia drivers. After a few days of frustration, I’ve found that the problem crops up when I run nvidia-smi, but only intermittently. For instance, simply running nvidia-smi can sometimes cause a crash, and nvidia-smi -l 1 is pretty much guaranteed to cause a crash within an hour or so. I’ve tested with two different drivers (installed via apt-get install nvidia-XXX), both of which have this issue:
- 367.44
- 370.23
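To pin down which nvidia-smi call is in flight when the machine goes down, a small polling loop that writes to disk before and after every call might help. The following is only a sketch and was not part of the testing described here: the function name probe and the log file name are hypothetical, and spawning a fresh process each second is an approximation of nvidia-smi -l 1 (which is a single long-lived process), so it may not trigger the crash in the same way.

```python
import datetime
import os
import subprocess
import time


def probe(command, iterations, interval, log_path):
    """Run `command` repeatedly, logging a line before and after each call.

    Every log line is flushed and fsync'ed, so if the machine goes down
    mid-call the last line on disk should show which invocation was running.
    Returns the list of exit codes that were observed.
    """
    exit_codes = []
    with open(log_path, "a") as log:

        def record(message):
            # Timestamp each entry and force it to disk immediately.
            stamp = datetime.datetime.now().isoformat(timespec="seconds")
            log.write(f"{stamp} {message}\n")
            log.flush()
            os.fsync(log.fileno())

        for i in range(1, iterations + 1):
            record(f"start {i}")
            # Output is discarded; only the exit status is kept.
            result = subprocess.run(
                command,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            exit_codes.append(result.returncode)
            record(f"end {i} exit={result.returncode}")
            time.sleep(interval)
    return exit_codes
```

Calling probe(["nvidia-smi"], iterations=3600, interval=1.0, log_path="nvidia-smi-probe.log") would poll roughly once per second for about an hour. After a crash and reboot, a trailing "start N" line with no matching "end N" would suggest the crash happened during that call.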
In case they are useful, details about the software and hardware, as well as the output of nvidia-bug-report.sh, follow.
II. Software details
- OS: Ubuntu 16.04 LTS
- Driver version: 367.44
III. Hardware details
- CPU: 2 x Intel Xeon E5-2680 v4
- Motherboard: SuperMicro X10DRG-O+-CPU
- GPU: 6 x Nvidia GTX 1080
- Memory: 24 x 16GB DDR4-2400 ECC operating at 1600 MT/s
IV. Log file
nvidia-bug-report.sh was run upon reboot after the most recent crash, and the output is available at [url]https://drive.google.com/file/d/0B2tuXP9BWQtnNXBoZ3JGZkZBejA/view?usp=sharing[/url].