The system appears to function perfectly most of the time. However, I can produce a full system lockup (the mouse cursor stops moving and the machine no longer responds to pings) by joining a Google Hangouts conference (the lockup follows after a minute or two) or by running a cuDNN job.
- 1 - Test hardware/OS
Supermicro X7-DWA-N MB, Dual Xeon E5472, 16 GB RAM, MSI GTX 1060 ARMOR 3G OCV1 NVIDIA GEFORCE GDDR5, Zalman ZM1000-EBT 1000 W power supply, HP Z34c Quad HD LCD Monitor (connected via DisplayPort), Ubuntu 14.04.5 LTS x86_64, Kernel 4.4.0-47-generic (also tested with 3.19.74).
All tests were run with Linux drivers 367.57 and 375.20. Nouveau is blacklisted and no nouveau modules are loaded:
rab-wksta:~$ sudo lsmod | grep nv
nvidia_drm 53248 1
nvidia_modeset 790528 5 nvidia_drm
nvidia 11911168 81 nvidia_modeset
drm_kms_helper 151552 1 nvidia_drm
drm 360448 4 drm_kms_helper,nvidia_drm
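One caveat about the check above: the string "nouveau" does not contain "nv", so `grep nv` cannot match a `nouveau` line directly. Checking by module name is more direct. Below is a minimal sketch of such a check, assuming lsmod-style text as input (for example the contents of /proc/modules, which uses the same first columns). The function names `parse_lsmod` and `nouveau_loaded` are hypothetical helpers, not part of any NVIDIA tool.

```python
def parse_lsmod(text):
    """Map module name -> list of modules using it, from lsmod-style text."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module Size Used by" header row.
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        users = fields[3].split(",") if len(fields) > 3 else []
        modules[fields[0]] = [u for u in users if u]
    return modules


def nouveau_loaded(text):
    """Return True if a module named exactly 'nouveau' is listed."""
    return "nouveau" in parse_lsmod(text)


# The lsmod output quoted in this report.
sample = """\
nvidia_drm 53248 1
nvidia_modeset 790528 5 nvidia_drm
nvidia 11911168 81 nvidia_modeset
drm_kms_helper 151552 1 nvidia_drm
drm 360448 4 drm_kms_helper,nvidia_drm
"""
print(nouveau_loaded(sample))
```

For the output quoted here, the users listed for `drm` are only `drm_kms_helper` and `nvidia_drm`, which is consistent with nouveau not being loaded.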
Note that there is nothing in syslog or dmesg. Apparently the hard freeze is so sudden that nothing gets a chance to be logged.
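Since nothing reaches syslog, one way to capture the GPU state just before the freeze is to sample `nvidia-smi` once per second and force every sample to disk. The following is a rough sketch under stated assumptions, not something that was run for this report: the choice of query fields, the one-second interval, and the helper names `sample_nvidia_smi`, `append_durable` and `log_gpu_state` are all illustrative. Calling `os.fsync` improves the odds that the last sample survives a hard lockup, but it cannot guarantee it.

```python
import os
import subprocess
import time

# Fields for `nvidia-smi --query-gpu`; this particular selection is an assumption.
QUERY_FIELDS = "timestamp,temperature.gpu,clocks.sm,clocks.mem,power.draw,utilization.gpu"


def sample_nvidia_smi():
    """Return the current GPU state as CSV text (one line per GPU)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY_FIELDS, "--format=csv,noheader"]
    )
    return out.decode().strip()


def append_durable(path, line):
    """Append a line and push it to disk before returning."""
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()              # flush Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to write it to the device


def log_gpu_state(path, interval_s=1.0, iterations=None, sampler=sample_nvidia_smi):
    """Log one sample per interval; runs forever when iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        append_durable(path, sampler())
        count += 1
        time.sleep(interval_s)
```

Running `log_gpu_state("gpu_state.log")` in a second terminal before starting the job (the file name is only an example) leaves a file whose last line shows the temperature, clocks and utilization shortly before the machine stopped responding.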
- 2 - Reproduce
I can reproduce the lockup consistently by using cuDNN. I have the CUDA Toolkit installed along with TensorFlow. If I attempt to run a test on TensorFlow, the entire machine locks up after a minute or two. I am running the TensorFlow test ./tensorflow/models/image/mnist.
I keep an NVIDIA monitoring tool on screen: the temperature never exceeds 53 °C, and both the GPU clock and the memory clock stay well below full throttle.
This occurs with both driver versions 367.57 and 375.20, and kernels from 3.19.x to 4.4.0-47.
- 3 - Is it just when the card is being used to its fullest capability?
I can run the Unigine_Heaven-4.0 benchmark on the graphics card and after 30 minutes the machine has still not locked up. The clock and temperature are quite a bit higher than when running the cuDNN job.
- 4 - Hardware or Linux drivers? What about under Windows 10?
Because I wanted to ensure that there were no hardware issues, I performed the following test:
- I disconnected my machine’s drives (Linux OS)
- I attached a new hard drive
- I installed Windows 10
- I installed the latest versions of the NVIDIA drivers, CUDA Toolkit, cuDNN, and TensorFlow
- I ran the same TensorFlow sample 3 consecutive times (tensorflow/models/image/mnist)
The machine completed the tests all 3 times without a hitch. I have never been able to complete the test on Linux.
- 5 - Problem with specific card or not?
At my own expense, I purchased the following card:
ZOTAC GeForce GTX 1050 Ti OC Edition 4GB GDDR5 128-bit DL-DVI Graphic Card (ZT-P10510B-10L)
Running driver 375.20 with this card, I was able to reproduce the problem: the training never completed and the system froze hard.
nvidia-bug-report.log.gz (118 KB)