Hard system freeze with Linux drivers. Tested with both a GTX 1060 and a GTX 1050 Ti. Same hardware w...

The system appears to function perfectly most of the time. However, I can produce a full system lockup (the mouse cursor stops moving and the machine no longer responds to pings) by joining a Google Hangouts conference (the freeze comes after a minute or two) or by running a cuDNN job.

  • 1 - Test hardware/OS

Supermicro X7-DWA-N MB, Dual Xeon E5472, 16GB RAM, MSI GTX 1060 ARMOR 3G OCV1 NVIDIA GEFORCE GDDR5, Zalman ZM1000-EBT 1000w power supply, HP Z34c Quad HD LCD Monitor (connected via DisplayPort), Ubuntu 14.04.5 LTS x86_64, Kernel 4.4.0-47-generic (also tested with 3.19.74).

All tests were run with Linux drivers 367.57 and 375.20. Nouveau is blacklisted and none of its modules are loaded:

rab-wksta:~$ sudo lsmod | grep nv
nvidia_drm 53248 1
nvidia_modeset 790528 5 nvidia_drm
nvidia 11911168 81 nvidia_modeset
drm_kms_helper 151552 1 nvidia_drm
drm 360448 4 drm_kms_helper,nvidia_drm
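
The same check can be scripted so it is easy to repeat after every driver or kernel change. This is only a minimal sketch, assuming the usual lsmod column layout (module name, size, use count, dependents); the helper name parse_lsmod is hypothetical and not part of any NVIDIA or Ubuntu tool.

```python
def parse_lsmod(text):
    """Return {module_name: [dependent modules]} from lsmod-style text."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module Size Used by" header row.
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        users = fields[3].split(",") if len(fields) > 3 else []
        modules[fields[0]] = [u for u in users if u]
    return modules


# Sample input copied from the lsmod output quoted above.
SAMPLE = """\
nvidia_drm 53248 1
nvidia_modeset 790528 5 nvidia_drm
nvidia 11911168 81 nvidia_modeset
drm_kms_helper 151552 1 nvidia_drm
drm 360448 4 drm_kms_helper,nvidia_drm
"""

mods = parse_lsmod(SAMPLE)
print("nouveau loaded:", "nouveau" in mods)
print("nvidia used by:", mods["nvidia"])
```

On a live system the sample string could be replaced with the output of subprocess.check_output(["lsmod"], text=True).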

Note that there is nothing in syslog or dmesg. Apparently the hard freeze is sudden and there is no opportunity to log anything.
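
Since nothing reaches syslog before the freeze, one way to capture the last known GPU state is a small heartbeat logger that forces every sample to disk. The following is a rough sketch, not something I have run on the affected machine: the function names, the log path, and the choice of fields are my own assumptions, and it relies on nvidia-smi being installed and supporting the --query-gpu fields shown.

```python
import os
import subprocess
import time

# nvidia-smi query flags; which fields a given driver reports may vary.
NVIDIA_SMI = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,clocks.sm,clocks.mem,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def read_gpu():
    """Return one line of GPU readings, or a marker if the query fails."""
    try:
        out = subprocess.check_output(NVIDIA_SMI, timeout=5)
        # Multiple GPUs produce multiple lines; join them onto one line.
        return out.decode().strip().replace("\n", " | ")
    except (OSError, subprocess.SubprocessError) as exc:
        return "query-failed: %s" % type(exc).__name__


def heartbeat(path, samples, interval=1.0, sampler=read_gpu):
    """Append `samples` timestamped readings to `path`, syncing each one."""
    with open(path, "a") as log:
        for i in range(samples):
            log.write("%.3f,%s\n" % (time.time(), sampler()))
            log.flush()             # push Python's buffer to the kernel
            os.fsync(log.fileno())  # ask the kernel to commit it to disk
            if i + 1 < samples:
                time.sleep(interval)
```

Running something like heartbeat("/var/tmp/gpu-heartbeat.log", samples=600) in a second terminal before starting the job should leave the final pre-freeze reading as the last line of the file after a reboot, although fsync cannot fully guarantee that on every filesystem or drive cache.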

  • 2 - Reproduce

I can reproduce the lockup consistently by using cuDNN. I have the CUDA toolkit installed along with TensorFlow. If I attempt to run a test on TensorFlow, the entire machine locks up after a minute or two. I am running the TensorFlow test ./tensorflow/models/image/mnist.

I have an NVIDIA monitor onscreen; the temperature never exceeds 53 °C. The GPU clock and memory clock also stay well below full throttle.

This occurs with both driver versions 367.57 and 375.20, and kernels from 3.19.x to 4.4.0-47.

  • 3 - Is it just when the card is being used to its fullest capability?

I can run the Unigine_Heaven-4.0 benchmark on the graphics card and after 30 minutes the machine has still not locked up. The clocks and temperature are quite a bit higher than when running the cuDNN job.

  • 4 - Hardware or Linux drivers? What about under Windows 10?

Because I wanted to ensure that there were no hardware issues, I performed the following test:

  • I disconnected my machine’s drives (Linux OS)
  • I attached a new hard drive
  • I installed Windows 10
  • I installed the latest versions of the NVIDIA drivers, Cuda Toolkit, cuDNN, and tensorflow
  • I ran the same tensorflow sample 3 consecutive times (model/images/mnist)

The machine completed the tests all 3 times without a hitch. I have never been able to complete the test on Linux.

  • 5 - Problem with specific card or not?

At my own expense, I purchased the following card:

ZOTAC GeForce GTX 1050 Ti OC Edition 4GB GDDR5 128-bit DL-DVI Graphic Card (ZT-P10510B-10L)

Running drivers 375.20, I was able to reproduce the problem. The training never completed and I had a hard system freeze.
nvidia-bug-report.log.gz (118 KB)

I’d love to get some sort of acknowledgement that someone at least saw this and took note. It shouldn’t be too hard to duplicate.

It’s unfortunate, but I had to return one of the cards and replace it with an AMD. I can live without doing Cuda immediately, but I can’t live with being unable to reliably join business meetings on Hangouts without my computer freezing.

Happens every time and I end up running to get my laptop. Really unacceptable. Would really like some response on this thread.

You started your thread just before the weekend so hopefully a moderator will find it soon.

In the meantime I trust that you installed intel-microcode along with Ubuntu 14.04.5 LTS?

Is your Supermicro X7-DWA-N’s BIOS up-to-date?

Is the RAM which you are using qualified for use with that motherboard?

Have you tested the RAM with Memtest86+ 5.01?

Have you tried running that cuDNN job after temporarily removing any non-essential expansion cards, peripherals and drives?

That your rig freezes shortly after joining a Google Hangouts conference suggests possibly bad Ethernet hardware or improper settings or a conflict with something else in the machine.

Have you tried clearing the motherboard’s CMOS after unplugging the AC and then tapping the PC case’s power button to flatten the voltages throughout the system?

I appreciate the reply, but I suspect you haven’t read my full post and are basically cutting and pasting standard elementary procedures.

This machine is 6 years old - a dual-socket server motherboard that has been running nothing but NVIDIA GPUs since birth without a problem, with all Linux software and NVIDIA drivers updated regularly. The only change is the graphics card, to the 10 series. In fact, putting an old GTS450 back in works fine (but then I can’t use my monitor’s full resolution).

Further, as stated above, I installed Windows to an external hard drive and ran with the Windows drivers with no problems whatsoever, including Hangouts as well as some pretty serious CUDA jobs.

Yes, the BIOS is up to date, such as it is. Clearing the CMOS implies an IRQ or other similar issue, which would affect drivers under any OS.

Yes, just to be sure I ran the Linux memtest, even though I run huge in-memory jobs all day long that work perfectly, and this rig uses ECC memory, since it’s a server motherboard.

No, it’s not Google Hangouts - in terms of network activity, the few MB/s that is used by Hangouts is a fart in a hurricane. I am a software engineer working on big data projects and I transfer hundreds of GB of data all day long. My network hardware is just fine for all other activities. It’s the video causing the problem.

One additional test: as well as installing Windows separately and testing (no errors), I also did a brand new clean install of Ubuntu 14.04 to ensure there was no cruft, and I was able to duplicate the issues.

I did read your OP and I’ve no need to cut & paste elementary trouble-shooting procedures since they form part of what I’ve learned about PCs via trial and error over the past ten years or so. As comparatively modest as my accrued computer experience currently is, it has served me well in that my researched amalgam of hardware and software has resulted in a rig which runs trouble free in all regards.

The reason why I asked of you the questions I did is to further flesh out the circumstances of your trouble-shooting check list for the benefit of those who lack an intimate knowledge of what they entail. Thank you for having done so.

Re clearing the CMOS:

This is one of the process-of-elimination tacks which IME has on occasion remedied a vexing and aberrant behavioral issue with some motherboard / expansion card combos. Though I lack the technical understanding of why the following might be so, I suspect that nVidia’s Windows and GNU/Linux drivers may respond differently to any corruption which might be present in the CMOS memory. If not, then thoroughly clearing the CMOS will have at least eliminated it as a potential contributor to the problem at hand.

BTW, given the professional nature of the motherboard you are using and its vintage, is there a reason why you haven’t gone with an appropriate Kepler-based Quadro graphics card or another 3440 x 1440-capable GPU based upon the consensus of other computer professionals?

For example while most of the Quadro line resides in a heart-stopping price range, locally the VCQK1200DP-PB is only $56 more than the GEFORCE GTX 1060 ARMOR 3G OCV1:

$ 365.99
SONNAM COMPUTERS - PNY Quadro K1200 Graphic Card - 4 GB GDDR5 - PCI Express 2.0 x16 - Low-profile - Single Slot Space Required - 128 bit Bus Width - 4096 x 2160 - Fan Cooler - OpenGL 4.5, DirectX 12, DirectCompute, OpenCL - 4 x Mini DisplayPort - 4 x Monitors Supported - VCQK1200DP-PB
http://estore.sonnam.com/ShopItem.aspx?SessionCode=CE854AEC9F314CA28353C57E5B10D774&CatCono=136337&ProductNo=5088566&Toc=136337:770291^5^27|770291^5^29|136337^0^607|136337^0^766&PageID=161566140

$ 309.99
SONNAM COMPUTERS - MSI ARMOR GEFORCE GTX 1060 ARMOR 3G OCV1 GeForce GTX 1060 Graphic Card - 1.54 GHz Core - 1.76 GHz Boost Clock - 3 GB GDDR5 - PCI Express 3.0 x16 - 192 bit Bus Width - Fan Cooler - DirectX 12, OpenGL 4.5 - 2 x DisplayPort - 2 x HDMI - 1 x Total Number of DVI (1 x DVI-D) - Dual Link DVI Supported - 4 x Monitors Supported - GEFORCE GTX 1060 ARMOR 3G OCV1
http://estore.sonnam.com/ShopItem.aspx?SessionCode=CE854AEC9F314CA28353C57E5B10D774&CatCono=136337&ProductNo=6543688&Toc=136337:770291^5^27|770291^5^29|136337^0^607|136337^0^766&PageID=161566247

Perhaps I evaluated the workloads and tasks that I needed to perform and chose the best solution based on that.