Pascal Titan X/1080 leads to frozen machines (Ubuntu 14.04/16.04, 375.20, 375.39, 378.13)

We purchased some machines with 6 or 8 GPUs installed, and they have suffered from random freezes ever since. If you run a job that uses the GPUs, the machine is likely to freeze, especially when the job begins or ends. We tried a number of configurations (see below), but none of them resolves the issue.

Linux: Ubuntu 14.04, Ubuntu 16.04
Kernel: 3.13.0-68, 4.4.0-66
GPUs: Pascal Titan X, 1080
Nvidia drivers: 375.20, 375.39, 378.13, among a number of others

Error log: http://asia.csail.mit.edu/nvidia/error.log
Nvidia log: http://asia.csail.mit.edu/nvidia/nvidia-bug-report.log.gz

Generating the nvidia log led to a hang as well.

I’d recommend trying to install a better PSU.

If that doesn’t help, please try running your workloads from a text console (Ctrl + Alt + F2-F6) after running

sudo dmesg -n8

That will let you see any kernel messages immediately; perhaps you’re getting a kernel panic.
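
For example, a rough outline of the whole procedure (the last command is just a placeholder for whatever normally triggers the freeze):

# at the text console (Ctrl + Alt + F2), raise the console log level so the
# kernel prints all messages straight to the screen, then start the workload
sudo dmesg -n8
th my_training_job.lua    # placeholder: run your usual GPU job here

If the box panics, the messages should remain visible on that console even though the system is frozen.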

Out of curiosity, what do you have these cards installed in? Dell, Lenovo, Supermicro? I’ve experienced similar random hard freezes since last August with a Supermicro GPU server with six 1080s. Matrix products, convolutions, or even just running nvidia-smi can cause the system to lock up with nothing showing in the logs. Like you, I tried multiple driver versions and OSes (Ubuntu 16.04 LTS, openSUSE, CentOS), but all exhibit the same hard freeze behavior.
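
Even something as trivial as this polling loop has eventually locked the box up for us (purely illustrative; any nvidia-smi invocation seems able to trigger it):

# poll GPU utilization once a second; sooner or later the machine hard-freezes
while true; do
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
    sleep 1
done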

Yes, it’s Supermicro, too.

What’s the Supermicro SKU? I’m dealing with a 4028GR-TRT GPU server.

Hi jwu & nryant,
Your nvidia bug report is not complete. There is no output related to dmesg, lspci, or dmidecode, and no nvidia-related error in the log. What is the minimum number of GPUs that works on your system? Have you asked SuperMicro whether this number of GPUs is supported on the system? Are the Pascal Titan X and 1080 supported on this server?

Hey Sandip,

Thanks for the reply. We have older SuperMicro machines with 4 GPUs each, and they seem to work fine. These new machines have 8 GPU slots, so I would guess 8 GPUs should be supported? We’re contacting SuperMicro to see whether they support the Pascal Titan X and 1080.

In the meantime, do you have any idea how we can get error reports that contain information about the GPUs? We followed the instructions to obtain the error log and nvidia log above, but it seems they were not useful to you. The machine hung again while we were generating the nvidia log, so it should be a GPU-related error (though maybe not an nvidia-related error).

Jiajun

I’m in the same boat. Recently at work we’ve ordered the following machine:

Supermicro SYS-4028GR-TRT2 with latest beta BIOS 2.0b from 04/19/2017
2 x Xeon E5-2640 v4 with 0xb000021 microcode
256 GB ECC DDR4 (tested with memtest86 OK)
8 x GeForce GTX 1080 Ti
Kernel: 4.9.16

We’ve tried both Arch Linux and Gentoo (currently in use), and under TensorFlow workloads on the 381.09 driver the machine freezes at least once every 24 hours. We are now testing with 378.13.

I’ll think about how to prepare nVidia’s error report, as the machine freezes completely; only with extra kernel debugging options do some stack traces from nvidia_uvm show up in the syslog. Without that, the only piece of information is what appears on the console / remote BMC screen:

Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Shutting down cpus with NMI
Kernel Offset: disabled
Rebooting in 30 seconds…
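
If anyone wants to reproduce the debug setup, this is a sketch of the kind of runtime knobs involved (illustrative values only, not exactly what we set; we also watch the console over the BMC):

# keep all kernel messages on the console and keep the lockup detectors armed
sudo sysctl kernel.printk="7 4 1 7"
sudo sysctl kernel.nmi_watchdog=1
sudo sysctl kernel.softlockup_panic=1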

I hope this is due to driver bugs, as the hardware is very recent, and that it can be ironed out relatively soon.

Hello,
we have the same problem. Did anyone find a solution?

Elke

Hi all, what error did you see in the log? Please share the issue reproduction steps in detail. Please attach an nvidia bug report as soon as the issue hits; if that is not possible, generate it after a reboot. Enable kernel dump to get a crash dump and core. Try with the minimum number of GPUs first to isolate whether it is a GPU hardware issue. You can also start a remote ssh session, or check the logs remotely, to see whether it is an OS, kernel, or driver issue. Please explain the problem in such a way that we can replicate the same issue here to investigate, and share the application you are using to trigger it.
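
For example, on Ubuntu one rough way to enable kernel crash dumps is via the stock kdump tooling (package and tool names below assume Ubuntu; adjust for your distro):

sudo apt-get install linux-crashdump    # pulls in kdump-tools and adds a crashkernel= boot parameter
sudo reboot                             # reboot so the crash-kernel memory is reserved
kdump-config show                       # afterwards, check that kdump reports it is ready

After the next panic, a vmcore should appear under /var/crash, which can then be attached alongside the nvidia bug report.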

Hi Sandip,

I’ve tried very hard to provide every detail and to get logs. The setup of the machine is listed above. The exact command that causes the error is uncertain, but most likely it happens when a job starts or finishes on a GPU. For example, running “require ‘cutorch’” in Torch, which initiates a job on a GPU, can freeze the machine.
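
In case it helps, this is essentially the smallest trigger we have (the script name is arbitrary):

# simply initializing CUDA through Torch is sometimes enough to freeze the machine
echo "require 'cutorch'" > freeze_test.lua
th freeze_test.lua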

I’ve attached links to the logs in my first post. The command listed to obtain the nvidia bug report itself leads to a crash, so I don’t know what else we can do to obtain a report that contains useful information. I’m not familiar with OS-level operations; could you point me to the commands I should run to obtain more information?

Hi Sandip,

After a crash this morning, we luckily obtained a complete nvidia-bug-report. I’m attaching it below. Would this be helpful?
Link: http://asia.csail.mit.edu/nvidia/nvidia-bug-report-1.log.gz

Nope. There is no nvidia-related error in the log.