Pascal Titan X/1080 leads to frozen machines (Ubuntu 14.04/16.04, 375.20, 375.39, 378.13)

We purchased some machines with 6 or 8 GPUs installed, and they have suffered from random freezes ever since. If you run a job that uses the GPUs, the machine is likely to freeze, especially when the job begins or ends. We tried a number of configurations (see below), but none of them resolves the issue.

Linux: Ubuntu 14.04, Ubuntu 16.04
Kernel: 3.13.0-68, 4.4.0-66
GPUs: Pascal Titan X, 1080
Nvidia drivers: 375.20, 375.39, 378.13, among a number of others

Error log: http://asia.csail.mit.edu/nvidia/error.log
Nvidia log: http://asia.csail.mit.edu/nvidia/nvidia-bug-report.log.gz

Generating the nvidia log led to a hang as well.

I’d recommend trying to install a better PSU.

If that doesn’t help, please try running your workloads from a text console (Ctrl + Alt + F2-F6) after running

sudo dmesg -n8

That will let you see any kernel messages immediately; perhaps you’re getting a kernel panic.
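
For example, a rough outline of the whole procedure (the last command is just a placeholder for whatever normally triggers the freeze):

# at the text console (Ctrl + Alt + F2), raise the console log level so the
# kernel prints all messages straight to the screen, then start the workload
sudo dmesg -n8
th my_training_job.lua    # placeholder: run your usual GPU job here

If the box panics, the messages should remain visible on that console even though the system is frozen.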

Out of curiosity, what do you have these cards installed in? Dell, Lenovo, Supermicro? I’ve experienced similar random hard freezes since last August with a Supermicro GPU server with six 1080s. Matrix products, convolutions, or even just running nvidia-smi can cause the system to lock up with nothing showing in the logs. Like you, I tried multiple driver versions and OSes (Ubuntu 16.04 LTS, openSUSE, CentOS), but all exhibit the same hard freeze behavior.
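
Even something as trivial as this polling loop has eventually locked the box up for us (purely illustrative; any nvidia-smi invocation seems able to trigger it):

# poll GPU utilization once a second; sooner or later the machine hard-freezes
while true; do
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
    sleep 1
done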

Yes, it’s Supermicro, too.

What’s the Supermicro SKU? I’m dealing with a 4028GR-TRT GPU server.

Hi jwu & nryant,
Your nvidia bug report is not complete. There is no output related to dmesg, lspci, or dmidecode, and no nvidia-related error in the log. What is the minimum number of GPUs that works on your system? Have you asked SuperMicro whether this number of GPUs is supported on the system? Are the Pascal Titan X and 1080 supported on this server?

Hey Sandip,

Thanks for the reply. We have older SuperMicro machines with 4 GPUs each, and they seem to work fine. These new machines have 8 GPU slots, so I would guess 8 GPUs should be supported? We’re contacting SuperMicro to see whether they support the Pascal Titan X and 1080.

In the meantime, do you have any idea how we can get error reports that contain information about the GPUs? We followed the instructions to obtain the error log and nvidia log above, but it seems they were not useful to you. The machine hung again while we were generating the nvidia log, so it should be a GPU-related error (though maybe not an nvidia-related error).

Jiajun

I’m in the same boat. Recently at work we’ve ordered the following machine:

Supermicro SYS-4028GR-TRT2 with latest beta BIOS 2.0b from 04/19/2017
2 x Xeon E5-2640 v4 with 0xb000021 microcode
256 GB ECC DDR4 (tested with memtest86 OK)
8 x GeForce GTX 1080 Ti
Kernel: 4.9.16

We’ve tried both Arch Linux and Gentoo (currently in use), and under TensorFlow workloads on the 381.09 driver the machine freezes at least once every 24 hours. We are now testing with 378.13.

I’ll think about how to prepare nVidia’s error report, as the machine freezes completely; only with extra kernel debugging options do some stack traces from nvidia_uvm show up in the syslog. Without that, the only piece of information is what appears on the console / remote BMC screen:

Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Shutting down cpus with NMI
Kernel Offset: disabled
Rebooting in 30 seconds…
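
If anyone wants to reproduce the debug setup, this is a sketch of the kind of runtime knobs involved (illustrative values only, not exactly what we set; we also watch the console over the BMC):

# keep all kernel messages on the console and keep the lockup detectors armed
sudo sysctl kernel.printk="7 4 1 7"
sudo sysctl kernel.nmi_watchdog=1
sudo sysctl kernel.softlockup_panic=1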

I hope this is due to driver bugs, as the hardware is very recent, and that it can be ironed out relatively soon.

Hello,
we have the same problem. Did anyone find a solution?

Elke

Hi all, what error did you see in the log? Please share the issue reproduction steps in detail. Please attach an nvidia bug report as soon as the issue hits; if that is not possible, generate it after a reboot. Enable kernel dump to get a crash dump and core. Try with the minimum number of GPUs first to isolate whether it is a GPU hardware issue. You can also start a remote ssh session, or check the logs remotely, to see whether it is an OS, kernel, or driver issue. Please explain the problem in such a way that we can replicate the same issue here to investigate, and share the application you are using to trigger it.
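
For example, on Ubuntu one rough way to enable kernel crash dumps is via the stock kdump tooling (package and tool names below assume Ubuntu; adjust for your distro):

sudo apt-get install linux-crashdump    # pulls in kdump-tools and adds a crashkernel= boot parameter
sudo reboot                             # reboot so the crash-kernel memory is reserved
kdump-config show                       # afterwards, check that kdump reports it is ready

After the next panic, a vmcore should appear under /var/crash, which can then be attached alongside the nvidia bug report.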

Hi Sandip,

I’ve tried very hard to provide every detail and to get logs. The setup of the machine is listed above. The exact command that causes the error is uncertain, but most likely it happens when a job starts or finishes on a GPU. For example, running “require ‘cutorch’” in Torch, which initiates a job on a GPU, can freeze the machine.
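
In case it helps, this is essentially the smallest trigger we have (the script name is arbitrary):

# simply initializing CUDA through Torch is sometimes enough to freeze the machine
echo "require 'cutorch'" > freeze_test.lua
th freeze_test.lua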

I’ve attached links to the logs in my first post. The command listed to obtain the nvidia bug report itself leads to a crash, so I don’t know what else we can do to obtain a report that contains useful information. I’m not familiar with OS-level operations; could you point me to the commands I should run to obtain more information?

Hi Sandip,

After a crash this morning, we luckily obtained a complete nvidia-bug-report. I’m attaching it below. Would this be helpful?
Link: http://asia.csail.mit.edu/nvidia/nvidia-bug-report-1.log.gz

Nope. There is no nvidia-related error in the log.