1080 Ti always dies shortly after starting training, CUDA 11.5, driver 495.29.05

Dear Nvidia,

We keep running into the following problem: when we start training a neural network on a 1080 Ti, training runs for some time, and then the GPU suddenly dies.

The last time it happened, the observed GPU temperature was 57 °C just a few seconds before the crash.
We checked the nvidia-bug-report (attached: nvidia-bug-report.log.gz, 539.6 KB) and found the following errors:

/var/log/kern.log:
Jan 30 16:09:28 cluster63 kernel: [168701.407437] NVRM: GPU at PCI:0000:62:00: GPU-0788ba91-4dac-e984-c466-ef683ae29dc0
Jan 30 16:09:28 cluster63 kernel: [168701.407443] NVRM: Xid (PCI:0000:62:00): 79, pid=0, GPU has fallen off the bus.
Jan 30 16:09:28 cluster63 kernel: [168701.407448] NVRM: GPU 0000:62:00.0: GPU has fallen off the bus.
Jan 30 16:09:28 cluster63 kernel: [168701.407490] NVRM: GPU 0000:62:00.0: GPU serial number is .
Jan 30 16:09:28 cluster63 kernel: [168701.407517] NVRM: A GPU crash dump has been created. If possible, please run
Jan 30 16:09:28 cluster63 kernel: [168701.407517] NVRM: nvidia-bug-report.sh as root to collect this data before
Jan 30 16:09:28 cluster63 kernel: [168701.407517] NVRM: the NVIDIA kernel module is unloaded.
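
In case it is useful, this is roughly how such Xid entries can be pulled out of the kernel log (a minimal sketch; the log path and filters are just what we used on our machine):

# NVRM / Xid messages from the kernel log around the crash
grep -E "NVRM|Xid" /var/log/kern.log | tail -n 20

# Or directly from the kernel ring buffer, if it has not been rotated yet
sudo dmesg --ctime | grep -iE "NVRM|Xid"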

However, it seems impossible that the GPU has actually fallen off the bus: it was reconnected a day before, and since then no one has entered the computer room (you may find a similar problem dated Jan 27 in the logs; that time it was indeed a connection problem). Could the reported GPU crash be what caused the "GPU has fallen off the bus" error?

Moreover, we had a similar problem with an RTX 3090, which was solved by updating the drivers and CUDA. This makes me think the problem is with the drivers rather than with the hardware.
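
In case it matters, this is how we check which driver and CUDA toolkit versions are currently active on the node (a quick sketch; the nvcc call assumes the toolkit is on PATH):

# Kernel module (driver) version
cat /proc/driver/nvidia/version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# CUDA toolkit version
nvcc --version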

Could you please look into the problem?
Are the drivers we are using incompatible with the 1080 Ti?

Thank you in advance,
Ivan.

Hello Ivan,

First of all, the newest drivers are still compatible with a 1080 Ti, no worries.

But I can see that there is still a remnant of the 470.57 version of the driver:
[ 17.891] (II) NVIDIA GLX Module 470.57.02 Tue Jul 13 16:10:58 UTC 2021

So the first step should be to make sure you have a clean driver installation: purge any existing NVIDIA drivers from the system and do a fresh re-install. You can find details on how to do that in the README that comes with the Linux driver.
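
On an Ubuntu system the clean-up typically looks something like the sketch below; treat it as a rough outline rather than exact commands, and adjust the installer file name to the version you actually downloaded:

# Remove all packaged NVIDIA driver components, then clean up leftovers
sudo apt-get purge 'nvidia*' 'libnvidia*'
sudo apt-get autoremove

# If a .run installer was ever used, remove that installation as well (if present)
sudo nvidia-uninstall

# Reboot, then install a single, current driver version, e.g. via the .run package
sudo sh ./NVIDIA-Linux-x86_64-495.29.05.run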

Can you share what OS you are using for your setup?

A few other things to look out for, based on the information I found in the log:

  • It seems you have two 3090s and one 1080 Ti on your server board. Check whether there is sufficient cooling and a sufficient power supply for the system.
    57 °C is not problematic as such, but the Xid 79 error most often indicates either a power or a temperature issue. A single 3090 already has a PSU recommendation of at least 650 W in a desktop setting; on an EPYC system with a second 3090 and a 1080 Ti this should be at least 1500 W, if not more. In addition, neither the 3090 nor the 1080 Ti is specified for server usage.
  • Check if there is a new BIOS for the board
  • Are you using Secure Boot? If so, the NVIDIA kernel modules need to be signed with an enrolled key to load correctly. If in doubt, you can disable Secure Boot.
  • Make sure you are using nvidia-persistenced to ensure driver persistence across CUDA job runs (see the sketch after this list for these checks).
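
For the power/temperature, Secure Boot, and persistence points, here is a quick sketch of the corresponding checks on the command line (service and tool names assume a standard driver installation):

# Check whether Secure Boot is currently enabled
mokutil --sb-state

# Enable the persistence daemon and verify persistence mode is on
sudo systemctl enable --now nvidia-persistenced
nvidia-smi -q | grep -i "persistence mode"

# Watch power draw and temperature while a training job is running
nvidia-smi -q -d POWER,TEMPERATURE
nvidia-smi dmon -s p    # one power/temperature sample per second per GPU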

If this does not solve your issues, I suggest searching our Linux category; there are a lot of prior solutions to very similar issues.

I hope this helps!

Dear Markus,

Thank you very much for your help!
We will follow your guidance to ensure that the system is properly configured.

Regarding your question: we are running Ubuntu 20.04.3 LTS, kernel 5.4.0-96-generic, x86-64.

Best regards,
Ivan.