1080 Ti always dies shortly after starting training, CUDA 11.5, driver 495.29.05

Dear Nvidia,

We keep running into the following problem: when we start training a neural network on a 1080 Ti, training runs for some time, and then the GPU suddenly dies.

The last time it happened, the observed GPU temperature was 57 °C just a few seconds before the crash.
We checked the nvidia-bug-report (attached: nvidia-bug-report.log.gz, 539.6 KB) and found the following errors:

/var/log/kern.log:
Jan 30 16:09:28 cluster63 kernel: [168701.407437] NVRM: GPU at PCI:0000:62:00: GPU-0788ba91-4dac-e984-c466-ef683ae29dc0
Jan 30 16:09:28 cluster63 kernel: [168701.407443] NVRM: Xid (PCI:0000:62:00): 79, pid=0, GPU has fallen off the bus.
Jan 30 16:09:28 cluster63 kernel: [168701.407448] NVRM: GPU 0000:62:00.0: GPU has fallen off the bus.
Jan 30 16:09:28 cluster63 kernel: [168701.407490] NVRM: GPU 0000:62:00.0: GPU serial number is .
Jan 30 16:09:28 cluster63 kernel: [168701.407517] NVRM: A GPU crash dump has been created. If possible, please run
Jan 30 16:09:28 cluster63 kernel: [168701.407517] NVRM: nvidia-bug-report.sh as root to collect this data before
Jan 30 16:09:28 cluster63 kernel: [168701.407517] NVRM: the NVIDIA kernel module is unloaded.
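
In case it is useful, this is roughly how such Xid entries can be pulled out of the kernel log (a minimal sketch; the log path and filters are just what we used on our machine):

# NVRM / Xid messages from the kernel log around the crash
grep -E "NVRM|Xid" /var/log/kern.log | tail -n 20

# Or directly from the kernel ring buffer, if it has not been rotated yet
sudo dmesg --ctime | grep -iE "NVRM|Xid"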

However, it seems impossible that the GPU has actually fallen off the bus: it was reconnected a day before, and since then no one has entered the computer room (you may find a similar problem dated Jan 27 in the logs; that time it was indeed a connection problem). Could the reported GPU crash be what caused the "GPU has fallen off the bus" error?

Moreover, we had a similar problem with an RTX 3090, which was solved by updating the drivers and CUDA. This makes me think the problem is with the drivers rather than with the hardware.
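
In case it matters, this is how we check which driver and CUDA toolkit versions are currently active on the node (a quick sketch; the nvcc call assumes the toolkit is on PATH):

# Kernel module (driver) version
cat /proc/driver/nvidia/version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# CUDA toolkit version
nvcc --version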

Could you please look into the problem?
Are the drivers we are using incompatible with the 1080 Ti?

Thank you in advance,
Ivan.

Hello Ivan,

First of all, the newest drivers are still compatible with a 1080 Ti, no worries.

But I can see that there is still a remnant of the 470.57 version of the driver:
[ 17.891] (II) NVIDIA GLX Module 470.57.02 Tue Jul 13 16:10:58 UTC 2021

So the first step should be to make sure you have a clean driver installation: purge any existing NVIDIA drivers from the system and do a fresh re-install. You can find details on how to do that in the README that comes with the Linux driver.
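
On an Ubuntu system the clean-up typically looks something like the sketch below; treat it as a rough outline rather than exact commands, and adjust the installer file name to the version you actually downloaded:

# Remove all packaged NVIDIA driver components, then clean up leftovers
sudo apt-get purge 'nvidia*' 'libnvidia*'
sudo apt-get autoremove

# If a .run installer was ever used, remove that installation as well (if present)
sudo nvidia-uninstall

# Reboot, then install a single, current driver version, e.g. via the .run package
sudo sh ./NVIDIA-Linux-x86_64-495.29.05.run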

Can you share what OS you are using for your setup?

A few other things to look out for, based on the information I found in the log:

  • It seems you have two 3090s and one 1080 Ti on your server board. Check whether there is sufficient cooling and a sufficient power supply for the system.
    57 °C is not problematic as such, but the Xid 79 error most often indicates either a power or a temperature issue. A single 3090 already has a PSU recommendation of at least 650 W in a desktop setting; on an EPYC system with a second 3090 and a 1080 Ti this should be at least 1500 W, if not more. In addition, neither the 3090 nor the 1080 Ti is specified for server usage.
  • Check if there is a new BIOS for the board
  • Are you using Secure Boot? If so, the NVIDIA kernel modules need to be signed with an enrolled key to load correctly. If in doubt, you can disable Secure Boot.
  • Make sure you are using nvidia-persistenced to ensure driver persistence across CUDA job runs (see the sketch after this list for these checks).
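
For the power/temperature, Secure Boot, and persistence points, here is a quick sketch of the corresponding checks on the command line (service and tool names assume a standard driver installation):

# Check whether Secure Boot is currently enabled
mokutil --sb-state

# Enable the persistence daemon and verify persistence mode is on
sudo systemctl enable --now nvidia-persistenced
nvidia-smi -q | grep -i "persistence mode"

# Watch power draw and temperature while a training job is running
nvidia-smi -q -d POWER,TEMPERATURE
nvidia-smi dmon -s p    # one power/temperature sample per second per GPU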

If this does not solve your issues, I suggest searching our Linux category; there are a lot of prior solutions to very similar issues.

I hope this helps!

Dear Markus,

Thank you very much for your help!
We will follow your guidance to ensure that the system is properly configured.

Regarding your question: we are running Ubuntu 20.04.3 LTS, kernel 5.4.0-96-generic, x86-64.

Best regards,
Ivan.