GTX 1080 Ti falling off bus

Hi,

I’m using an EVGA 1080 Ti Hybrid in an Ubuntu 18.04.1 LTS system with kernel version 4.15.0-32-generic.

The NVIDIA driver is at version 396.44, CUDA is at version 9.1, and cuDNN is at version 7.1.4.

Whenever I run Keras/TensorFlow programs, execution halts randomly (at different points in the computation). The GPU reports loads of around 30% and temperatures around 33°C while the programs are running. After a program halts, nvidia-smi gives the following message:

Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU

dmesg says:

[  218.621720] NVRM: GPU at 00000000:01:00.0 has fallen off the bus.
[  218.621721] NVRM: GPU is on Board .
[  218.621730] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.

The nvidia-bug-report can be found here
https://www.dropbox.com/s/enkkw3mm9daqevb/nvidia-bug-report.log.gz?dl=0

The system is otherwise stable and shows no issues. The PSU provides ample power headroom, and the internal cable connectors are secure.

The issue arises with a monitor connected as well as without one. In either case I’m accessing the machine over the network with TurboVNC and VirtualGL. The VNC connection remains stable; with a monitor attached, however, logging into the system locally is no longer possible after the GPU has fallen off the bus.
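For reference, the workloads are nothing exotic. A stripped-down sketch of the kind of Keras script that triggers it (placeholder model and random data, not my actual program) would look like this:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Random stand-in data; the real programs train on actual datasets.
x = np.random.rand(60000, 784).astype('float32')
y = np.random.randint(0, 10, size=(60000,))

model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dense(512, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# The GPU drops off the bus at random points during fit(), never at a fixed step.
model.fit(x, y, batch_size=128, epochs=50)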

I’ve found a couple of reports of this problem with the same GPU, none of which provide a solution.

I’d be grateful for any help.

Thanks,
Moritz

Reseat, change slot. Check card in another system.

…if using risers, throw away, buy better ones.

Thanks for the suggestions. I’m now trying a Titan Z in the same system instead of the 1080 Ti and will report back whether I see the same issues.

Still open for other ideas as well.

I have the same issue. For me it seems to appear when I train on multiple GPUs (using different processes), or at least it then appears fairly quickly (after a few seconds). I haven’t tried running a single GPU for a long time, so I don’t know if it would crash if I did that.

My kernel version is the same, but I’m running 16.04 and a different NVIDIA driver.

evenmath, your issue sounds rather like an insufficient power supply.

evenmath, I’m using a single-GPU setup; indeed, this might be a PSU-related issue.

With the Titan Z it appears to be working fine. It ran all night without any issues. I guess that may imply that either the 1080 Ti is somehow broken, or there are driver problems with that specific board. Any ideas?

kampelmuehler6lkdd, generix.

I actually suspected that it might be a power issue, since it was always the same two GPUs that failed (but only one at a time).

So tonight I tried running on only 2 GPUs instead of 3 (which was what I had running before and which failed immediately). The 2 GPUs I ran on tonight were connected to 2 different power supplies. The GPU that finally failed, after a couple of hours, was one of the GPUs that had failed yesterday, but it was the only device connected to a 1000W power supply.

It is the same power supply that I used for the 2 GPUs that failed earlier, so it might be an issue with that power supply, but it seems weird since it’s utilizing at most 25% of the PSU’s capacity.

I would try updating the kernel if I could, but I really don’t have time to test things until next week.

Just to save you time: the XID 79 you’re getting is almost always connected with hardware issues. The only software/kernel-related issue I know of that leads to an XID 79 is resume from suspend, which doesn’t apply here. So upgrading/downgrading the driver or kernel won’t get you anywhere except by sheer coincidence.

  • Check if the PSU has an ‘eco’ switch.
  • Check the PSU’s documentation regarding power/temperature (a quick sketch for logging actual draw follows below).
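If you want to see whether a power or temperature spike precedes the drop-off, it can help to log both next to the training job. A minimal sketch using the NVML Python bindings (this assumes the nvidia-ml-py package, which provides the pynvml module; pick the index of the failing card):

import time
import pynvml

# Initialise NVML and grab a handle to the GPU we want to watch.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # same index as shown by nvidia-smi

try:
    while True:
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # reported in mW
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu     # percent
        print('%s  %6.1f W  %3d C  %3d %%' % (time.strftime('%H:%M:%S'), power_w, temp_c, util))
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()

The last few lines printed before the card drops off the bus will tell you whether it happened at full board power or while the card was basically idle.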

I strongly suspect the 1080 Ti might be broken, since on the exact same port with the same power lines the Titan Z doesn’t show any issues when stress testing, whereas the 1080 Ti fails after a couple of seconds (with temperatures staying in the low 30s, though).

Thanks for the info. I have now “stress tested” my GPUs in the sense that I have put high-load neural networks on all of them. I managed to make all three of the ones connected to the 1000W PSU fail. It is a Corsair RM1000i, and I could not find anything special about the issue, nor anything about an “eco switch”, so I will order a new one to see if that works.

Thanks for help!

I’ll let you know when the new one has arrived!

Sounds about right. I’m using gpu-burn (https://github.com/wilicc/gpu-burn) to stress test on Linux; it’s quite a convenient choice!
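If you’d rather stay in Python, a crude matmul loop gives a similar sustained load; this is just a rough sketch (it doesn’t do gpu-burn’s error checking, and it assumes a TensorFlow 1.x install matching the CUDA 9.1 setup above):

import tensorflow as tf

N = 8192  # matrix size; raise it until the card sits near 100% utilization

# Build a graph that multiplies two large random matrices on the GPU.
with tf.device('/gpu:0'):
    a = tf.random_normal([N, N])
    b = tf.random_normal([N, N])
    checksum = tf.reduce_sum(tf.matmul(a, b))

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    step = 0
    while True:  # run until it crashes or you hit Ctrl+C
        val = sess.run(checksum)
        step += 1
        if step % 10 == 0:
            print('step %d, checksum %.3e' % (step, val))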

Haha yes, that crashed my whole computer, so I guess it did its job! We’ll see if it works better with the new PSU.

An update:

My setup was 4 1080 Tis plus CPU and motherboard.
I had a 650W PSU connected to 1 GPU + the motherboard.
I had a 1000W PSU connected to 3 GPUs.

That is what had been working for a long time but suddenly started failing (although it had been a few months since I put high load on all of them at once).

It seemed that it was always the 3 GPUs connected to the 1000W PSU that failed.

Based on the suggestions I tried another PSU, a 750W unit. This time I ran the 650W with 1 GPU + motherboard and the 750W with 2 GPUs (3 GPUs in total). It still failed.

The 1000W PSU connected to the motherboard + 2 GPUs, with the 750W PSU connected to 2 GPUs, also failed.

This worked: the 1000W connected to the motherboard + 2 GPUs, the 650W to 1 GPU, and the 750W to 1 GPU.

Super weird, or I’m very unlucky that both the 650W and the 750W are a little broken… since I don’t think it should fail if I run the 750W with only 2 GPUs.

By the way, most of the time when it failed my whole computer crashed and rebooted, but sometimes I could see which GPU it was that failed, e.g. with nvidia-smi.

I also still can’t make any sense of this.

Both a Titan X and a Titan Z worked just fine in the same setup where the 1080 Ti fails. If I swap the 1080 Ti to another PCIe slot, it seems to be stable.

The same 1080 Ti is also causing random crashes in another machine.

I’ll see how long it stays stable in the other PCIe slot. So far it hasn’t shown any signs of weakness after 17h @ 100% load.

evenmatth, you’ll have to take into account that up to 75W per GPU is drawn over the PCIe bus, i.e. from the motherboard’s PSU. The peak power draw of a standard Ti is about 300W (bus + 6-pin + 8-pin); OC models take even more.
mokkaa, from my observations, Pascal-type GPUs are much more sensitive to flaws in the PCIe connection than previous models. Did you check whether the slot the card is working in now is an x16 slot?
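To spell out the split with stock numbers (a rough back-of-the-envelope using the spec values of 75W per slot, 75W per 6-pin and 150W per 8-pin, ignoring transients and OC limits):

# Where a stock 1080 Ti's ~300 W peak actually comes from.
slot_w, six_pin_w, eight_pin_w = 75, 75, 150

aux_per_card = six_pin_w + eight_pin_w    # 225 W over the cables of the aux PSU
slot_per_card = slot_w                    # 75 W through the slot, i.e. the motherboard PSU

cards_on_aux_psu = 3                      # e.g. the three cards cabled to the 1000 W unit
print(cards_on_aux_psu * aux_per_card)    # 675 W of cable load on that PSU
print(cards_on_aux_psu * slot_per_card)   # plus 225 W landing on the motherboard's PSU

So counting only the aux connectors understates the load on the PSU that feeds the motherboard.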

Yep, all the PCIe slots are x16 (it’s an ASUS X99-DELUXE MoBo)

I have a similar problem: https://devtalk.nvidia.com/default/topic/1030326/linux/gpu-fallen-off-the-bus-while-running-more-demanding-demos-or-games/

In my opinion it is some driver-related problem, because I don’t have this problem under Windows, even with the same games. I also don’t have any problems when rendering in Blender, nor with GPU mining (I tried Monero for an hour or so).

Okay.

Is there any program similar to gpu-burn for Windows? I have a dual boot, so I should be able to test there. I just didn’t check this due to the comment by generix: “Just to save you time: the XID 79 you’re getting is almost always connected with hardware issues”.

generix, how sure are you about that comment?

I’m very sure about that comment, but let me extend it a little. From the point of view of the GPU, the reasons are power failure, overheating, or PCIe failure, i.e. hardware. Of course, all of those can still be software related (PCIe = chipset/PCIe driver, overheating = failure of system fan control, power = ACPI issues, etc.), yet none of that is NVIDIA driver related; it’s system specific.