GTX 1080 Ti falling off bus

Hi,

I’m using an EVGA 1080 Ti Hybrid in an Ubuntu 18.04.1 LTS system with kernel version 4.15.0-32-generic.

The NVIDIA driver is at version 396.44, CUDA is at version 9.1, and cuDNN is at version 7.1.4.

Whenever I run Keras/TensorFlow programs, execution halts randomly (at different points in the computation). The GPU reports loads of around 30% and temperatures around 33°C while the programs are running. After a program halts, nvidia-smi gives the following message:

Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU

dmesg says:

[  218.621720] NVRM: GPU at 00000000:01:00.0 has fallen off the bus.
[  218.621721] NVRM: GPU is on Board .
[  218.621730] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.

The nvidia-bug-report can be found here
https://www.dropbox.com/s/enkkw3mm9daqevb/nvidia-bug-report.log.gz?dl=0

The system is otherwise stable and shows no issues. The PSU provides ample power headroom, and the internal cable connectors are secure.

The issue arises with a monitor connected as well as without one. In either case I’m accessing the machine over the network with TurboVNC and VirtualGL. The VNC connection remains stable; with a monitor attached, however, logging into the system locally is no longer possible after the GPU has fallen off the bus.
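For reference, the workloads are nothing exotic. A stripped-down sketch of the kind of Keras script that triggers it (placeholder model and random data, not my actual program) would look like this:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Random stand-in data; the real programs train on actual datasets.
x = np.random.rand(60000, 784).astype('float32')
y = np.random.randint(0, 10, size=(60000,))

model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dense(512, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# The GPU drops off the bus at random points during fit(), never at a fixed step.
model.fit(x, y, batch_size=128, epochs=50)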

I’ve found a couple of reports of this problem with the same GPU, none of which provide a solution.

I’d be grateful for any help.

Thanks,
Moritz

Reseat, change slot. Check card in another system.

…if using risers, throw away, buy better ones.

Thanks for the suggestions. I’m now trying a Titan Z in the same system instead of the 1080 Ti and will report back whether I see the same issues.

Still open for other ideas as well.

I have the same issue. For me it seems to appear when I train on multiple GPUs (using different processes), or at least it then appears fairly quickly (after a few seconds). I haven’t tried running a single GPU for a long time, so I don’t know if it would crash if I did that.

My kernel version is the same, but I’m running 16.04 and a different NVIDIA driver.

evenmath, your issue sounds rather like an insufficient power supply.

evenmath, I’m using a single-GPU setup; indeed, this might be a PSU-related issue.

With the Titan Z it appears to be working fine. It ran all night without any issues. I guess that may imply that either the 1080 Ti is somehow broken, or there are driver problems with that specific board. Any ideas?

kampelmuehler6lkdd, generix.

I actually suspected that it might be a power issue, since it was always the same two GPUs that failed (but only one at a time).

So tonight I tried running on only 2 GPUs instead of 3 (which was what I had running before and which failed immediately). The 2 GPUs I ran on tonight were connected to 2 different power supplies. The GPU that finally failed, after a couple of hours, was one of the GPUs that had failed yesterday, but it was the only device connected to a 1000W power supply.

It is the same power supply that I used for the 2 GPUs that failed earlier, so it might be an issue with that power supply, but it seems weird since it’s utilizing at most 25% of the PSU’s capacity.

I would try updating the kernel if I could, but I really don’t have time to test things until next week.

Just to save you time: the XID 79 you’re getting is almost always connected with hardware issues. The only software/kernel-related issue I know of that leads to an XID 79 is resume from suspend, which doesn’t apply here. So upgrading/downgrading the driver or kernel won’t get you anywhere except by sheer coincidence.

  • Check if the PSU has an ‘eco’ switch.
  • Check the PSU’s documentation regarding power/temperature (a quick sketch for logging actual draw follows below).
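If you want to see whether a power or temperature spike precedes the drop-off, it can help to log both next to the training job. A minimal sketch using the NVML Python bindings (this assumes the nvidia-ml-py package, which provides the pynvml module; pick the index of the failing card):

import time
import pynvml

# Initialise NVML and grab a handle to the GPU we want to watch.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # same index as shown by nvidia-smi

try:
    while True:
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # reported in mW
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu     # percent
        print('%s  %6.1f W  %3d C  %3d %%' % (time.strftime('%H:%M:%S'), power_w, temp_c, util))
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()

The last few lines printed before the card drops off the bus will tell you whether it happened at full board power or while the card was basically idle.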

I strongly suspect the 1080 Ti might be broken, since on the exact same port with the same power lines the Titan Z doesn’t show any issues when stress testing, whereas the 1080 Ti fails after a couple of seconds (with temperatures staying in the low 30s, though).

Thanks for the info. I have now “stress tested” my GPUs in the sense that I have put high-load neural networks on all of them. I managed to make all three of the ones connected to the 1000W PSU fail. It is a Corsair RM1000i, and I could not find anything special about the issue, nor anything about an “eco switch”, so I will order a new one to see if that works.

Thanks for help!

I’ll let you know when the new one has arrived!

Sounds about right. I’m using gpu-burn (https://github.com/wilicc/gpu-burn) to stress test on Linux; it’s quite a convenient choice!
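If you’d rather stay in Python, a crude matmul loop gives a similar sustained load; this is just a rough sketch (it doesn’t do gpu-burn’s error checking, and it assumes a TensorFlow 1.x install matching the CUDA 9.1 setup above):

import tensorflow as tf

N = 8192  # matrix size; raise it until the card sits near 100% utilization

# Build a graph that multiplies two large random matrices on the GPU.
with tf.device('/gpu:0'):
    a = tf.random_normal([N, N])
    b = tf.random_normal([N, N])
    checksum = tf.reduce_sum(tf.matmul(a, b))

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    step = 0
    while True:  # run until it crashes or you hit Ctrl+C
        val = sess.run(checksum)
        step += 1
        if step % 10 == 0:
            print('step %d, checksum %.3e' % (step, val))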

Haha yes, that crashed my whole computer, so I guess it did its job! We’ll see if it works better with the new PSU.

An update:

My setup was 4 1080 Tis plus CPU and motherboard.
I had a 650W PSU connected to 1 GPU + the motherboard.
I had a 1000W PSU connected to 3 GPUs.

That is what had been working for a long time but suddenly started failing (although it had been a few months since I put high load on all of them at once).

It seemed that it was always the 3 GPUs connected to the 1000W PSU that failed.

Based on the suggestions I tried another PSU, a 750W unit. This time I ran the 650W with 1 GPU + motherboard and the 750W with 2 GPUs (3 GPUs in total). It still failed.

The 1000W PSU connected to the motherboard + 2 GPUs, with the 750W PSU connected to 2 GPUs, also failed.

This worked: the 1000W connected to the motherboard + 2 GPUs, the 650W to 1 GPU, and the 750W to 1 GPU.

Super weird, or I’m very unlucky that both the 650W and the 750W are a little broken… since I don’t think it should fail if I run the 750W with only 2 GPUs.

By the way, most of the time when it failed my whole computer crashed and rebooted, but sometimes I could see which GPU it was that failed, e.g. with nvidia-smi.

I also still can’t make any sense of this.

Both a Titan X and a Titan Z worked just fine in the same setup where the 1080 Ti fails. If I swap the 1080 Ti to another PCIe slot, it seems to be stable.

The same 1080 Ti is also causing random crashes in another machine.

I’ll see how long it stays stable in the other PCIe slot. So far it hasn’t shown any signs of weakness after 17h @ 100% load.

evenmatth, you’ll have to take into account that up to 75W per GPU is drawn over the PCIe bus, i.e. from the motherboard’s PSU. The peak power draw of a standard Ti is about 300W (bus + 6-pin + 8-pin); OC models take even more.
mokkaa, from my observations, Pascal-type GPUs are much more sensitive to flaws in the PCIe connection than previous models. Did you check whether the slot the card is working in now is an x16 slot?
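To spell out the split with stock numbers (a rough back-of-the-envelope using the spec values of 75W per slot, 75W per 6-pin and 150W per 8-pin, ignoring transients and OC limits):

# Where a stock 1080 Ti's ~300 W peak actually comes from.
slot_w, six_pin_w, eight_pin_w = 75, 75, 150

aux_per_card = six_pin_w + eight_pin_w    # 225 W over the cables of the aux PSU
slot_per_card = slot_w                    # 75 W through the slot, i.e. the motherboard PSU

cards_on_aux_psu = 3                      # e.g. the three cards cabled to the 1000 W unit
print(cards_on_aux_psu * aux_per_card)    # 675 W of cable load on that PSU
print(cards_on_aux_psu * slot_per_card)   # plus 225 W landing on the motherboard's PSU

So counting only the aux connectors understates the load on the PSU that feeds the motherboard.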

Yep, all the PCIe slots are x16 (it’s an ASUS X99-DELUXE MoBo)

I have a similar problem: https://devtalk.nvidia.com/default/topic/1030326/linux/gpu-fallen-off-the-bus-while-running-more-demanding-demos-or-games/

In my opinion it is some driver-related problem, because I don’t have this problem under Windows, even with the same games. I also don’t have any problems when rendering in Blender, nor with GPU mining (I tried Monero for an hour or so).

Okay.

Is there any program similar to gpu-burn for Windows? I have a dual boot, so I should be able to test there. I just didn’t check this due to the comment by generix: “Just to save you time: the XID 79 you’re getting is almost always connected with hardware issues”.

generix, how sure are you about that comment?

I’m very sure about that comment, but let me extend it a little. From the point of view of the GPU, the reasons are power failure, overheating, or PCIe failure, i.e. hardware. Of course, all of those can still be software related (PCIe = chipset/PCIe driver, overheating = failure of system fan control, power = ACPI issues, etc.), yet none of that is NVIDIA driver related; it’s system specific.