Xid 79, GPU has fallen off the bus.

I have been losing GPUs intermittently when running under high workload for long periods of time. I am running 5 GPUs: 4 EVGA GTX 1080s (1 in the tower and 3 in a cyclone PCI expansion system) and 1 Quadro. The dmesg output indicates that 3 of the 4 EVGA GTX 1080s are dropping off. I suspect they are the GPUs in the cyclone, but I need to verify this.

Does anyone have any insights or know what Xid 79 indicates? Xid 79 isn’t defined in the documentation!

I have included system information and dmesg output below.

more /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.4 LTS"

cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 Sat Sep 3 18:21:08 PDT 2016
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)

dmesg

[155044.078590] NVRM: GPU at PCI:0000:06:00: GPU-0bed67c6-4a38-cce7-d78a-d4451b4479da
[155044.078590] NVRM: GPU Board Serial Number: 0324216021492
[155044.078590] NVRM: Xid (PCI:0000:06:00): 79, GPU has fallen off the bus.
[155044.078590]
[155044.078590] NVRM: GPU at 0000:06:00.0 has fallen off the bus.
[155044.078590] NVRM: GPU is on Board 0324216021492.
[155044.078590] NVRM: A GPU crash dump has been created. If possible, please run
[155044.078590] NVRM: nvidia-bug-report.sh as root to collect this data before
[155044.078590] NVRM: the NVIDIA kernel module is unloaded.
[155044.078590] NVRM: GPU at PCI:0000:07:00: GPU-e6b9f0b3-5ca4-e81f-30e9-a44511c67fd2
[155044.078590] NVRM: GPU Board Serial Number: 0324316266492
[155044.078590] NVRM: Xid (PCI:0000:07:00): 79, GPU has fallen off the bus.
[155044.078590]
[155044.078590] NVRM: GPU at 0000:07:00.0 has fallen off the bus.
[155044.078590] NVRM: GPU is on Board 0324316266492.
[155044.078590] NVRM: GPU at PCI:0000:09:00: GPU-c0c23e20-3dea-79f6-a2b1-d2c0e63afb3e
[155044.078590] NVRM: GPU Board Serial Number: 0324216021748
[155044.078590] NVRM: Xid (PCI:0000:09:00): 79, GPU has fallen off the bus.
[155044.078590]
[155044.078590] NVRM: GPU at 0000:09:00.0 has fallen off the bus.
[155044.078590] NVRM: GPU is on Board 0324216021748.
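
To check which devices are dropping, the Xid lines can be grouped by PCI address. The following is a minimal Python sketch, not an NVIDIA tool; the regular expression is an assumption based only on the dmesg lines shown above, and `xid_events` and `SAMPLE` are names made up for this illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   [155044.078590] NVRM: Xid (PCI:0000:06:00): 79, GPU has fallen off the bus.
# The pattern is derived from the log excerpt above; other driver versions
# may format these lines differently.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
)


def xid_events(log_text):
    """Return {pci_address: [(timestamp, xid), ...]} for every Xid line."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events[match.group("pci")].append(
                (float(match.group("ts")), int(match.group("xid")))
            )
    return dict(events)


# A few lines copied from the dmesg output above.
SAMPLE = """\
[155044.078590] NVRM: Xid (PCI:0000:06:00): 79, GPU has fallen off the bus.
[155044.078590] NVRM: GPU at 0000:06:00.0 has fallen off the bus.
[155044.078590] NVRM: Xid (PCI:0000:07:00): 79, GPU has fallen off the bus.
[155044.078590] NVRM: Xid (PCI:0000:09:00): 79, GPU has fallen off the bus.
"""

for pci, hits in sorted(xid_events(SAMPLE).items()):
    print(pci, hits)
```

To use it on a real log, save the output of dmesg to a file and pass its contents to xid_events. The resulting PCI addresses can then be compared against the bus topology (for example from lspci -tv) to see whether they sit behind the expansion system.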

Relevant thread:
https://devtalk.nvidia.com/default/topic/985037/gtx-1070-quot-gpu-has-fallen-off-the-bus-quot-running-3d-games-in-arch-linux-/

Under Linux, compile & run gpu-burn. Note that this will stress the GPUs close to 100%, so make sure the power supply can handle the full load and that there is enough ventilation. If a GPU errors out during this test, the GPU itself is more than likely at fault.

If the “fallen off the bus” errors are reproducible at full load, gpu-burn will more than likely trigger them within 5 or 10 minutes of running. Given your configuration with an external PCI-E device, it is quite possible that the cabling, the interface card, or even the PCI-E slot that the interface card is connected to is bad, or that you do not have enough power in the expansion slot for all the GPUs.

If power is not the issue, my advice would be to test each GPU on its own, physically installed in the machine, to rule out the expansion solution.

FWIW, I had a different Xid error (32) that was caused (most likely) by a bad PLX chip or slot on an ASRock motherboard: http://forum.asrock.com/forum_posts.asp?TID=271&PID=14414&title=asrock-x99-wse-memory-compatiblity#14414

If you are able to test on Windows, Unigine benchmarks tend to trigger hangs as well. That can help debug if it is an O/S type issue or if it is an actual hardware issue.

“GPU has fallen off the bus” is one of those errors that is really difficult to diagnose remotely, as it can have many root causes, including defective hardware. Vacaloca already pointed to insufficient power supply as one possible reason.

PSUs should be dimensioned such that the total combined wattage of all system components does not exceed 60% of the PSU’s power rating. I would suggest PSUs rated 80 PLUS Platinum; they are more efficient and typically use higher-quality components. Make sure all PCIe power connectors are properly connected (fully inserted, locking tab engaged). Do not use Y-splitters or 6-pin to 8-pin converters in any of the PCIe power cables supplying the GPUs. Make sure the GPUs are inserted properly into their PCIe slots and are mechanically secured via their bracket to avoid mechanical strain on the PCIe connector.
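
The 60% guideline is easy to check with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the wattages in the example are placeholder values, not measurements from any system in this thread.

```python
def psu_load_fraction(component_watts, psu_rating_watts):
    """Fraction of the PSU rating used by the summed component wattages."""
    if psu_rating_watts <= 0:
        raise ValueError("PSU rating must be positive")
    return sum(component_watts.values()) / psu_rating_watts


def within_guideline(component_watts, psu_rating_watts, limit=0.60):
    """True if the combined load stays at or below `limit` of the rating."""
    return psu_load_fraction(component_watts, psu_rating_watts) <= limit


def minimum_psu_rating(component_watts, limit=0.60):
    """Smallest PSU rating (in watts) that satisfies the guideline."""
    return sum(component_watts.values()) / limit


# Placeholder numbers for illustration only; substitute the rated
# power of your own components.
parts = {"cpu": 140, "gpu_0": 180, "gpu_1": 180, "other": 60}

print(within_guideline(parts, 1000))
print(within_guideline(parts, 850))
print(round(minimum_psu_rating(parts)))
```

With these placeholder values the components add up to 560 W, so a 1000 W unit stays under the 60% mark while an 850 W unit does not.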

The use of a PCIe expander is a red flag to me, as this usually has a negative impact on the PCIe signal integrity. Make sure all relevant connectors are clean, fully inserted, and mechanically secured. Avoid any kind of mechanical strain on the connectors, and avoid vibrations (from spinning storage, fans, etc).

I read “GPU has fallen off the buss” ;)

I had the same error on my computer:
Ubuntu 18.04 LTS
RTX 2070

Under Windows 10, no problem.
I finally solved it by forcing the PowerMizer setting to Prefer Maximum Performance.
You can add this command line in Startup Application Preferences:
/usr/bin/nvidia-settings -a '[gpu:0]/GPUPowerMizerMode=1'

Setting PowerMizer to Prefer Maximum Performance does not work for me, but the situation is otherwise similar. Everything works under Win 10, and I have no problem with heavy rendering on the GPU or with FurMark, but when trying to run modern games, the GPU falls off the bus.

Hi!

I’m having this same issue with:

uname -r: 4.19.28-1-MANJARO

I tried the 415.27 and 418.43 drivers from NVIDIA, disabled OpenGL flipping and UBB as suggested on the NVIDIA page, and tried:

/usr/bin/nvidia-settings -a '[gpu:0]/GPUPowerMizerMode=1'

with no luck. Is this a driver issue or a hardware issue?

https://www.asus.com/es/Graphics-Cards/ROG-STRIX-RTX2070-O8G-GAMING/ ==> My graphics card
https://www.corsair.com/es/es/Categor%C3%ADas/Productos/Unidades-de-fuentes-de-alimentaci%C3%B3n/Unidades-de-fuentes-de-alimentaci%C3%B3n-avanzadas/RMx-Series/p/CP-9020093-NA ==> My power supply.

Is there anything I can do to fix it?

Thanks in advance.

It was a faulty GPU in my case. The solution was delayed because the first attempt ended up classified as a “problem with Linux”. During the Christmas holiday I was able to crash my computer under Windows as well, and last week I received a replacement GPU; now it works flawlessly.

Did you read #2 and #3 above? Double check on all potential issues mentioned there.

An 850W power supply is certainly sufficient to supply a single RTX 2070, provided you hook it up correctly. Based on vendor specifications (https://www.asus.com/us/Graphics-Cards/ROG-STRIX-RTX2070-O8G-GAMING/specifications/), your GPU features one 8-pin PCIe power connector (up to 150W) plus one 6-pin PCIe power connector (up to 75W). So you should have two separate PCIe power cables hooked up to these connectors. In particular, make sure the 8-pin connector is connected to a PSU output capable of supplying 150W.
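
As a worked version of that arithmetic, here is a small Python sketch. The 8-pin and 6-pin limits are the ones quoted above; the 75 W slot allowance is the usual figure for a PCIe x16 slot and is an addition here, and `LIMITS_W` and `board_power_budget` are illustrative names.

```python
# Per-source power limits in watts. The connector values come from the
# post above; the slot value is an assumption added for this sketch.
LIMITS_W = {"8-pin": 150, "6-pin": 75, "slot": 75}


def board_power_budget(sources):
    """Sum the in-spec power available from the listed sources."""
    return sum(LIMITS_W[source] for source in sources)


print(board_power_budget(["8-pin", "6-pin"]))
print(board_power_budget(["8-pin", "6-pin", "slot"]))
```

This is only the upper bound the connectors allow within specification; it says nothing about what a particular card actually draws.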

Side remark: As far as I am aware, STRIX refers to a line of aggressively vendor-overclocked GPUs. While these may be fine for gaming, I usually advise against using such cards for CUDA work, until such time that the vendors demonstrate transparently and convincingly that they have validated their overclocked models with relevant compute applications.

Hi!

Thanks for your quick answer. Yes, I saw #2 and #3 above. I have 2 PCIe power cables plugged into my graphics card, and both are connected to the PCIe outputs on the power supply unit. I don’t know how to proceed right now, but if you can guarantee that there is no issue with the drivers, I will rebuild my computer this weekend and see if it helps…

Thanks again.

To be perfectly clear: I am not in a position to guarantee anything.

Note that defective hardware has been mentioned multiple times as one possible root cause in this thread. Note also that if you put together your own system, there is the possibility that you could be the one damaging the hardware, for example by causing mechanical damage to connectors when plugging together components or via electrostatic discharge (because of insufficient grounding) zapping electronic components.

Yeah, it is not my first time assembling computers. When everything works, it works fine; the issue started a few days ago and didn’t happen before (the computer is about 3 weeks old, and I have been experiencing the issue for the last week). It seems that opening the case helps (maybe overheating). If I find something important I’ll post it here!

Thanks!!

Note that the open-fan cooling typically used for the RTX line of GPUs requires adequate case cooling. The blower designs typically used on previous GPU lines would exhaust hot air from the case through holes in the slot bracket. Open-fan designs basically just mix the air inside the case. This is especially a problem when air flow around the GPU is restricted, e.g. by two GPUs in close proximity.

However, all NVIDIA GPUs include thermal caps to prevent overheating (e.g. 83 deg Celsius for my Pascal-based GPU), and I do not recall any instance where heat would trigger falling off the bus. That does not mean it couldn’t happen, of course. If you suspect heat being an issue, monitor it with nvidia-smi or a tool like TechPowerUp’s GPU-Z.
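
For monitoring with nvidia-smi, a small Python wrapper along these lines can log temperature and power per GPU. It is a sketch under a few assumptions: it relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi, the helper names are made up for illustration, and the 83 degree default is simply the cap quoted above for one Pascal-based GPU, so adjust it for your card.

```python
import csv
import io
import subprocess

# Properties requested from nvidia-smi, in output column order.
FIELDS = ["index", "pci.bus_id", "temperature.gpu", "power.draw"]


def parse_smi_csv(text):
    """Turn nvidia-smi CSV output (no header, no units) into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU it reports."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_smi_csv(result.stdout)


def hot_gpus(rows, limit_c=83):
    """Return the rows whose reported temperature is at or above limit_c."""
    return [
        row for row in rows
        if row["temperature.gpu"].isdigit()
        and int(row["temperature.gpu"]) >= limit_c
    ]
```

Calling query_gpus() every few seconds while the load is running and logging the rows gives a record of temperature and power draw leading up to a failure. Note that a GPU that has already fallen off the bus may no longer appear in the output.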

One possibility: the GPU isn’t properly secured at the bracket (screw, clamp) and is slowly wriggling its way out of the PCIe socket, possibly under the influence of mechanical vibrations from other system components or the environment.

Hey! I will reseat the GPU ASAP; until then I’ll be monitoring the system as you propose here. I don’t know, however, what vibrations might be unplugging the GPU. Really appreciated!

Thanks.