"GPU has fallen off the bus" on GTX 1070

Hello,

I have a recurring problem with a GTX 1070 that stops randomly during use: sometimes after 30 seconds, sometimes after 5 minutes, on average after about 30 minutes. Sometimes it doesn’t even show up in nvidia-smi right after boot and needs another restart.

I’m currently using driver version 381.22, but I had the same problem with version 375.66, so updating doesn’t seem to change anything. The same thing happens with or without overclocking.

Motherboard : MSI Z170-A PRO
GPUs : 5*KFA2 GeForce GTX 1070

I don’t know what to do… Help please !

Here is the syslog:

Jul  4 19:48:54 lithium kernel: [15826.906938] NVRM: GPU at PCI:0000:06:00: GPU-ad005112-8ee4-06e7-d971-4c1ba8c52cce
Jul  4 19:48:54 lithium kernel: [15826.906948] NVRM: GPU Board Serial Number:
Jul  4 19:48:54 lithium kernel: [15826.906954] NVRM: Xid (PCI:0000:06:00): 79, GPU has fallen off the bus.
Jul  4 19:48:54 lithium kernel: [15826.906954]
Jul  4 19:48:54 lithium kernel: [15826.906960] NVRM: GPU at 0000:06:00.0 has fallen off the bus.
Jul  4 19:48:54 lithium kernel: [15826.906964] NVRM: GPU is on Board .
Jul  4 19:48:54 lithium kernel: [15826.907121] NVRM: A GPU crash dump has been created. If possible, please run
Jul  4 19:48:54 lithium kernel: [15826.907121] NVRM: nvidia-bug-report.sh as root to collect this data before
Jul  4 19:48:54 lithium kernel: [15826.907121] NVRM: the NVIDIA kernel module is unloaded.
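
For anyone who wants to check whether it is always the same card that drops, here is a minimal sketch that counts Xid events per PCI address. It assumes the kernel log lines look like the excerpt above; the function name `count_xid_events` and the regular expression are illustrative, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Pattern for the "NVRM: Xid (PCI:...): <code>" line format seen in the
# syslog excerpt above. Other driver versions may format it differently.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def count_xid_events(lines):
    """Return a Counter keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

It can be fed any iterable of log lines, for example an open file handle on the syslog (the path depends on the distribution). If one PCI address accounts for all Xid 79 entries, that points at a single card, riser or slot rather than at the whole system.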

I also used to get this warning. It disappeared when I added the “irqpoll” option to GRUB, but the problem remains:

Jun 30 20:10:22 lithium kernel: [  493.259090] irq 16: nobody cared (try booting with the "irqpoll" option)
Jun 30 20:10:22 lithium kernel: [  493.259094] CPU: 1 PID: 0 Comm: swapper/1 Tainted: P           OE   4.4.0-79-generic #100-Ubuntu
Jun 30 20:10:22 lithium kernel: [  493.259095] Hardware name: MSI MS-7971/Z170-A PRO (MS-7971), BIOS 1.I0 05/02/2017
Jun 30 20:10:22 lithium kernel: [  493.259096]  0000000000000086 27c8eb7098a57078 ffff88019ed03e60 ffffffff813f94d3
Jun 30 20:10:22 lithium kernel: [  493.259098]  ffff880057b12a00 ffff880057b12ad4 ffff88019ed03e88 ffffffff810dde23
Jun 30 20:10:22 lithium kernel: [  493.259100]  ffff880057b12a00 0000000000000000 0000000000000010 ffff88019ed03ec0
Jun 30 20:10:22 lithium kernel: [  493.259101] Call Trace:
Jun 30 20:10:22 lithium kernel: [  493.259102]  <IRQ>  [<ffffffff813f94d3>] dump_stack+0x63/0x90
Jun 30 20:10:22 lithium kernel: [  493.259108]  [<ffffffff810dde23>] __report_bad_irq+0x33/0xc0
Jun 30 20:10:22 lithium kernel: [  493.259109]  [<ffffffff810de1b7>] note_interrupt+0x247/0x290
Jun 30 20:10:22 lithium kernel: [  493.259110]  [<ffffffff810db367>] handle_irq_event_percpu+0x167/0x1d0
Jun 30 20:10:22 lithium kernel: [  493.259112]  [<ffffffff810db40e>] handle_irq_event+0x3e/0x60
Jun 30 20:10:22 lithium kernel: [  493.259113]  [<ffffffff810de729>] handle_fasteoi_irq+0x99/0x150
Jun 30 20:10:22 lithium kernel: [  493.259115]  [<ffffffff8103119d>] handle_irq+0x1d/0x30
Jun 30 20:10:22 lithium kernel: [  493.259117]  [<ffffffff8184345b>] do_IRQ+0x4b/0xd0
Jun 30 20:10:22 lithium kernel: [  493.259119]  [<ffffffff81841542>] common_interrupt+0x82/0x82
Jun 30 20:10:22 lithium kernel: [  493.259119]  <EOI>  [<ffffffff816d4791>] ? cpuidle_enter_state+0x111/0x2b0
Jun 30 20:10:22 lithium kernel: [  493.259122]  [<ffffffff816d4967>] cpuidle_enter+0x17/0x20
Jun 30 20:10:22 lithium kernel: [  493.259124]  [<ffffffff810c4672>] call_cpuidle+0x32/0x60
Jun 30 20:10:22 lithium kernel: [  493.259125]  [<ffffffff816d4943>] ? cpuidle_select+0x13/0x20
Jun 30 20:10:22 lithium kernel: [  493.259127]  [<ffffffff810c4930>] cpu_startup_entry+0x290/0x350
Jun 30 20:10:22 lithium kernel: [  493.259128]  [<ffffffff810517c4>] start_secondary+0x154/0x190
Jun 30 20:10:22 lithium kernel: [  493.259129] handlers:
Jun 30 20:10:22 lithium kernel: [  493.259135] [<ffffffffc101df10>] azx_interrupt [snd_hda_codec]
Jun 30 20:10:22 lithium kernel: [  493.259136] Disabling IRQ #16
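
The trace above names azx_interrupt as the registered handler on IRQ 16. One way to see which devices share a given interrupt line is to read /proc/interrupts. The sketch below parses that text; it assumes the common layout of one counter column per CPU followed by the controller name, the trigger type, and then the comma-separated handler names. That layout varies between kernel versions, and `handlers_for_irq` is a hypothetical helper name.

```python
def handlers_for_irq(interrupts_text, irq):
    """List the handler names registered on one IRQ line.

    interrupts_text is the content of /proc/interrupts. Assumed layout:
    "<irq>: <count per CPU>... <controller> <trigger> <name>[, <name>...]".
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        fields = line.split()
        if not fields or fields[0] != f"{irq}:":
            continue
        # Skip the IRQ label, the per-CPU counters, the controller name
        # and the trigger type; what remains are the handler names.
        names = fields[1 + cpu_count + 2:]
        return [name.strip(",") for name in names if name.strip(",")]
    return []
```

Calling it with the contents of /proc/interrupts and `16` would show whether the sound device and a GPU sit on the same legacy interrupt line on this board.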

nvidia-bug-report before crash : https://www.dropbox.com/s/y2g05f0k2isz90d/nvidia-bug-report.log.gz?dl=0
nvidia-bug-report after crash : https://www.dropbox.com/s/x4d67ntzsriu7cr/nvidia-bug-report-2.log.gz?dl=0

I have a second computer with the exact same config which works like a charm.

nvidia-bug-report.log.gz (313 KB)

As always:

  1. Reseat your GPU or put it in another slot
  2. Update BIOS/reset it to defaults; update GPU BIOS if it’s available
  3. Remove any overclocking; check thermals
  4. Try with another (more powerful) power supply unit

Already did that!

It looks like only the GPU at PCI:0000:06:00 reports Xid 79, “GPU has fallen off the bus”. This can also be a hardware issue. Do the other GPUs have this issue as well? Please check whether the GPU is overheating, and make sure there is no power issue. I assume you are running multiple GPUs; is there any specific setting or configuration in your setup? Does the issue also occur when only the affected GPU is installed? Finally, if there is no hardware issue, we will need reproduction steps for this issue.

Thanks for your reply.

Indeed, only this one falls off the bus (and nobody cared, which is not very nice…). No overheating, no power issue. As you can see, it’s cooler than its siblings (due to its position in the case).

Here is the output of nvidia-smi before the crash:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 381.22                 Driver Version: 381.22                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 0000:01:00.0      On |                  N/A |
| 80%   68C    P2   128W / 129W |   2241MiB /  8113MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    On   | 0000:03:00.0      On |                  N/A |
| 80%   68C    P2   126W / 129W |   2226MiB /  8114MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1070    On   | 0000:05:00.0      On |                  N/A |
| 80%   65C    P2   127W / 129W |   2226MiB /  8114MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1070    On   | 0000:06:00.0      On |                  N/A |
| 80%   60C    P2   126W / 129W |   2226MiB /  8114MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 1070    On   | 0000:08:00.0      On |                  N/A |
| 80%   73C    P2   128W / 129W |   2226MiB /  8114MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1488    G   /usr/lib/xorg/Xorg                              28MiB |
|    0      3409    C   ...ospector/ethminer/build/ethminer/ethminer  2201MiB |
|    1      1488    G   /usr/lib/xorg/Xorg                              13MiB |
|    1      3409    C   ...ospector/ethminer/build/ethminer/ethminer  2201MiB |
|    2      1488    G   /usr/lib/xorg/Xorg                              13MiB |
|    2      3409    C   ...ospector/ethminer/build/ethminer/ethminer  2201MiB |
|    3      1488    G   /usr/lib/xorg/Xorg                              13MiB |
|    3      3409    C   ...ospector/ethminer/build/ethminer/ethminer  2201MiB |
|    4      1488    G   /usr/lib/xorg/Xorg                              13MiB |
|    4      3409    C   ...ospector/ethminer/build/ethminer/ethminer  2201MiB |
+-----------------------------------------------------------------------------+

There are no reproduction steps: it happens randomly and, as I said, sometimes the card doesn’t even show up in nvidia-smi right after boot and needs another restart.

This problem also happens when I’m not mining.
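
Since the failure is random, it can help to log the same readings as in the table above in machine-readable form and note which bus ID disappears. The sketch below assumes the output comes from `nvidia-smi --query-gpu=index,pci.bus_id,temperature.gpu,power.draw,power.limit --format=csv,noheader,nounits`; the helper names and the dictionary keys are made up for this example, and the bus ID format may differ between driver versions.

```python
import csv
import io


def _to_float(value):
    """Convert a CSV field to float, or None if nvidia-smi printed an error."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse 'index, bus id, temperature, power draw, power limit' CSV rows."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) < 5:
            continue  # skip blank or truncated lines
        rows.append({
            "index": int(record[0]),
            "bus_id": record[1],
            "temp_c": _to_float(record[2]),
            "power_w": _to_float(record[3]),
            "limit_w": _to_float(record[4]),
        })
    return rows


def missing_gpus(rows, expected_bus_ids):
    """Return the expected bus IDs that are absent from the parsed rows."""
    seen = {row["bus_id"] for row in rows}
    return sorted(set(expected_bus_ids) - seen)
```

Running the query from a timer and comparing against the five expected bus IDs would give a timestamp for each drop, which is easier to correlate with temperature and power draw than a single snapshot.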

“Sometimes it doesn’t even show up in nvidia-smi right after boot.”
Do the other GPUs not have this issue? Then it looks like a hardware issue with this specific GPU.

I’m gonna look further then let you know.

It’s worth noting that the GPUs are used for mining ;-)

I’d recommend downclocking their RAM/GPU a bit.

I tried different settings with no effect on this problem.

I’m also getting this issue with Gigabyte’s 3-fan 1080 Tis on an AsRock PRO BTC+ motherboard with a Celeron processor and 4 PSUs, mining over USB risers. It seems that the more GPUs I add to the system, the more likely it is to crash this way. The mobo is designed to handle 13 GPUs, but I can’t run even 6 GPUs reliably: they may work for some time and then fail like this. If I connect a couple more GPUs at once, they usually all fail together the moment they start mining.

I wonder if the original poster had any success resolving the issue.

An interesting and probably important note: I’ve got Seasonic Platinum Focus 750 W PSUs, and when the shit hits the fan one of them turns off. I’m not sure why, but probably as protection against overcurrent. I tried a different PSU, an Aerocool, and it didn’t turn off automatically when the failure happened; instead I heard a pretty loud sound, like a GPU fan running at 100%. I’m not exactly sure which GPU or GPUs, though. Maybe the GPU starts to draw power from the riser instead of the PCIe power connectors, I don’t know… Interestingly, this loud sound starts at the moment the GPU disconnects and persists from then on; I don’t know what the GPU decides to do when it’s no longer connected to the system…

@ihor.ibm
That sounds like a horrible design for supplying power to a lot of GPUs with high power needs. The PSU shutting down should have given you a hint. Redesign that.
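
As a rough illustration of the power budget involved, here is a small sketch. The numbers are assumptions, not measurements from this rig: roughly 250 W per GTX 1080 Ti (the reference board power; factory-overclocked cards can draw more) and a rule-of-thumb limit of 80% of the PSU rating for sustained load. `max_gpus_per_psu` is a hypothetical helper.

```python
def max_gpus_per_psu(psu_watts, gpu_watts, other_load_watts=0.0, headroom=0.8):
    """Estimate how many GPUs one PSU can feed under sustained load.

    headroom is the fraction of the PSU rating treated as usable; 0.8 is
    a rule of thumb, not a manufacturer specification. other_load_watts
    covers anything else on the same PSU (motherboard, CPU, risers).
    """
    usable = psu_watts * headroom - other_load_watts
    return max(0, int(usable // gpu_watts))


# A 750 W unit with ~250 W cards, assuming nothing else is on that PSU.
print(max_gpus_per_psu(750, 250))
```

Under these assumptions a 750 W unit comfortably feeds two such cards, and fewer if it also powers the motherboard and CPU. Transient spikes above the average draw are not modelled here and can still trip overcurrent protection.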

I have the same issue with a recent build (actually 2), also with the AsRock BTC. I am using SMOS. The symptoms are about the same as ihor.ibm’s. I am able to run 8 GPUs with no overclocking. I run EVGA GTX 1060 6GB cards with Micron memory. I have 2 other rigs on the same mobo running 13 GPUs, and they are running flawlessly at 23.5 MH/s. I am not sure of the memory in those cards, but it’s either Samsung or Micron. I am guessing Samsung, as that would be the only difference between these rigs: the processors, system memory, power supplies and SMOS are all the same. I suspect the Micron memory or some other manufacturing difference is the contributing factor, but I stress that this is speculation at best. I will not have an opportunity to check the memory until next week.

When the problem happens on these rigs I do not see any change in fan speeds or any other telltale events other than the error message popping up on the screen. (I am running Claymore on SMOS.)

In many cases the system freezes tight and the network connection drops, sometimes bringing my entire network down, which then recovers when I unplug the network cable from the rig. (I don’t suspect that this is a direct problem with the nVidia driver, but rather a result of the OS getting hosed by the driver.)

I have also swapped out risers and filled in the slots systematically to try to isolate a specific slot, GPU, riser, etc. I have also swapped out power supplies and CPUs.

All this brings me back to the memory on the GPUs and its interaction with the driver.

I have also tried running the rig on Windows 10 with the latest build and latest nVidia drivers, and I get instability there as well, but I have not been able to determine if the cause is the same.

If anyone has any further ideas, success, or would like me to test something I’d be happy to.

Thank you,

Scott

I too sometimes have difficulty connecting to the network (wifi in my case) after those shutdowns happen, even after a reboot. I also have two AsRock BTC+ mobos and, sadly, both suck like that. You’re lucky if you’re able to run even 8 GPUs on them without problems; I can’t run even 4 reliably.

I just built a rig with my other Asrock BTC+ mobo, 2 Aerocool PSUs and 4 1080 Ti GPUs, and it usually fails quietly, with no loud noises. But sometimes, unpredictably, it fails and those loud noises start. I’m not sure what makes them, as I’ve never given the system enough frying time to check. It seems to be a single fan: it might be the CPU fan or the main PSU fan, I don’t know. Likely not a GPU fan. Usually the additional PSU just shuts off, seemingly due to overcurrent, and the rest of the GPUs turn off, as they can’t remain powered on when some other turned-off GPUs are plugged into PCIe. But very occasionally it’s the main PSU that fails and reboots the system. And very occasionally the GPUs connected to the additional PSU disconnect, but the additional PSU stays on and they stay powered.

Interestingly, this other rig reliably fails when I launch more than 3 1080 Ti GPUs. And I did wire the power connectors specific to this mobo that need to be connected when you put in more than 3 GPUs. It seems to me like it might be a power issue related to that part of the mobo.

FYI, I’m using 8 GB of RAM.

When I connect 4 GPUs, the system can fail with only 2 GPUs working if those GPUs are the ones wired to a different PSU than the mobo. And things can work okay even with 3 GPUs if I only run one GPU on the additional PSU.

@ Scott
Your problem sounds more like an electrical one. Check the phases of the PSUs used and use a galvanic isolator for the network port.

@Generix Yes, it does have symptoms of a power issue, but the systematic device rotation and substitution, coupled with a known working system, points away from that as the primary root cause. My next step will be to take the PSUs from the known good 13-GPU rig, put them with these GPUs and motherboard, and see if the problem persists. If it does, I will swap the suspect GPUs onto the known good mobo and PSUs and see if the problem follows the GPUs. This will be done after I determine the memory brand of the known good GPUs, to establish the variation, if any. I’m stabbing at a correlation between the power draw and the Micron memory (perhaps when even slightly overclocked), perhaps laced in with the driver version.

The addition of a galvanic isolator, if you are serious, is hardly a method of fixing the problem. However, if it is at all possible to add such a device, it may prove beneficial during troubleshooting by eliminating the downtime risk to the network.

Thank you,

Scott

Yeah, I had an idea that maybe it’s an electrical issue, but I’m not certain of that. Interestingly, I’ve just had this experience:

  1. Connected 3 1080ti GPUs. One to the main supply and 2 to alternative.
  2. Started 2 GPUs. One on main supply and one on alternative.
  3. One of them (the one on alternative PSU) got “disconnected”.
  4. I started another GPU on alternative PSU and it worked successfully.

Also, I tried linking the PSU grounds, in case anybody is interested. It didn’t fix the problem.

When I turn on just 3 GPUs (plugged into PCIe) with only 1 of them on the main supply, I usually get the usual alt PSU failure.

Hello,

Can those who are having the issue please let me know the brand of memory that their GPUs have? I am not sure what tools are available under Linux to determine that, but on Windows I use GPU-Z by TechPowerUp.

Thank you,

Scott

Micron on my 1080 Ti.

Actually, it seems that if I even use two GPUs and a single PSU the mobo is going to glitch at some point and the system is going to reboot. Takes a while, a day for example, but it’s gonna glitch and reboot. I don’t know what it is. Crazy mobo, or what.

I’m also experiencing this issue on an older motherboard with 5 GPUs. I’m using 1x USB risers in my setup. Are you all using the same? Have you tried seating that specific GPU directly on the motherboard?
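
To compare a card on a riser with one seated directly in a slot, the negotiated PCIe link can be read from sysfs. This is a minimal sketch, assuming a Linux system that exposes `current_link_width` and `current_link_speed` under /sys/bus/pci/devices; the function names are invented for this example.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def read_attr(device_dir, name):
    """Read one sysfs attribute, or return None if it is missing."""
    try:
        return (device_dir / name).read_text().strip()
    except OSError:
        return None


def nvidia_link_status(sysfs_root="/sys/bus/pci/devices"):
    """Map each NVIDIA display device to its negotiated PCIe link."""
    status = {}
    for device_dir in sorted(Path(sysfs_root).iterdir()):
        if read_attr(device_dir, "vendor") != NVIDIA_VENDOR_ID:
            continue
        pci_class = read_attr(device_dir, "class") or ""
        if not pci_class.startswith("0x03"):
            continue  # skip the HDMI audio function of the same card
        status[device_dir.name] = {
            "width": read_attr(device_dir, "current_link_width"),
            "speed": read_attr(device_dir, "current_link_speed"),
        }
    return status
```

A card that is missing from the result, or that reports a different width or speed from identical cards on the same kind of riser, would be the first candidate to reseat or move to a direct slot.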