GPU has fallen off the bus GTX 1060

I am having an issue with the GTX 1060 in a PCSpecialist laptop. The laptop is running Linux Mint 18.3, and I didn’t experience this issue until recently. The laptop has been used for gaming, but I took a break from gaming for a few weeks. There have obviously been some updates since things last worked, but since that was several weeks ago it is difficult to pin down which one is responsible.

The specific problem is that launching 3D games will cause the screen to go black. I can SSH into the laptop, and the journal will report:

The GPU has fallen off the bus.
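
For anyone who wants to check for the same message over SSH, here is a minimal sketch of how the kernel log could be filtered. `find_bus_errors` is a hypothetical helper name, and the sketch assumes the driver logs the message with the wording "fallen off the bus".

```shell
#!/bin/sh
# Hypothetical helper: print log lines that report the GPU dropping off the bus.
# Reads log text on stdin, so it works on live journal output or on a saved file.
find_bus_errors() {
    grep -i 'fallen off the bus'
}

# Typical use over SSH (kernel messages from the current boot):
#   journalctl -k -b --no-pager | find_bus_errors
```

If nothing is printed, the current boot has no such message; after a reboot, `journalctl -k -b -1` reads the previous boot instead, provided the journal is stored persistently.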

I have tried this on both the 384.111 and 390.25 drivers. Both experience the problem. I will attach the nvidia-bug-report files for both drivers.
nvidia-bug-report.log.old.gz (275 KB)
nvidia-bug-report.log.gz (274 KB)

A little update:

I was running the 4.13 kernel. When I switched to the 4.10 kernel, the problem seemed to stop. It might be too early to draw conclusions, but for now it seems to be working.

EDIT: It happened again after playing for some time. I ssh’d into the system and generated a new nvidia bug report, attaching it.
nvidia-bug-report.log.gz (275 KB)

Maybe check whether the GPU is overheating. Use something like
nvidia-smi -q -l 2 -d TEMPERATURE >temperature.log
in the background. Then check the log when the issue strikes. Clean the heat spreader/fans if the temperature rises too high.
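
To pull the peak value out of that log afterwards, something like the following rough sketch could be used. `max_gpu_temp` is a hypothetical helper name, and it assumes the log contains lines of the form `GPU Current Temp : 67 C`, which is how `nvidia-smi -q -d TEMPERATURE` normally prints the reading; adjust the pattern if your driver version labels it differently.

```shell
#!/bin/sh
# Hypothetical helper: print the highest "GPU Current Temp" value in a log file.
# Assumes lines such as "GPU Current Temp : 67 C"; prints 0 if none are found.
max_gpu_temp() {
    awk -F': *' '
        /GPU Current Temp/ { t = $2 + 0; if (t > max) max = t }
        END { print max + 0 }
    ' "$1"
}

# Usage:
#   max_gpu_temp temperature.log
```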

The temperature is fine. I kept checking it via SSH while the game was running, and as far as I could see it never rose above 67 C.

This does not always happen immediately. Sometimes it happens as soon as any 3D game is launched, but sometimes it will launch without issue and keep running for a while. But it always does happen, making it impossible to play games. I have seen a few other topics here regarding the GeForce 10xx series, so perhaps there is some issue currently with the 10xx series on Linux.

Ok, then maybe also check if setting maximum performance in powermizer instead of adaptive works around that.
https://devtalk.nvidia.com/default/topic/1030326/linux/gpu-fallen-off-the-bus-while-running-more-demanding-demos-or-games/
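
The PowerMizer mode can also be set from a terminal instead of the GUI. A minimal sketch, assuming a single GPU addressed as `[gpu:0]` and a running X session: `set_powermizer` and the `NVIDIA_SETTINGS` override variable are hypothetical names used only for this example, while `GPUPowerMizerMode` is the nvidia-settings attribute (0 = adaptive, 1 = prefer maximum performance, 2 = auto).

```shell
#!/bin/sh
# Hypothetical helper: set the PowerMizer preferred mode on GPU 0.
# NVIDIA_SETTINGS may be overridden (e.g. with "echo") for a dry run.
set_powermizer() {
    case "$1" in
        adaptive) mode=0 ;;
        max)      mode=1 ;;
        auto)     mode=2 ;;
        *) echo "usage: set_powermizer adaptive|max|auto" >&2; return 1 ;;
    esac
    "${NVIDIA_SETTINGS:-nvidia-settings}" -a "[gpu:0]/GPUPowerMizerMode=$mode"
}

# Usage:
#   set_powermizer max
#   nvidia-settings -q "[gpu:0]/GPUPowerMizerMode"   # read back the current value
```

Reading the value back with `-q` right after setting it shows whether the driver actually accepted the change.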

I changed the powermizer setting to maximum performance, but there was no change. As before, I could play games for a short while, but then the screen turns black. I can ssh into the machine and the journal has the same error.

The “GPU has fallen off the bus” message means hardware failure. Either the GPU is overclocked or overheated, or both, or it is just faulty. It might also be a power supply problem. Setting a lower power limit will probably help.

Sorry for being so late to reply here, but I have been testing a lot.

The GPU is not overclocked, and it does not overheat. As I stated before, I have been monitoring the temperature right up until the problem occurs, and it stays steady at around 65 C.

Quick summary: Optimus laptop, but the integrated GPU is disabled in BIOS, so only the Nvidia GPU is running. I have been testing on both Linux Mint and Debian 9, since Debian runs an older kernel (4.9) and Nvidia driver (375.66). The problem occurs on both Mint and Debian. However, it seems to take longer with the older Nvidia driver. With the 390 driver the problem occurs very quickly.

At first I thought the problem only occurred with demanding games, but playing less demanding games also makes the problem occur. It just takes longer, sometimes up to two hours.

Based on this, it really looks like a hardware problem. So I decided to install Windows 10 on the laptop. And with Windows, I have no problem at all. I was able to play several games for hours without issue. The longest session was 6.5 hours straight playing Sims 3, followed by two Unigine benchmarks on extreme settings. Still no problems.

So I can’t help but suspect there is some issue with the Linux Nvidia driver.

For now I will dual boot Windows and Linux, and I will make sure to do more tests.

Daerandin, how does PowerMizer (is that the right name?) behave in your NVIDIA X Server Settings? As I described in my thread (you know it, you have posted there: https://devtalk.nvidia.com/default/topic/1030326/linux/gpu-fallen-off-the-bus-while-running-more-demanding-demos-or-games/), it behaves strangely in my case. Performance scaling obviously works (fan noise changes with GPU load, and the card delivers the computing power it should under demanding tasks), but the maximum performance level is indicated in all cases (even without any load and with quiet fans), and if I change the preferred mode to any value other than ‘Auto’, it is set back to ‘Auto’ the next time I open the settings. It seems as if the PowerMizer settings aren’t “wired” to the real card and aren’t stored anywhere.

And, as Generix has written, power settings may be related to our problem. I agree with you that we are experiencing the same problem and that it is some software bug. I have also installed Win10, and everything works there without problems. I haven’t yet tried an older driver and kernel as you did, however. That old version is no longer in the Ubuntu repository, so I would have to install it from the binary downloaded from Nvidia, and I want to do it in a reversible way in Ubuntu …