Black screen after a variable amount of running time

Hi all,

I boot my laptop successfully, and after a variable amount of time the screen goes black.

I’m using an Alienware M17xR3 laptop:

  • OS: Ubuntu 18.04
  • Driver: nvidia-390.116 driver
  • Optimus: OFF (disabled in the BIOS)
  • Video Card: GTX 580M

I use the laptop in 2 variants:

  • With Optimus - no issues, PowerMizer changes between levels 0 and 1 (it never goes to 2 or 3).
  • Without Optimus - after some time the screen goes black. In this state PowerMizer changes from level 0 to level 3.

I haven’t used the “Without Optimus” variant for over 2 years; the last time I did, it worked under Fedora without issues. A week ago I decided to give it a try again and noticed the black screen issue.
Here’s what I did:

  • Cleaned the FAN.
  • Changed the GPU paste.
  • Tested between adaptive and performance mode for PowerMizer.
  • Tested with older driver versions, for example nvidia-340.
  • Ran the laptop open so that it doesn't overheat. I never saw the temperature go over 60 degrees Celsius.

Unfortunately after all that the issue still persisted.
Here’s what I noticed:

  • After the black screen I can still ssh into the laptop and work with it.
  • After the black screen, if I try to open nvidia-settings from an ssh session (X forwarded), I get the error: GPU has fallen off the bus
  • I don't need to stress the GPU to get the black screen. The issue arises even when the laptop is doing nothing.
  • Stressing the GPU with a game doesn't make the issue appear faster. The current record without a black screen is ~30 minutes in the game Dirt Rally.
  • The issue depends on the power source: on one socket at home, the black screen usually comes even before I can log in.

I’m attaching 2 bug reports. One is before the issue happens and the other is after the issue happens.
Before: https://drive.google.com/open?id=1MW3-SGYXagEVLHeiV6k_BGRPlP0gX2A5
After: https://drive.google.com/open?id=14yku7Cx_QQvXrtwrBGMr-SDRClgtCfXO

I tried to diagnose the log, but I didn’t manage to find anything that hints at what the issue is. Here are some of the interesting things from the log:

  • [ 1225.709] (EE) NVIDIA(GPU-0): WAIT (2, 9, 0x8000, 0x0000bd98, 0x0000bde4)
  • [ 1201.703778] NVRM: GPU at PCI:0000:01:00: GPU-b6adb8e6-62e4-e943-114e-9d7cb0ee88bc
    [ 1201.703781] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
    [ 1201.703783] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
    [ 1201.703792] NVRM: A GPU crash dump has been created. If possible, please run
    NVRM: nvidia-bug-report.sh as root to collect this data before
    NVRM: the NVIDIA kernel module is unloaded.
    [ 1222.704830] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:0:0:0x0000000f
    [ 1222.705446] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:0:0:0x0000000f
    [ 1222.706072] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:0:0:0x0000000f
  • nvidia-settings -q all:

    Unable to init server: Could not connect: Connection refused
    No protocol specified

    ERROR: Unable to find display on any available system

    No protocol specified

    ERROR: Unable to find display on any available system

  • xrandr --verbose:

    No protocol specified
    Can’t open display :0

  • /usr/bin/nvidia-smi --query

    Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU
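
For anyone who wants to scan their own kernel log for the same signature, here is a rough Python sketch that pulls the Xid events out of log text. The function name find_xid_events and the regular expression are my own illustration, written against the line format quoted above; other driver versions may format these lines differently.

```python
import re

# Matches kernel log lines of the form quoted in this post, e.g.
#   [ 1201.703781] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
# Assumption: the format is taken from this one log only.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): "
    r"(?P<code>\d+), (?P<msg>.*)"
)


def find_xid_events(log_text):
    """Return a list of (seconds_since_boot, pci_address, xid_code, message)."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (
                    float(match["ts"]),
                    match["pci"],
                    int(match["code"]),
                    match["msg"].strip(),
                )
            )
    return events


# Sample input copied from the log lines in this post.
sample = (
    "[ 1201.703778] NVRM: GPU at PCI:0000:01:00: "
    "GPU-b6adb8e6-62e4-e943-114e-9d7cb0ee88bc\n"
    "[ 1201.703781] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.\n"
    "[ 1201.703783] NVRM: GPU at 0000:01:00.0 has fallen off the bus.\n"
)

for ts, pci, code, msg in find_xid_events(sample):
    print(f"{ts:.6f}s {pci} Xid {code}: {msg}")
```

To use it on a live system, the saved output of dmesg could be passed in as log_text instead of the sample string.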

I’m starting to wonder whether the issue comes not from the GPU but from my power source or something else. I really want to understand what causes the issue and how I can fix it.

The config space of the GPU is all 0xff, i.e. something cut the power to it. Maybe some defective voltage regulators on the mainboard. Does this work if you disconnect every accessory and run on battery?
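
A minimal sketch of how the "all 0xff" check could be repeated by hand, assuming a Linux system that exposes PCI config space through sysfs and a GPU at 0000:01:00.0 as in the log. The helper names config_space_is_all_ff and read_config_space are hypothetical, made up for this illustration.

```python
from pathlib import Path


def config_space_is_all_ff(config_bytes):
    """Return True if there is at least one byte and every byte is 0xff.

    A device that no longer answers on the bus typically reads back as
    all ones, which is why this pattern suggests the GPU has dropped off.
    """
    return len(config_bytes) > 0 and all(b == 0xFF for b in config_bytes)


def read_config_space(pci_address="0000:01:00.0"):
    """Read the raw config space bytes that Linux exposes via sysfs.

    Assumption: the default address is the one shown in the log in this
    thread; it will differ on other machines.
    """
    return Path(f"/sys/bus/pci/devices/{pci_address}/config").read_bytes()
```

On the affected machine, config_space_is_all_ff(read_config_space()) would be expected to return False while the GPU is healthy and True after it has fallen off the bus.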

Hi generix,

Unfortunately I can’t test on battery power alone. The battery holds for only about 30 seconds and then dies.
I have nothing connected to the laptop, no mouse, no headphones, nothing.

Tough luck. What happens if you remove the battery?

Hi generix,

With or without the battery the behavior is the same.
Do you think that a UPS or something similar could help with the issue?

IDK, really hard to say since I don’t quite have an idea what could trigger this.