Hi all,
I boot my laptop successfully, and after a variable amount of time the screen goes black.
I'm using an Alienware M17xR3 laptop:
- OS: Ubuntu 18.04
- Driver: nvidia-390.116
- Optimus: OFF (disabled in the BIOS)
- Video Card: GTX 580M
I use the laptop in two configurations:
- With Optimus: no issues; PowerMizer switches between levels 0 and 1 (it never reaches 2 or 3).
- Without Optimus: after some time the screen goes black. In this configuration PowerMizer moves between levels 0 and 3.
I hadn't used the "without Optimus" configuration for over two years; the last time I did, it worked under Fedora without issues. About a week ago I decided to give it a try again and noticed the black screen issue.
Here’s what I did:
- Cleaned the fan.
- Replaced the GPU thermal paste.
- Tested both the Adaptive and Prefer Maximum Performance PowerMizer modes (see the commands after this list).
- Tested older driver versions, for example nvidia-340.
- Ran the laptop with the case open so that it doesn't overheat; I never saw the temperature go above 60 °C.
Unfortunately, after all of that the issue still persists.
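For reference, this is roughly how I switched between the two PowerMizer modes for that test (the GPU index 0 is an assumption for my single-GPU setup; 0 = Adaptive, 1 = Prefer Maximum Performance):

# Force "Prefer Maximum Performance" for the current X session:
nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"
# Switch back to "Adaptive":
nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=0"
# Check which performance level the GPU is currently in while testing:
nvidia-settings -q "[gpu:0]/GPUCurrentPerfLevel"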
Here’s what I noticed:
- After the black screen I can still SSH into the laptop and work with it.
- After the black screen, if I try to open nvidia-settings from an SSH session (X forwarded), I get the error "GPU has fallen off the bus" (see the commands after this list).
- I don't need to stress the GPU to get the black screen; the issue appears even when the machine is idle.
- Stressing the GPU with a game doesn't make the issue appear faster. My current record without a black screen is ~30 minutes in Dirt Rally.
- The issue seems to depend on the power source: on one particular socket at home, most of the time the black screen appears even before I can log in.
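For completeness, these are roughly the commands I run from another machine once the screen goes black (the user/host names are placeholders):

# The laptop still answers over the network:
ssh -X me@alienware
# The kernel log shows the NVRM / Xid messages quoted further down:
dmesg | grep -iE "NVRM|Xid"
# Collect the bug reports attached to this post:
sudo nvidia-bug-report.sh
# Trying to reach the local X display fails at this point:
DISPLAY=:0 nvidia-settings -q all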
I'm attaching two bug reports: one captured before the issue happens and one captured after.
Before: https://drive.google.com/open?id=1MW3-SGYXagEVLHeiV6k_BGRPlP0gX2A5
After: https://drive.google.com/open?id=14yku7Cx_QQvXrtwrBGMr-SDRClgtCfXO
I tried to diagnose the logs myself, but I didn't manage to find anything that hints at what the issue is. Here are some of the interesting entries:
- [ 1225.709] (EE) NVIDIA(GPU-0): WAIT (2, 9, 0x8000, 0x0000bd98, 0x0000bde4)
- [ 1201.703778] NVRM: GPU at PCI:0000:01:00: GPU-b6adb8e6-62e4-e943-114e-9d7cb0ee88bc
  [ 1201.703781] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
  [ 1201.703783] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
  [ 1201.703792] NVRM: A GPU crash dump has been created. If possible, please run
                 NVRM: nvidia-bug-report.sh as root to collect this data before
                 NVRM: the NVIDIA kernel module is unloaded.
  [ 1222.704830] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:0:0:0x0000000f
  [ 1222.705446] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:0:0:0x0000000f
  [ 1222.706072] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:0:0:0x0000000f
- nvidia-settings -q all:
Unable to init server: Could not connect: Connection refused
No protocol specified
ERROR: Unable to find display on any available system
No protocol specified
ERROR: Unable to find display on any available system
- xrandr --verbose:
No protocol specified
Can't open display :0
- /usr/bin/nvidia-smi --query:
Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU
I'm starting to wonder whether the issue comes not from the GPU itself but from my power supply (or something else entirely). I really want to understand where the issue comes from and how I can fix it.
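To check the power-source theory, this is the rough monitoring loop I plan to leave running over SSH; the log path is my own choice, and I'm not sure how many of the nvidia-smi query fields the GTX 580M actually reports on this driver:

# Record temperature and performance state once a second, plus the latest Xid line,
# so that after a crash I can line the timestamps up with which socket/adapter was in use.
while true; do
    echo "=== $(date) ===" >> ~/gpu-watch.log
    nvidia-smi --query-gpu=temperature.gpu,pstate --format=csv,noheader >> ~/gpu-watch.log 2>&1
    dmesg | grep -i "Xid" | tail -n 1 >> ~/gpu-watch.log
    sleep 1
done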