940MX black screen and 100% Xorg CPU usage on system resume

dsd_endless · December 27, 2017, 6:34pm

Hi,

On the GeForce 940MX in the Asus X705UQ, the system appears to hang when resuming from S3 suspend. I am testing with driver version 387.34.

At this point the screen is black, but the computer is responsive over ssh.

There are no error messages shown in dmesg. Xorg is using 100% CPU. Backtrace at this point is:

#0  0x00007ff9d6029262 in ?? () from /usr/lib64/xorg/modules/drivers/nvidia_drv.so
#1  0x00007ff9d602ddf9 in ?? () from /usr/lib64/xorg/modules/drivers/nvidia_drv.so
#2  0x00007ff9d602d389 in ?? () from /usr/lib64/xorg/modules/drivers/nvidia_drv.so
#3  0x00007ff9d5fc2b21 in ?? () from /usr/lib64/xorg/modules/drivers/nvidia_drv.so
#4  0x00007ff9d5ffe230 in ?? () from /usr/lib64/xorg/modules/drivers/nvidia_drv.so
#5  0x00007ff9d5fcdba1 in ?? () from /usr/lib64/xorg/modules/drivers/nvidia_drv.so
#6  0x00007ff9d6536fd1 in ?? () from /usr/lib64/xorg/modules/drivers/nvidia_drv.so
#7  0x000000000265a070 in ?? ()
#8  0x0000000001e899f0 in ?? ()
#9  0x000000000272cdd0 in ?? ()
#10 0x000000000047f80e in CMapEnterVT ()
#11 0x000000000048a98c in xf86XVEnterVT ()
#12 0x0000000000477dd0 in xf86VTEnter ()
#13 0x000000000049cb98 in systemd_logind_vtenter ()
#14 0x000000000049ceb5 in message_filter ()
#15 0x00007ff9de1201ad in dbus_connection_dispatch () from /lib64/libdbus-1.so.3
#16 0x00007ff9de1205c8 in _dbus_connection_read_write_dispatch () from /lib64/libdbus-1.so.3
#17 0x0000000000496981 in socket_handler ()
#18 0x000000000059df41 in ospoll_wait ()
#19 0x0000000000596f9b in WaitForSomething ()
#20 0x0000000000435603 in Dispatch ()
#21 0x00000000004398a0 in dix_main ()

nvidia-bug-report output: nvidia-bug-report Asus X705UQ when "hung" during S3 resume · GitHub
(this was captured over ssh while the system was in this hung state)

Interestingly, if I switch VT away from X before S3 suspend, I can suspend from there and it also resumes fine to that state. However, on switching back to the X VT, the hang occurs and I can’t recover.

Please let me know how I can help further.
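
For anyone wanting to capture the same kind of backtrace over ssh, a minimal sketch follows. It assumes gdb is installed on the affected machine and that you run it as root. The helper name xorg_backtrace and the GDB override variable are illustrative only and do not come from the original report.

```shell
#!/bin/sh
# Sketch: dump a backtrace of a running process (e.g. the spinning Xorg).
# xorg_backtrace and the GDB variable are hypothetical helper names.
# GDB may be set to point at a different gdb binary.

xorg_backtrace() {
    pid="$1"
    # -p attaches to the process, -batch exits once the commands have run,
    # -ex executes a single gdb command.
    "${GDB:-gdb}" -p "$pid" -batch -ex "thread apply all bt"
}
```

A typical call would be xorg_backtrace "$(pidof Xorg)", assuming the server process is named Xorg on your system. Without debug symbols for nvidia_drv.so, the driver frames show up as ?? just as in the trace above.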

generix · December 29, 2017, 12:29am

Looks like an ACPI problem. Try the kernel parameters
acpi_osi=! acpi_osi="Windows 2009"
Then please run acpidump and attach the output to your post.
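
After rebooting, it can be worth confirming that a parameter really reached the kernel. The sketch below checks /proc/cmdline for a whole-word match. The function name has_kernel_param is made up for illustration, and the check only suits parameters without embedded spaces (so acpi_osi=! or pcie_port_pm=off, but not the quoted "Windows 2009" form, whose quoting in /proc/cmdline depends on the bootloader).

```shell
#!/bin/sh
# Sketch: test whether a space-free kernel parameter is on the command line.
# has_kernel_param is a hypothetical helper name.
# Usage: has_kernel_param PARAM [CMDLINE_FILE]

has_kernel_param() {
    param="$1"
    file="${2:-/proc/cmdline}"
    # Pad with spaces so only complete, space-separated words match.
    case " $(cat "$file") " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, has_kernel_param 'acpi_osi=!' && echo present prints "present" only when the parameter is active on the running kernel.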

dsd_endless · December 29, 2017, 12:33pm

Thanks for the suggestion. With those params, dmesg reports:

[    0.100982] ACPI: Disabled all _OSI OS vendors
[    0.100982] ACPI: Added _OSI(Module Device)
[    0.100982] ACPI: Added _OSI(Processor Device)
[    0.100982] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.100982] ACPI: Added _OSI(Processor Aggregator Device)
[    0.100982] ACPI: Added _OSI(Windows 2009)

There is no change to the resume behaviour; the bug still reproduces exactly as described above.

acpidump: Asus X705UQ acpidump · GitHub

generix · December 30, 2017, 3:22pm

None of the usual suspects found in the acpidump.
See if this is reproducible by using bbswitch:
Stop X
unload nvidia modules
load bbswitch
turn off nvidia gpu using bbswitch
turn on nvidia gpu using bbswitch
use cat /proc/acpi/bbswitch to see if it is really on
load nvidia modules
start X
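
The steps above can be sketched as a small script. This is only an outline under assumptions that are not from the thread: X is started by a systemd display manager reachable as display-manager, and the loaded NVIDIA modules are the usual nvidia_drm, nvidia_modeset, nvidia_uvm and nvidia. The helper names run, bbswitch_set and bbswitch_cycle are illustrative. Run it as root, or set DRY_RUN=1 to print the steps without executing anything.

```shell
#!/bin/sh
# Sketch of the bbswitch power-cycle test (run as root).
# Assumed, not from the thread: systemd display-manager alias and the
# standard NVIDIA module names. All function names are hypothetical.

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

bbswitch_set() {
    # bbswitch takes ON or OFF written to its procfs file.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ echo $1 > /proc/acpi/bbswitch"
    else
        echo "$1" > /proc/acpi/bbswitch
    fi
}

bbswitch_cycle() {
    run systemctl stop display-manager                           # stop X
    run modprobe -r nvidia_drm nvidia_modeset nvidia_uvm nvidia  # unload nvidia modules
    run modprobe bbswitch                                        # load bbswitch
    bbswitch_set OFF                                             # turn the GPU off
    bbswitch_set ON                                              # turn the GPU on
    run cat /proc/acpi/bbswitch                                  # should report ON
    run modprobe nvidia                                          # load nvidia modules
    run systemctl start display-manager                          # start X
}
```

Calling bbswitch_cycle with DRY_RUN=1 first is a cheap way to review the sequence before running it for real over ssh.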

generix · December 30, 2017, 3:53pm

To get a bit more info, use the kernel parameter
acpi.aml_debug_output=1
I think you might be hitting this:
https://bugzilla.kernel.org/show_bug.cgi?id=156341
due to the following code in SSDT3:

While (\_SB.PCI0.LKS1 < 0x07)
{
    Sleep (One)
}

dsd_endless · January 1, 2018, 12:44pm

The bbswitch test works fine: X came up using nvidia again. However, when I then did a suspend/resume, it froze again.

I took a look at https://bugzilla.kernel.org/show_bug.cgi?id=156341
With nouveau enabled:

root@endless:/sys/bus/pci/devices/0000:01:00.0/power# cat runtime_enabled 
enabled
root@endless:/sys/bus/pci/devices/0000:01:00.0/power# cat runtime_status 
suspended
root@endless:/sys/bus/pci/devices/0000:01:00.0/power# cat control 
auto
root@endless:/sys/bus/pci/devices/0000:01:00.0/power# lspci
00:00.0 Host bridge: Intel Corporation Device 5904 (rev 02)
00:02.0 VGA compatible controller: Intel Corporation Device 5916 (rev 02)
00:04.0 Signal processing controller: Intel Corporation Skylake Processor Thermal Subsystem (rev 02)
00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 (rev 21)
00:15.1 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1 (rev 21)
00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
00:17.0 SATA controller: Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode] (rev 21)
00:1c.0 PCI bridge: Intel Corporation Device 9d10 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #6 (rev f1)
00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO UART Controller #0 (rev 21)
00:1e.2 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO SPI Controller #0 (rev 21)
00:1f.0 ISA bridge: Intel Corporation Device 9d4e (rev 21)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
00:1f.3 Audio device: Intel Corporation Device 9d71 (rev 21)
00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
01:00.0 3D controller: NVIDIA Corporation GM108M [GeForce 940MX] (rev a2)
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
03:00.0 Network controller: Qualcomm Atheros QCA9377 802.11ac Wireless Network Adapter (rev 31)
root@endless:/sys/bus/pci/devices/0000:01:00.0/power# cat runtime_status 
active
root@endless:/sys/bus/pci/devices/0000:01:00.0/power# cat runtime_status 
suspending
root@endless:/sys/bus/pci/devices/0000:01:00.0/power# cat runtime_status 
suspended

So I do not seem to be facing that issue. Also nouveau can suspend/resume just fine. The issue only happens when using the official nvidia driver.
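
The runtime PM check shown above can be wrapped in a short helper. This is a sketch only; the function name pci_pm_summary is made up, and it simply reads the same three sysfs attributes (control, runtime_enabled, runtime_status) from a PCI device directory.

```shell
#!/bin/sh
# Sketch: print the runtime PM attributes of one PCI device.
# pci_pm_summary is a hypothetical helper name.
# Usage: pci_pm_summary /sys/bus/pci/devices/0000:01:00.0

pci_pm_summary() {
    dev="$1"
    for f in control runtime_enabled runtime_status; do
        # Each attribute is a one-line text file under power/.
        printf '%s=%s\n' "$f" "$(cat "$dev/power/$f")"
    done
}
```

As the session above shows, running lspci wakes the device, so runtime_status is expected to pass through active and suspending before settling back at suspended.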

I added acpi.aml_debug_output=1 and acpi.debug_layer=0x10000000 acpi.debug_level=0xffffffff, but the output of "dmesg | grep -i 'ACPI DEBUG'" is empty, so it doesn’t look like any debug statements are being hit here.

generix · January 2, 2018, 2:34pm

Ok. So it’s still open whether X or the kernel driver is hanging. Did you try
stop X
suspend
resume
start X
?
If that works, does starting acpid help?

dsd_endless · January 2, 2018, 5:21pm

Suspend/resume was fine. Then on starting X again, the kernel logged errors:

[  108.102157] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[  108.102186] NVRM: rm_init_adapter failed for device bearing minor number 0
[  108.259364] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[  108.259390] NVRM: rm_init_adapter failed for device bearing minor number 0
[  108.409171] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[  108.409222] NVRM: rm_init_adapter failed for device bearing minor number 0
[  108.558814] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[  108.558843] NVRM: rm_init_adapter failed for device bearing minor number 0
[  108.709373] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[  108.709498] NVRM: rm_init_adapter failed for device bearing minor number 0
[  108.858905] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[  108.858934] NVRM: rm_init_adapter failed for device bearing minor number 0

and X failed to launch with these errors:

[   328.964] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:1:0:0.  Please
[   328.964] (EE) NVIDIA(GPU-0):     check your system's kernel log for additional error
[   328.964] (EE) NVIDIA(GPU-0):     messages and refer to Chapter 8: Common Problems in the
[   328.964] (EE) NVIDIA(GPU-0):     README for additional information.
[   328.964] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
[   328.964] (EE) NVIDIA(0): Failing initialization of X screen 0

If testing acpid is still relevant, can you specify at which point in the test sequence I should start it?
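
When comparing driver versions, it may help to summarise the NVRM failures rather than read them line by line. The sketch below counts each distinct RmInitAdapter failure code in kernel log text read from stdin. The function name rm_init_failures is illustrative, and the pattern is based only on the message format shown in the log above.

```shell
#!/bin/sh
# Sketch: count distinct RmInitAdapter failure codes in kernel log text.
# rm_init_failures is a hypothetical helper name.
# Usage: dmesg | rm_init_failures

rm_init_failures() {
    # Extract the parenthesised code, then print "<count> <code>" per code.
    sed -n 's/.*NVRM: RmInitAdapter failed! (\([^)]*\)).*/\1/p' |
        sort | uniq -c | awk '{ print $1, $2 }'
}
```

Fed the six failures logged above, this would print a single line, 6 0x26:0xffff:1114, which makes it easy to see whether another driver version fails with the same code.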

generix · January 2, 2018, 6:44pm

At least there’s an error now. acpid should be irrelevant.
Can you check if the 384 driver works?

dsd_endless · January 2, 2018, 7:03pm

Reproduced the same issue on 384.98.

dsd_endless · January 9, 2018, 3:57pm

Also reproduced on driver version 390.12
Also reproduced on Asus X542UQ (NVIDIA GM108 940MX).

generix · January 9, 2018, 8:04pm

Another idea: have you tried the
pcie_port_pm=off
kernel parameter?
Judging by the threads in this forum, the ASUS/940MX combo regularly has problems waking up again.

dsd_endless · January 18, 2018, 10:19pm

That makes no difference; the problem still exists.

tipi · May 15, 2018, 7:04am

Same thing here. The weirdest part is that I can suspend my laptop two to five times before it hangs. I am pretty sure the only problem is that the software skips the step that turns the screen (lights) on. Everything else works fine even when the screen is blank.

borisov.ks · May 24, 2018, 8:16am

Hi all. I have the same issue. I have an Asus N705UN-GC113T laptop with hybrid graphics: Intel i7-8550U and NVIDIA MX150 (the successor to the 940MX).

Everything is OK if I use the Intel GPU (the laptop suspends and resumes successfully). But on NVIDIA I have trouble: after resume I get a black screen with this string: "[ … ] NVRM: Xid (PCI: … ): 32" etc. The computer is completely frozen, does not respond to any command or to virtual terminal switching (Ctrl + Alt + F1), and the CPU is fully loaded.

uname -r

4.13.0-43-generic

lsb_release -a

No LSB modules are available.
Distributor ID:	elementary
Description:	elementary OS 0.4.1 Loki
Release:	0.4.1
Codename:	loki

Nvidia driver

nvidia-396 - NVIDIA binary driver - version 396.24