(SOLVED) resume from suspend not working with 980 Ti, drivers 352 - 370, kernels 3.16 - 4.4

I have an ASUS ROG G751 laptop, and for the last few months the nvidia driver has crashed on resume. I run Arch Linux and update weekly. I can normally get about 5-7 successful resumes before a crash. It doesn’t matter whether I’m on battery or A/C, or whether I’m on a text console or in X. My BIOS is current.

Wifi breaks, but if I have an ethernet cable plugged in and am fast, I can get a dmesg before the machine hangs. The tail of it is similar to previously posted logs for the GPU idle errors.
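
Since the window before the hang is short, a small filter run over ssh can pull out just the driver-related lines. This is only a sketch: the function names are made up for illustration, and the search pattern assumes the driver's messages contain "NVRM", "Xid" or "nvidia", which may miss other relevant lines.

```python
import re
import subprocess

# Assumed pattern: "NVRM" is the prefix the nvidia kernel module puts on its
# log lines and "Xid" marks its GPU error reports. Widen it if lines are missed.
GPU_PATTERN = re.compile(r"NVRM|Xid|nvidia", re.IGNORECASE)

def gpu_lines(dmesg_text, last=50):
    """Return at most `last` of the most recent log lines that match GPU_PATTERN."""
    hits = [line for line in dmesg_text.splitlines() if GPU_PATTERN.search(line)]
    return hits[-last:]

def read_dmesg():
    """Capture the kernel log; this may require root on some systems."""
    return subprocess.run(["dmesg"], capture_output=True, text=True).stdout
```

Something like `print("\n".join(gpu_lines(read_dmesg())))` could then be redirected to a file on another machine before the hang sets in.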

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor DRAM Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06)
00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)
00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)
00:1b.0 Audio device: Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller (rev 05)
00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d5)
00:1c.2 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #3 (rev d5)
00:1c.3 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #4 (rev d5)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation HM87 Express LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)
00:1f.3 SMBus: Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller (rev 05)
01:00.0 VGA compatible controller: NVIDIA Corporation GM204M [GeForce GTX 980M] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GM204 High Definition Audio Controller (rev a1)
3b:00.0 Network controller: Intel Corporation Wireless 7260 (rev bb)
3c:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 10)

01:00.0 VGA compatible controller: NVIDIA Corporation GM204M [GeForce GTX 980M] (rev a1) (prog-if 00 [VGA controller])
Subsystem: ASUSTeK Computer Inc. Device 22da
Flags: bus master, fast devsel, latency 0, IRQ 32
Memory at ec000000 (32-bit, non-prefetchable)
Memory at c0000000 (64-bit, prefetchable)
Memory at d0000000 (64-bit, prefetchable)
I/O ports at e000
[virtual] Expansion ROM at 000c0000 [disabled]
Capabilities:
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia

Right then. If none of the remedial suggestions I’ve made have yielded a consistently functioning resume from suspend then we are now officially grasping at straws (unless someone else has a further insight into this issue).

According to the following article, your RM650i was made by CWT, a respected OEM, so there should be no problem with its quality.

July 20, 2016
Corsair RM650x PSU Review - Tom’s Hardware
[url]http://www.tomshardware.com/reviews/corsair-rm650x-psu,4611.html[/url]

:: Channel Well Technology Co.,Ltd. ::
[url]http://www.cwt.com.tw/[/url]

I’m assuming that you researched the wattage your power supply would need to cover your current PC’s configuration, plus a little extra to accommodate any reasonable expansion? If not:

EVGA - Power Meter
[url]http://www.evga.com/power-meter/[/url]

From the product page for Version 352.55 (the oldest nVidia driver that supports the GTX 980 Ti)

[i]"Known Issues with this release:

  • Resuming from suspend may not be reliable on GeForce GTX 9xx boards in some configurations."[/i]

Drivers | GeForce
[url]http://www.geforce.com/drivers/results/92826[/url]

It seems you have one such configuration, so to save on idle power consumption you’ll have to shut your machine down or, if possible, schedule it to do so after a user-defined period of inactivity.

BTW, installing mate-themes via the Synaptic Package Manager will yield an attractive charcoal theme called BlackMATE. Installing grml-rescueboot will allow you to loop-mount Linux Mint .iso images, so that installing a fresh OS can occur at HDD or SSD speeds.

Grub2/ISOBoot - Community Ubuntu Documentation
[url]https://help.ubuntu.com/community/Grub2/ISOBoot[/url]

One more point to consider:

Though it employs an nVidia GPU, your EVGA 980 Ti Hybrid is still an EVGA product sporting EVGA’s custom PCB, VRM and firmware / BIOS, all of which differ from a card manufactured by nVidia itself or by any of the other nVidia-based graphics card manufacturers. Perhaps starting a help thread on the EVGA Forums may draw in some further user insight into resolving the resume from suspend issue:

EVGA GeForce 900/TITAN X Series - EVGA Forums
[url]http://forums.evga.com/EVGA-GeForce-900TITAN-X-Series-f99.aspx[/url]

To avoid going over covered ground, you could quote or copy and paste selected portions of this thread to bring EVGA forum members up to speed more quickly on which steps have already been taken.

FWIW

I suspect that how a *motherboard’s UEFI / BIOS ‘Power’, ‘DDR power down mode’ and ‘S3 Video Repost’ sections (or however they’re worded) are set may influence the effectiveness of the following info:

pm-suspend(8): Suspend/Hibernate your computer - Linux man page
[url]https://linux.die.net/man/8/pm-suspend[/url]

Power management/Suspend and hibernate - ArchWiki
[url]https://wiki.archlinux.org/index.php/Power_management/Suspend_and_hibernate[/url]

UnderstandingSuspend - Ubuntu Wiki
[url]https://wiki.ubuntu.com/UnderstandingSuspend[/url]

(Power Management S3 Tricks and Tips)
Kernel/Reference/S3 - Ubuntu Wiki
[url]https://wiki.ubuntu.com/Kernel/Reference/S3[/url]

*EDIT

Some clues from your motherboard’s .pdf manual:

Page 66:

[i]- ‘Native ASPM [Disabled]’

  • ‘DMI Link ASPM Control [Disabled]’
  • ‘ASPM Support [Disabled]’[/i]

Page 67:

[i]- ‘DMI Link ASPM Control [Disabled]’

  • ‘PEG ASPM [Disabled]’[/i]

Page 71:

[i]- ‘ErP Ready [Disabled]’

  • ‘Deep S4 [Disabled]’[/i]

E10768__Z170M-PLUS_UM_WEB.pdf
[url]http://dlcdnet.asus.com/pub/ASUS/mb/LGA1151/Z170M-PLUS/E10768__Z170M-PLUS_UM_WEB.pdf?_ga=1.262171084.1587441999.1477231416[/url]

Z170M-PLUS | Motherboards | ASUS Global
[url]https://www.asus.com/Motherboards/Z170M-PLUS/[/url]

Many thanks for all the hints and tips. Going through all the options thoroughly has turned out to be a bit complicated, so I’ll make this a weekend project.
It’s my production system so I don’t want to leave it in a non-working state.

I’ve also got another hardware platform available and will install the card there to see whether the behavior is similar.

A production machine? Do you have a backup graphics card in case ESD or another unforeseen disaster strikes? Murphy’s law being what it is.

Have you seen this?

Post #3

“Barteks2x, I have the same laptop, you have to add acpi_osi="!Windows 2013" to kernel command line for suspend/resume to work for kernels >3.14”

Struggling with my Geforce GT 635M and the nvidia Linux driver. - NVIDIA Developer Forums
[url]https://devtalk.nvidia.com/default/topic/955952/linux/struggling-with-my-geforce-gt-635m-and-the-nvidia-linux-driver-/[/url]

I have not seen that. Thanks for the tip, I’ll try.

The machine has on board graphics if the nvidia card should fail.

What makes this difficult for me to troubleshoot is the erratic nature of the failures. It doesn’t always fail to resume. Sometimes it does, sometimes it doesn’t. I have not been able to find any pattern in when it fails.

Have you tried clearing the CMOS since the hit & miss resume from suspend behavior began?

Asus is pretty diligent about releasing new UEFI / BIOS updates. You may want to check in once per month for any new ones.

Exactly the same problem here with:

Nvidia Driver Version: 375.10
Xorg: 1.18.4 (11804000)
Kernel: 4.8.6.1
Mainboard: Asus Z170-P D3
Bios: 2002
GPU: GTX Titan X (Maxwell)

Sometimes the system resumes correctly. But in > 50% of the cases it shows a black screen, and when I SSH in, I see Xorg at 100% load. Kill commands have no effect.

If the machine has been asleep for some hours, resumes fail in 90+ percent of all cases.

nvidia-bug-report:
http://www.naanoo.com/upstream/nvidia-bug-report.log.gz

Tried adding acpi_osi="!Windows 2013" to the command line. No perceived change.
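
Before ruling the parameter out, it may be worth confirming that it actually reached the running kernel. The sketch below parses a kernel command line and looks for a given parameter. The helper names are hypothetical, and it assumes the command line uses ordinary shell-style double quotes, as in the suggestion quoted above.

```python
import shlex

def param_values(cmdline, name):
    """Return every value given for `name` on a kernel command line.

    A bare flag (no "=") is reported as None. A parameter may appear more
    than once, so a list is returned.
    """
    values = []
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        if key == name:
            values.append(value if sep else None)
    return values

def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()
```

If `param_values(read_cmdline(), "acpi_osi")` comes back empty, the parameter never reached the kernel, for example because the bootloader configuration was not regenerated after the edit.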

I had another system available and temporarily moved the GPU to it to troubleshoot.
Resume from suspend has worked perfectly in this setup for at least 10 resumes now with no failures, and I’m tempted to conclude it will not fail in this system. Sadly, it is an antique system, so I need to move the GPU back to my modern problem system for daily work. So I believe the GPU itself has no problems, and the problem is not the driver.

The working system is an old Intel Core 2 platform, called Core from here on, and the new problematic system is a Skylake. The Skylake system WITHOUT the GPU resumes without any failures.

I now believe (expert advice welcome) that the problem lies in the Mobo / GPU / BIOS combination, hopefully as the result of a BIOS setting.

Details about the platforms:
Core:
Mobo: Gigabyte EP45-UD3LR
CPU: Intel Core2 Quad
PSU: Nexus 600
Integrated graphics: No

Skylake:
Mobo: Asus Z170M-PLUS
CPU: Intel Core i7-6700
PSU: Corsair RM650i
Integrated graphics: yes (unused)

I’m really in the dark here, but I have one guess. The GPU resumes fine in the old system, which I believe is slow. As naanoo writes, the GPU seems to fail more often the longer it has been suspended. Maybe the GPU discharges more and takes longer to get enough power to wake up and respond? My guess now is that the Skylake system resumes faster and that the GPU is not given enough time to wake up and respond. The OS then times out waiting for the GPU’s response and Xorg crashes. Is this reasonable?
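
One way to put the duration guess to the test would be to note, for each suspend, how long the machine slept and whether it resumed, then compare failure rates per duration bucket. The following is a minimal sketch of that tally. The bucket boundaries and names are arbitrary choices for illustration, and the records would have to be collected by hand or by a script of your own.

```python
import bisect
from collections import Counter

EDGES = (60, 600, 3600)  # bucket boundaries in seconds (arbitrary)
LABELS = ("<1min", "1-10min", "10-60min", ">=1h")

def failure_rates(records):
    """records: iterable of (suspend_seconds, resumed_ok) pairs.

    Returns {bucket_label: fraction_of_failed_resumes} for the buckets
    that contain at least one record.
    """
    totals, fails = Counter(), Counter()
    for seconds, resumed_ok in records:
        label = LABELS[bisect.bisect_right(EDGES, seconds)]
        totals[label] += 1
        if not resumed_ok:
            fails[label] += 1
    return {label: fails[label] / totals[label] for label in LABELS if totals[label]}
```

A clearly higher rate in the longest bucket would be consistent with the guess, although with only a handful of resumes per bucket the numbers will be noisy.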

The Skylake system runs the latest BIOS and CMOS has been cleared as instructed. Still failing.

I ssh’d into the Skylake system when Xorg had crashed on resume.
The GPU shows up in lspci. Does this mean that it is awake and responding correctly?
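
As far as I know, a listing in lspci only shows that the device is still enumerated on the bus. It does not by itself show that the driver can talk to it. One rough extra check, offered as an assumption rather than a diagnosis, is to read the start of the device's PCI config space from sysfs and see whether the vendor ID comes back as all ones, which is what reads typically return when a device has stopped responding. The bus address below is borrowed from the laptop listing at the top of the thread and will likely differ on the Skylake desktop. The function names are hypothetical.

```python
import struct

def read_ids(config_bytes):
    """Decode (vendor_id, device_id) from the first 4 bytes of PCI config space."""
    vendor, device = struct.unpack_from("<HH", config_bytes, 0)
    return vendor, device

def looks_responsive(config_bytes):
    """Heuristic only: a vendor ID of 0xFFFF means config reads are not reaching the device."""
    vendor, _ = read_ids(config_bytes)
    return vendor != 0xFFFF

def read_config(address="0000:01:00.0"):
    """Read the first 64 bytes of config space; `address` is an assumed bus address."""
    with open(f"/sys/bus/pci/devices/{address}/config", "rb") as f:
        return f.read(64)
```

For an NVIDIA card, `read_ids(read_config())` should report vendor ID 0x10de. Even then, a sane vendor ID only means the card answers config reads, not that the GPU has fully come back from suspend.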

@ JonathanAnderson

What speaks against your Mobo/GPU/BIOS thesis: on the same machine on which I have the problem, suspend/resume works fine with Ubuntu 16.04 … 100% of the time … for 1+ year now. It still does when I boot the other drive.

What I am sure of:

On my machine it has nothing to do with:

  • audio interface
  • other pcie cards
  • usb devices
  • the monitors
  • hard/ssd-drives

… I have been testing for 2 or 3 weeks already ;-)

Sorry naanoo, I did not understand this.
Do you mean that you are DUAL booting, and that the problem is only with Ubuntu, not with Windows?

naanoo, do you have CUDA installed?

I have two SSDs:

  1. Arch Linux
  2. Ubuntu Gnome 16.04

With Ubuntu there are no problems suspending / resuming.

Yes, I have CUDA installed.