Problem after suspend: Asynchronous wait on fence.

Hi,

I’m having some issues when awaking from suspend to RAM which seems to be related with the NVIDIA card. The point is that suspend/resume works perfectly first but after repeating the suspend/resume operation the “Asynchronous wait on fence” error is fired on dmesg and the kscreenlocker_greet, Chromium and Firefox eats 100% of CPU. If I stop the SDDM and restart the Xorg server, then the Xorg also eats 100% of CPU.

It seems that the problem is reproduced after suspending and resuming 10 or 11 times. When the problem is reproduced I can see the mouse pointer with a blank screen and if I try to move the pointer it has a lot of lag. If I’m patient, I can also change to TTY1 (I can take up to 1 minute to change the TTY) in order to collect data.

That’s the dmesg output while resume:

[34344.779303] ACPI: Low-level resume complete
[34344.779368] ACPI: EC: EC started
[34344.779368] PM: Restoring platform NVS memory
[34344.780006] do_IRQ: 0.55 No irq handler for vector
[34344.780053] Enabling non-boot CPUs ...
[34344.780096] x86: Booting SMP configuration:
[34344.780096] smpboot: Booting Node 0 Processor 1 APIC 0x2
[34344.780919] CPU1 is up
[34344.780965] smpboot: Booting Node 0 Processor 2 APIC 0x1
[34344.781828] CPU2 is up
[34344.781865] smpboot: Booting Node 0 Processor 3 APIC 0x3
[34344.782716] CPU3 is up
[34344.785933] ACPI: Waking up from system sleep state S3
[34344.855515] ACPI: EC: interrupt unblocked
[34344.856492] nvidia 0000:01:00.0: Refused to change power state, currently in D3
[34345.004170] ACPI: EC: event unblocked
[34345.119880] nvme nvme0: 4/0/0 default/read/poll queues
[34345.137091] ath10k_pci 0000:04:00.0: failed to get board id: -95
[34345.205456] ath10k_pci 0000:04:00.0: unsupported HTC service id: 1536
[34345.281021] usb 1-6: reset high-speed USB device number 2 using xhci_hcd
[34345.523085] usb 1-8: reset full-speed USB device number 4 using xhci_hcd
[34345.779262] OOM killer enabled.
[34345.779263] Restarting tasks ... 
[34345.784247] Bluetooth: hci0: using rampatch file: qca/rampatch_usb_00000302.bin
[34345.784249] Bluetooth: hci0: QCA: patch rome 0x302 build 0x3e8, firmware rome 0x302 build 0x111
[34345.795533] done.
[34345.809553] video LNXVIDEO:00: Restoring backlight state
[34345.809691] video LNXVIDEO:01: Restoring backlight state
[34345.810783] PM: suspend exit
[34345.810798] elogind-daemon[5638]: System resumed.
[34345.811265] elogind-daemon[5638]: Lid opened.
[34345.839517] Bluetooth: hci0: using NVM file: qca/nvm_usb_00000302.bin
[34345.918820] psmouse serio1: synaptics: queried max coordinates: x [..5656], y [..4692]
[34345.948878] psmouse serio1: synaptics: queried min coordinates: x [1282..], y [1160..]
[34350.749636] wlp4s0: authenticate with d8:fb:5e:83:13:d3
[34350.786687] wlp4s0: send auth to d8:fb:5e:83:13:d3 (try 1/3)
[34350.787590] wlp4s0: authenticated
[34350.789642] wlp4s0: associate with d8:fb:5e:83:13:d3 (try 1/3)
[34350.792588] wlp4s0: RX AssocResp from d8:fb:5e:83:13:d3 (capab=0x511 status=0 aid=63)
[34350.794598] wlp4s0: associated
[34350.794670] ath: EEPROM regdomain: 0x82d4
[34350.794671] ath: EEPROM indicates we should expect a country code
[34350.794672] ath: doing EEPROM country->regdmn map search
[34350.794673] ath: country maps to regdmn code: 0x37
[34350.794674] ath: Country alpha2 being used: ES
[34350.794674] ath: Regpair used: 0x37
[34350.794675] ath: regdomain 0x82d4 dynamically updated by country element
[34350.806391] wlp4s0: Limiting TX power to 23 (23 - 0) dBm as advertised by d8:fb:5e:83:13:d3
[34356.894594] Asynchronous wait on fence NVIDIA:nvidia.prime:3c13e timed out (hint:intel_atomic_commit_ready+0x0/0x42 [i915])

Everything is normal except those log entries:

[34344.856492] nvidia 0000:01:00.0: Refused to change power state, currently in D3

(...)

[34356.894594] Asynchronous wait on fence NVIDIA:nvidia.prime:3c13e timed out (hint:intel_atomic_commit_ready+0x0/0x42 [i915])

The only error that I can see on Xorg.0.log is:

[ 34374.292] (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00004974, 0x00004a08)
[ 34381.292] (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00004974, 0x00004a08)
[ 34384.293] (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00004974, 0x00004bc8)
[ 34391.293] (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00004974, 0x00004bc8)

That’s the lspci -kkk -vvvv output for the NVIDIA device when the error is reproduced:

01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2)
        Physical Slot: 1
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 136
        Region 0: Memory at dc000000 (32-bit, non-prefetchable) 
        Region 1: Memory at b0000000 (64-bit, prefetchable) 
        Region 3: Memory at c0000000 (64-bit, prefetchable) 
        Region 5: I/O ports at e000 
        [virtual] Expansion ROM at dd000000 [disabled] 
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [250 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia

The last time I reproduced the error I forgot to run nvidia-bug-report and I just collected the data that I shared here. The next time that the error is reproduced I’ll run nvidia-bug-report in order to share it here (sorry for that).

My configuration is: KD5 + Plasma + Nvidia.

My kernel version is: 5.2.7-gentoo.
NVIDIA driver version: 430.40.
Xorg server version: 1.20.5.
Plasma framework version: 5.60.0.
Plasma desktop version: 5.15.5.

Thank you for your time,
Regards.

I was able to reproduce the problem after suspending and resuming 12 times. I’m pretty sure that the issue is always reproduced after suspending 12 times. Once the issue is reproduced I must restart the system in order to be able to use the X system again. It’s not possible to suspend the system again (in order to check if after another suspension it can get back to D0).

Once the problem was reproduced, I tried to fix that by enabling the card again with the following commands:

echo 1    > /sys/bus/pci/devices/0000:01:00.0/enable
echo 'on' > /sys/bus/pci/devices/0000:01:00.0/power/control
echo 1    > /sys/bus/pci/devices/0000:01:00.0/rescan

But it didn’t help at all.

In the previous post I forgot to mention that I’m using the ACPI OSI “Windows 2009” on kernel parameters.

Here’s the nvidia-bug-report, which I collected now.

Thank you again for your time,
Regards.
nvidia-bug-report.log.gz (5.54 MB)

By checking the lspci values, I can see that when the resume fails those “Adress” and “Data” values are set to 0:

Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 0000000000000000  Data: 0000

On a fresh start the values are:

Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee08004  Data: 4026

After the first suspend/resume (which works) the values are:

Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee01004  Data: 402c

After the second suspend/resume (which also works) the Adress and Data values are kept the same:

Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee01004  Data: 402c

And the same Adress and Data are shown after the third suspend/resume (which also works).

I don’t know if it’s relevant or if it’s just a side effect derived from the problem, but the 0 address and the 0 data called my attention since it doesn’t seem to be normal.

Hi,

Since the nvidia-drivers README says that the MSI can cause problems with resume and since the MSI Address and Data seems to be “corrupted” after suspending the machine 12 times, I decided to disable MSI with the module parameter NVreg_EnableMSI=0.

It seems to solve the problem. I’ve suspended and resumed the maching for 14 times now. The performance seems to be OK and the power consumption too.

However, there seems to be something weird. After successfully suspend/resume for 12 times, the lspci output changed, and the Subsystem is not shown anymore (¡¡!!). That’s the lspci output for a fresh start (and for the first 11 suspend/resume) with MSI disabled:

01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2)
        Subsystem: Dell GM107M [GeForce GTX 960M]
        Physical Slot: 1
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at dc000000 (32-bit, non-prefetchable) 
        Region 1: Memory at b0000000 (64-bit, prefetchable) 
        Region 3: Memory at c0000000 (64-bit, prefetchable) 
        Region 5: I/O ports at e000 
        [virtual] Expansion ROM at dd000000 [disabled] 
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [250 v1] Latency Tolerance Reporting
                Max snoop latency: 3145728ns
                Max no snoop latency: 3145728ns
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia

And that’s the lspci output after 12 suspend/resume (the same is shown after suspending 13 times, 14 times, etc.):

01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2)
        Physical Slot: 1
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at dc000000 (32-bit, non-prefetchable) 
        Region 1: Memory at b0000000 (64-bit, prefetchable) 
        Region 3: Memory at c0000000 (64-bit, prefetchable) 
        Region 5: I/O ports at e000 
        [virtual] Expansion ROM at dd000000 [disabled] 
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [250 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia

Note that, in addition of Subsystem not being shown, those values have changed to 0:

Capabilities: [250 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns

So conclusions:

  • The problem is related with MSI.
  • There’s some kind of “corruption” after suspending 12 times.
  • That “corruption” seems to happen also when MSI is disabled.
  • That “corruption” will only affect in a negative way while MSI is enabled (but it seems that the “corruption” is still present when it’s disabled).

Since I can suspend/resume more than 12 times now, but since the “corruption” seems to still be present, I consider that issue as partially resolved, but not as fully resolved. Some support to get it fully resolved would be appreciated.

Thank you for your time.

There’s something I missed. With MSI disabled I can still see that error in dmesg after suspending the system 12 times:

[  753.290652] nvidia 0000:01:00.0: Refused to change power state, currently in D3

But if MSI is disabled, everything works, even if the card refuses to change the power state (maybe, when MSI is disabled, there’s some quirk that is executed after that message is shown?).

The message is also shown in any next suspend/resume (it was also present after suspending 13 times, 14 times, etc.).