External monitor freezes when using dedicated GPU

I’ve tried switching to the dGPU as the primary display adapter with the following configuration file:

Section "ServerLayout"
        Identifier "layout"
        Screen 0   "dGPU"
        Option     "AllowNVIDIAGPUScreens" "1"
EndSection

Section "Device"
        Identifier "iGPU"
        Driver     "modesetting"
        BusID      "pci:00:02:00"
EndSection

Section "Screen"
        Identifier "iGPU"
        Device     "iGPU"
EndSection

Section "Device"
        Identifier "dGPU"
        Driver     "nvidia"
EndSection

Section "Screen"
        Identifier "dGPU"
        Device     "dGPU"
EndSection

After writing this configuration to /etc/X11/xorg.conf.d/gpu.conf and restarting the Xorg session (logging out and back in), I found that there are no freezes while running glxgears on the discrete GPU. So I’m sure that the low-level discrete GPU driver and rendering engine work as expected in all configurations. However, in the latter configuration there are no lines containing the AIGLX string in /var/log/Xorg.0.log, which is the most significant difference. So I’m almost sure that the root of the freezes in “Reverse PRIME” mode lies in the Xorg AIGLX implementation or in the NVIDIA driver’s interaction with AIGLX.

There is an issue with the above setup: the GUI is laggier and interface rendering is slower. For example, scrolling in Firefox is not as smooth as with the Intel GPU. I’m very surprised that the Intel integrated GPU with the generic modesetting driver runs faster and more smoothly than a powerful discrete NVIDIA adapter with proprietary drivers. :-\

3 Likes

Very interesting observation, thanks for the details @dmakc!
This morning I switched the “intel” driver in the “Device” section of my xorg config to “modesetting”, as in your example above, and used only offloaded Firefox plus regular GIMP during the day. After about 6 hours X crashed and I got two black monitors. The laptop display showed a console prompt, but since it didn’t respond, my only option was to restart the laptop with the reset button. I found nothing interesting in the Xorg and system logs after the restart, so I’m not sure where/how to find the error, though I’m sure it should be somewhere.
Switching back to the “intel” driver for now.

The intel driver (xorg / driver / xf86-video-intel · GitLab) has not been actively developed for about a year, and moreover it doesn’t properly support video decoding acceleration (e.g. in Firefox). So the modesetting driver is preferable to intel, at least for me.

Some additions.
According to PRIME - ArchWiki, the second xrandr provider (radeon in the provided example) shows Source Offload and Sink Offload:

$ xrandr --listproviders

Providers: number : 2
Provider 0: id: 0x7d cap: 0xb, Source Output, Sink Output, Sink Offload crtcs: 3 outputs: 4 associated providers: 1 name:Intel
Provider 1: id: 0x56 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 6 outputs: 1 associated providers: 1 name:radeon

But my xrandr output lacks those capabilities:

$ xrandr --listproviders 
Providers: number : 2
Provider 0: id: 0x45 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 4 outputs: 3 associated providers: 1 name:modesetting
Provider 1: id: 0x270 cap: 0x2, Sink Output crtcs: 4 outputs: 1 associated providers: 1 name:NVIDIA-G0

NVIDIA-G0 has only Sink Output (monitor output capability, as I understand it), but neither Source Output nor any Offload capabilities. Does anyone know why this could happen? I think these missing capabilities force the NVIDIA driver to use AIGLX for rendering Reverse PRIME data to the screen. Or am I missing something? Could any of the NVIDIA developers clarify this possible issue?
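For reference, the cap value is a bitmask defined by the RandR protocol (0x1 Source Output, 0x2 Sink Output, 0x4 Source Offload, 0x8 Sink Offload), so cap: 0x2 indeed decodes to Sink Output only. A quick Python sketch to decode the listing above:

```python
import re

# RandR provider capability bits (from the RandR 1.4 protocol)
CAPS = {
    0x1: "Source Output",
    0x2: "Sink Output",
    0x4: "Source Offload",
    0x8: "Sink Offload",
}

def decode_providers(listproviders_output):
    """Parse `xrandr --listproviders` text and decode each provider's cap mask."""
    providers = {}
    for line in listproviders_output.splitlines():
        m = re.search(r"cap:\s*(0x[0-9a-fA-F]+).*name:\s*(\S+)", line)
        if m:
            mask = int(m.group(1), 16)
            providers[m.group(2)] = [name for bit, name in CAPS.items() if mask & bit]
    return providers

output = """Providers: number : 2
Provider 0: id: 0x45 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 4 outputs: 3 associated providers: 1 name:modesetting
Provider 1: id: 0x270 cap: 0x2, Sink Output crtcs: 4 outputs: 1 associated providers: 1 name:NVIDIA-G0"""

print(decode_providers(output))
```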

PS: I’ve tried adding AllowPRIMEDisplayOffloadSink = true to the Xorg ServerLayout section; the Xorg log shows it and reports that PRIME Render Offload is supported by the X server:

[  6583.789] (**) Option "AllowNVIDIAGpuScreens" "1"
[  6583.789] (**) Option "AllowPRIMEDisplayOffloadSink" "true"
[  6583.789] (**) NVIDIA(G0): Option "UseEdidDpi" "false"
[  6583.789] (**) NVIDIA(G0): Option "DPI" "96 x 96"
[  6583.789] (**) NVIDIA(G0): Enabling 2D acceleration
[  6583.789] (II) Loading sub module "glxserver_nvidia"
[  6583.789] (II) LoadModule: "glxserver_nvidia"
[  6583.789] (II) Loading /usr/lib/xorg/modules/extensions/libglxserver_nvidia.so
[  6583.794] (II) Module glxserver_nvidia: vendor="NVIDIA Corporation"
[  6583.794]    compiled for 1.6.99.901, module version = 1.0.0
[  6583.794]    Module class: X.Org Server Extension
[  6583.794] (II) NVIDIA GLX Module  545.29.02  Thu Oct 26 20:59:27 UTC 2023
[  6583.795] (II) NVIDIA: The X server supports PRIME Render Offload.
[  6583.807] (--) NVIDIA(0): Valid display device(s) on GPU-0 at PCI:1:0:0

… but nothing changes in the xrandr provider capabilities output. :-(

2 Likes

I’ve found a temporary solution for my case: my laptop supports Thunderbolt 4 over a USB-C connector with DisplayPort output. So I bought a USB-C to HDMI converter based on some Realtek chipset, and now I’m using the DP-1 output for the external monitor connection:

$ xrandr --listmonitors
Monitors: 2
 0: +*DP-1 1920/598x1080/336+0+0  DP-1
 1: +eDP-1 1920/344x1080/193+1920+0  eDP-1

There are no freezes while I’m using this output to connect the external monitor. I’ve run a lot of glxgears tests:

$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia __VK_LAYER_NV_optimus=NVIDIA_only glxgears

and there were no freezes at all. This experiment shows that all the above issues are related to the NVIDIA HDMI output.
As I understand the DRM/PRIME interaction, in the bad case the NVIDIA driver renders the image into a memory framebuffer provided by the modesetting driver, which is in turn read back and scanned out to the HDMI output by the NVIDIA driver. But there are some synchronization issues in this process, which cause stuttering and freezes. Maybe these issues are caused by Xorg internals (AIGLX?), but they could also be caused by the NVIDIA driver.

2 Likes

Further investigation shows that if I run

$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia __VK_LAYER_NV_optimus=NVIDIA_only glxgears &
$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia __VK_LAYER_NV_optimus=NVIDIA_only vkcube &

resize the glxgears window to make it larger, place the vkcube window over the lower-right corner of the glxgears window and start resizing the vkcube window, the external monitor instantly freezes with the latest driver version (545.29.02).

Along with the freezing, vkcube crashes after I switch to a text console and return to the GUI to unfreeze the monitor:

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007ffff7cc7d9f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007ffff7c78f32 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff7c63472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x00007ffff7c63395 in __assert_fail_base (fmt=0x7ffff7dd7a90 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x555555560012 "!err", 
    file=file@entry=0x555555560004 "./cube/cube.c", line=line@entry=744, function=function@entry=0x555555562850 <__PRETTY_FUNCTION__.1> "demo_flush_init_cmd")
    at ./assert/assert.c:92
#5  0x00007ffff7c71e32 in __GI___assert_fail (assertion=assertion@entry=0x555555560012 "!err", file=file@entry=0x555555560004 "./cube/cube.c", 
    line=line@entry=744, function=function@entry=0x555555562850 <__PRETTY_FUNCTION__.1> "demo_flush_init_cmd") at ./assert/assert.c:101
#6  0x000055555555f490 in demo_flush_init_cmd (demo=0x7fffffffcd00) at ./cube/cube.c:744
#7  demo_prepare (demo=demo@entry=0x7fffffffcd00) at ./cube/cube.c:2389
#8  0x000055555555f724 in demo_prepare (demo=0x7fffffffcd00) at ./cube/cube.c:2549
#9  0x0000555555559ca8 in demo_resize (demo=0x7fffffffcd00) at ./cube/cube.c:1094
#10 demo_handle_xcb_event (event=0x555555eff760, demo=0x7fffffffcd00) at ./cube/cube.c:2785
#11 demo_run_xcb (demo=0x7fffffffcd00) at ./cube/cube.c:2805
#12 main (argc=<optimized out>, argv=<optimized out>) at ./cube/cube.c:4396

The failing code is:

    err = vkQueueSubmit(demo->graphics_queue, 1, &submit_info, fence);
    assert(!err);

So vkcube fails to submit a command buffer due to some error in queue processing (a stuck queue?).

Before the monitor froze, I monitored the /sys/kernel/debug/dma_buf/bufinfo file to check that the shared DMA-BUFs are updated while everything is working, and they are frequently updated, as expected:

08388608        00000002        00080007        00000006        i915    00000062        <none>
        write fence:0000:00:02.0 signaled seq 518698 signalled
        Attached Devices:
        0000:01:00.0
Total 1 devices attached

08388608        00000002        00080007        00000006        i915    00000061        <none>
        write fence:0000:00:02.0 signaled seq 518702 signalled
        Attached Devices:
        0000:01:00.0
Total 1 devices attached

00004096        00000002        00080007        00000003        i915    00000050        <none>

and after one second:

08388608        00000002        00080007        00000006        i915    00000062        <none>
        write fence:0000:00:02.0 signaled seq 518886 signalled
        Attached Devices:
        0000:01:00.0
Total 1 devices attached

08388608        00000002        00080007        00000006        i915    00000061        <none>
        write fence:0000:00:02.0 signaled seq 518884 signalled
        Attached Devices:
        0000:01:00.0
Total 1 devices attached

00004096        00000002        00080007        00000003        i915    00000050        <none>

Note that the seq numbers (518886 here) increase rapidly.

But when the external monitor is frozen, these numbers are not incremented at all. 0000:00:02.0 is the Intel iGPU PCIe device; 0000:01:00.0 is the NVIDIA discrete GPU device. Physically both are performing well, but their DMA-BUF buffers, related to the external monitor framebuffer (?), are not updated, due either to some error in the direct rendering stack of Xorg (the driver or something similar) or to some command queue processing issue (a race condition?).
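To automate this check, here is a rough Python sketch that extracts the write-fence seq numbers from two bufinfo snapshots and reports buffers whose fences stopped advancing. The column layout of /sys/kernel/debug/dma_buf/bufinfo varies between kernel versions, so the parsing below is an assumption based on my output above:

```python
import re

def fence_seqs(bufinfo_text):
    """Extract write-fence sequence numbers from a dma_buf bufinfo snapshot.

    Buffers are keyed by the id column on the exporter line -- an assumption
    about the debugfs layout on this kernel; adjust the regex for others."""
    seqs, buf_id = {}, None
    for line in bufinfo_text.splitlines():
        head = re.match(r"\d{8}\s+\S+\s+\S+\s+\S+\s+i915\s+(\S+)", line.strip())
        if head:
            buf_id = head.group(1)
        else:
            m = re.search(r"seq\s+(\d+)", line)
            if m and buf_id:
                seqs[buf_id] = int(m.group(1))
    return seqs

def frozen_buffers(before, after):
    """Buffer ids whose fence seq did not advance between two snapshots."""
    return [b for b, s in before.items() if after.get(b) == s]

# Two identical (hypothetical) snapshots: the fence did not advance,
# so the buffer is reported as frozen.
snap1 = """08388608 00000002 00080007 00000006 i915 00000062 <none>
 write fence:0000:00:02.0 signaled seq 518698 signalled"""
snap2 = """08388608 00000002 00080007 00000006 i915 00000062 <none>
 write fence:0000:00:02.0 signaled seq 518698 signalled"""

print(frozen_buffers(fence_seqs(snap1), fence_seqs(snap2)))  # -> ['00000062']
```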

I hope my investigations will help NVIDIA fix these issues in future driver releases as soon as possible.

3 Likes

Thank you so much, @dmakc, for your thorough investigation.

As it happens, I’ve got another laptop now with a more up-to-date hardware spec, and I still see the exact same behaviour as I did with my previous one (which broke down at the hinges, and whose built-in DFP no longer works properly), even after trying every possible combination.

My new laptop is a Lenovo Legion Slim 5 16IRH8 (product: 82YA (LENOVO_MT_82YA_BU_idea_FM_Legion Slim 5 16IRH8)), with 16 GB of RAM (for now) and an NVIDIA GeForce RTX 4060 Laptop GPU (8GB of VRAM) and a 13th Gen Intel(R) Core(TM) i7-13700H CPU (20 logical threads), this time running Linux Mint 21.2 (victoria) (based on ubuntu-22.04), xfce flavour, Linux kernel linux-image-6.2.0-37-generic version 6.2.0-37.38~22.04.1 (pulled in by linux-image-generic-hwe-22.04).

According to what I could see from Xorg logs and xrandr output, my laptop has the internal DFP connected to the iGPU and the HDMI connected to the dGPU (NVIDIA).

When using “regular” “PRIME” configuration, things seemed to work without freezing, but the power consumption was around 20W-25W when idling, which, on my 80Ah battery, would give me just over 3 hours (4 if it sat doing absolutely nothing, which is unrealistic).

When using “reverse PRIME” (what I’m using now, and what I’ve used before on my previous laptop), the freezes are exactly as they were before (see my previous post). Power consumption was in the 10W (doing “nothing”) to 15W-17W (moderate web browsing, watching Amazon Prime or YouTube on fullscreen on the 2560x1600 built-in DFP) range, with the NVIDIA GPU “Video Memory” being marked as “inactive”, as I’d expect, since the browsers (Chromium and Firefox) were not using the dGPU.

I’ve also tried with and without “forcing” VSYNC, “PRIME Synchronization” on and off, with and without G-Sync, etc., all with identical “freezing” results.

I’m absolutely fed up with this bug – it’s a complete disappointment. In the meantime, I’ve decided to order a USB-C to HDMI adapter supporting 4K@60Hz (£10 in the UK) to work around this problem, as I’ve recently read your post regarding the “no freezes” outcome when using a USB-C to HDMI adapter. I do have to say (besides the massive “thank you” for your efforts) to NVIDIA that it’s unacceptable to have to incur extra expenses and the inconvenience of having yet another device “hanging” from your laptop just to have a decent OpenGL/Vulkan experience on an external monitor.

I’ll report back here whether using the USB-C to HDMI adapter has worked for me (it’s going to be a few days until I get it).

Thank you again, @dmakc, for your contributions to this thread. I, for one, very much appreciate it.

I updated to the latest drivers and the freezing issue remains…
I relate to your frustration… I bought a fancy, expensive laptop with a fancy, expensive graphics card, and now I can’t use it with an external monitor unless I let it consume tons of energy… I hope a leading company like NVIDIA solves this soon…

This is also affecting me on my Lenovo Legion 5. @amrits what is the status for bug 4322356?

Hey Guys.

I have the same problem with my HP ZBook Fury 16 G9 with an NVIDIA RTX A2000 8GB GPU installed. I’m using Arch Linux, and my laptop is connected via USB-C to my HP Thunderbolt G4 280W dock. Two screens are connected to the dock’s DisplayPort outputs, but one of them is attached through an HDMI to DisplayPort adapter.

I’m using Xorg with MATE. Before I used the docking station, just one of these screens was connected to the laptop’s HDMI port, and I had the issue that the screen kept restarting itself. On the laptop’s built-in display there is no problem, neither with PRIME nor when switching from the Intel GPU to the NVIDIA one.
The problem only occurs with an external monitor. On Windows 11 everything works fine, so it has something to do with the Linux NVIDIA drivers. I’m using the latest nvidia-dkms packages from Arch.

NVIDIA team, please fix this very soon. If you need any log files, please write me.
PS: Sorry for my english, I’m not a native speaker. :)

Bye
-Roman

UPDATE: On Windows 11 I can see that when I plug in the USB-C connector, the Intel UHD has nothing to do at all; the NVIDIA GPU takes over the graphical part. Maybe that’s helpful? Because on my Linux installation the Intel GPU stays in control until I start PRIME.

I’ve written a simple script to capture DRM, sysprof and xinput debug information while the HDMI output freezes: nvidia-debug.tar.gz (655 Bytes)
After the script starts, I resize glxgears to make it larger, move the vkcube window over the glxgears window and resize the vkcube window intensively to trigger freezes. All the freezes I’ve got are about two seconds long, and with the latest NVIDIA driver version they almost always disappear if I stop resizing, after which the HDMI output/Xorg/glxgears/vkcube continue to work. But sometimes the HDMI output freezes completely, and I’ve captured logs for exactly this case. The sysprof capture looks as follows:

There is a noticeable pause in DRM processing (about 2 seconds), and this pause matches a freeze in the kernel-space stack traces.

I’ve attached captured logs:
dmesg-20231203.121321.log.gz (386.6 KB)
sysprof-20231203.121321.cap.gz (4.2 MB)
xinput-20231203.121321.log.gz (124.6 KB)

I’m using the xinput log to find the start of a freeze, because Xorg event processing usually stops at that moment.
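As a sketch of how the xinput log can be turned into freeze timestamps: assuming each logged event line carries an epoch timestamp (my capture script prefixes one; plain xinput output has none, so this format is an assumption), a gap between consecutive events longer than some threshold marks a candidate freeze:

```python
def find_freezes(timestamps, threshold=1.0):
    """Return (start_time, gap_length) pairs where consecutive input events
    are further apart than `threshold` seconds -- candidate freeze windows."""
    freezes = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > threshold:
            freezes.append((prev, cur - prev))
    return freezes

# hypothetical event timestamps (seconds): a ~2 s gap starting at t=10.5
ts = [10.0, 10.1, 10.2, 10.5, 12.5, 12.6]
print(find_freezes(ts))  # -> [(10.5, 2.0)]
```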

Another sysprof capture with multiple freezes:

It looks like glxgears calls __sched_yield while waiting for something in a /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.545.29.02 library function:

      SELF      TOTAL    FUNCTION
[   0,00%] [  47,38%]      [glxgears]
[   2,76%] [  43,48%]        __sched_yield
[   0,00%] [  40,72%]          - - kernel - -
[   0,72%] [  33,87%]            entry_SYSCALL_64_after_hwframe
[   6,05%] [   6,05%]            entry_SYSCALL_64
[   0,39%] [   0,39%]            entry_SYSCALL_64_safe_stack
[   0,25%] [   0,25%]            syscall_return_via_sysret
[   0,17%] [   0,17%]            entry_SYSRETQ_unsafe_stack
[   3,80%] [   3,80%]        In file /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.545.29.02
[   0,09%] [   0,09%]        In file /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.545.29.02

At the same time, the kernel is busy with nvkms_call_rm IOCTL processing:

      SELF      TOTAL    FUNCTION
[   0,00%] [  47,90%]      [/usr/lib/xorg/Xorg -nolisten tcp -auth /var/run/sddm/{9264ef07-5f53-49b6-a8d3-37feebe247ab} -background none -noreset -displayfd 17 -seat seat0 vt7]
[   0,00%] [  47,54%]        ioctl
[   0,00%] [  47,54%]          - - kernel - -
[   0,00%] [  45,83%]            _nv002771kms
[   0,00%] [  45,83%]              nvkms_call_rm
[   0,01%] [  45,83%]                rm_kernel_rmapi_op
[   0,05%] [  45,82%]                  _nv000577rm

which is in turn called by the /usr/lib/xorg/modules/drivers/nvidia_drv.so Xorg driver.

Another bunch of logs is attached:
dmesg-20231203.130339.log.gz (297.4 KB)
sysprof-20231203.130339.cap.gz (3.0 MB)
xinput-20231203.130339.log.gz (85.3 KB)

4 Likes

I get these freezes and I’m using a USB-C hub with an HDMI port, so I’m curious to know whether your solution works (although I do not have Thunderbolt 4).

This is an amazing effort, I hope that the NVIDIA developers can debug the problem as this seems to affect a lot of users; and unlike you, I do not have a solution via Thunderbolt 4: I have USB-C to HDMI (via a hub) but I get the same freezes.

1 Like

Quick update: I’ve received my USB-C to HDMI adapter, and I’ve used it for a few hours straight. I did not experience a single freeze or stutter, and everything worked as I hoped it would when I first started using this laptop.

There is something that might be significant, too. As the NVIDIA driver worked flawlessly with the built-in laptop screen, and the USB-C to HDMI adapter I got worked just as well, I looked for any similarities that could explain this. I found that both working “outputs” (the built-in laptop screen and the external monitor’s HDMI connection through the USB-C to HDMI adapter) are bound/connected to the integrated GPU (provided by the Intel® Core™ i7-13700HX Processor, Raptor Lake architecture), so I’m left wondering whether this issue is related to the fact that the HDMI port exhibiting this problematic behaviour is bound/connected to the discrete (NVIDIA) video card.

I can provide a few command outputs from my laptop if anybody is interested. Just ask if you are.

What is the model of your USB-C hub? I think it is not a simple DisplayPort (over USB-C) to HDMI converter; it probably has some kind of Thunderbolt 4 (PCIe) connected video adapter (CRTC). Please provide the output of the xrandr --listproviders, xrandr --listactivemonitors and xrandr --verbose commands to get a better understanding of what is happening in your case.

2 Likes

Now I’m definitely sure that the root of our problems is the HDMI connector of the discrete NVIDIA video adapter, which is improperly managed by the NVIDIA/i915 DRM drivers. When the i915 driver is the master in our systems, we get issues (freezes) with NVIDIA’s DRM (PRIME) rendered output, passed by the kernel DRM directly to NVIDIA’s HDMI output (the NVIDIA discrete adapter’s framebuffer and CRTC). But if NVIDIA’s DRM (PRIME) rendered data is output by the i915 driver itself, no freezes occur. The same applies when the NVIDIA video adapter is the master and at the same time controls its own HDMI output: we get no freezes, because no NVIDIA<=>i915 synchronization is needed.

The main question is how we can provide NVIDIA with enough information to solve this issue. It seems that an nvidia-bug-report.sh report is not enough, and the nvidia* kernel modules should be built with extended debug information, along with enabling extended debug output in the user-space Xorg drivers and libraries.

3 Likes

I’ve tried comparing sudo lspci -vvv output before running the tests and just after the freezes, and there is an unexpected status flag, TransPend+, in the latter case:

0000:01:00.0 VGA compatible controller: NVIDIA Corporation GA107M [GeForce RTX 3050 Ti Mobile] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Acer Incorporated [ALI] GA107M [GeForce RTX 3050 Ti Mobile]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 205
	IOMMU group: 16
	Region 0: Memory at 61000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at 6000000000 (64-bit, prefetchable) [size=4G]
	Region 3: Memory at 6100000000 (64-bit, prefetchable) [size=32M]
	Region 5: I/O ports at 4000 [size=128]
	Expansion ROM at 62080000 [virtual] [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00a18  Data: 0000
	Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend+
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (downgraded), Width x8 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [250 v1] Latency Tolerance Reporting
		Max snoop latency: 34326183936ns
		Max no snoop latency: 34326183936ns
	Capabilities: [258 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=500us
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [bb0 v1] Physical Resizable BAR
		BAR 0: current size: 16MB, supported: 16MB
		BAR 1: current size: 4GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB
		BAR 3: current size: 32MB, supported: 32MB
	Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00 v1] Lane Margining at the Receiver <?>
	Capabilities: [e00 v1] Data Link Feature <?>
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_drm, nvidia

Pay attention to the DevSta: line:

DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend+

According to the PCI Express Base Specification, this status bit indicates that the device has issued Non-Posted Requests which have not yet been completed:
trans-pend-image
After some time passes, this bit is cleared without the Completion Timeout (CmpltTO) status bit being set in the Capabilities: [420 v2] Advanced Error Reporting section.

So maybe these freezes are not only software related; there may be some hardware issue as well.
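To spot this condition without reading the whole dump, here is a rough sketch that extracts the DevSta flags from lspci -vvv output (e.g. from sudo lspci -vvv -s 01:00.0); the parsing simply assumes the Flag+/Flag- layout shown above:

```python
def devsta_flags(lspci_text):
    """Extract the DevSta flags from `lspci -vvv` output as a {flag: bool} map."""
    for line in lspci_text.splitlines():
        line = line.strip()
        if line.startswith("DevSta:"):
            # Drop the "DevSta:" label, then split the Flag+/Flag- tokens
            flags = line.split(None, 1)[1].split()
            return {f[:-1]: f.endswith("+") for f in flags}
    return {}

sample = "\t\tDevSta:\tCorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend+"
print(devsta_flags(sample)["TransPend"])  # -> True
```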

3 Likes

Wow man, NVIDIA should definitely hire you :)
Thanks a lot for your detailed tests and explanations! Let’s hope we have a fix soon…

This could also be related to my issue on TB3 Dock monitor connection: 525.89 brings back Thunderbolt 3 connected displays flicker and suspend issues

We’re about 3-ish months from a birthday on this issue guys. Please?

Hi, @roliverio, welcome to the party.

I’d say this issue is much older than 3 months. I’ve experienced these “freezes” ever since I got my previous laptop (around July 2021, IIRC). I didn’t use an external monitor much in those days, but when I did, I noticed those “freezes” every time, even though I can’t place exactly when I first saw this issue, or with which NVIDIA driver version.

1 Like

I updated to 546.29 and it’s still freezing in Windows. Is this issue common on Windows too?