535.54.03 - System freezes when idle

Hello,

I ran into an issue after a recent driver update.
I’d like to share some details about it.
Any help would be appreciated.
Thank you.

Issue description

A recent driver update (535.54.03-5) caused one of my systems to freeze when idle. When the screen turns off, it won’t come back on. The GPU fans get loud and the workstation is not responsive at all. Even the Caps Lock LED won’t light up after pressing the Caps Lock key. Switching to another TTY (Ctrl + Alt + Fx) also stopped working. The input and output devices are completely frozen.
It seems the freeze affects only the X11 server.
The OS still works, and I am able to log in to it over SSH.
I can stay logged in via SSH, but I cannot restart the display manager (SDDM in my case).
Once the issue happens, the only way to recover is to restart the whole machine.
The restart also takes a long time (I guess systemd is waiting for the frozen processes).

Technical details below:

  • OS and kernel version:
$ uname -a
Linux msi-kd 5.15.120-1-MANJARO #1 SMP PREEMPT Wed Jul 5 21:45:37 UTC 2023 x86_64 GNU/Linux

(I tried many different kernel versions; the results are always the same.)

  • The computer is a notebook with two GPUs (Intel UHD Graphics 630 + NVIDIA GeForce GTX 1660 Ti). It runs in NVIDIA-only mode (the configuration is managed by the optimus-manager script).
$ sudo lshw -C display
  *-display                 
       description: VGA compatible controller
       product: TU116M [GeForce GTX 1660 Ti Mobile]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:155 memory:a4000000-a4ffffff memory:90000000-9fffffff memory:a0000000-a1ffffff ioport:4000(size=128) memory:a5000000-a507ffff
  *-display
       description: VGA compatible controller
       product: CoffeeLake-H GT2 [UHD Graphics 630]
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 00
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm vga_controller bus_master cap_list rom
       configuration: driver=i915 latency=0
       resources: irq:149 memory:a3000000-a3ffffff memory:80000000-8fffffff ioport:5000(size=64) memory:c0000-dffff
  • When the issue happens, I can see the following error logged in the system journal:
kernel: nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57d:0 2:0:3144:3136

Then I can see the following entries when I try to troubleshoot the issue (I think I tried to restart SDDM via SSH):

sddm[1834]: Failed to read display number from pipe
kernel: [drm:drm_new_set_master [drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to grab modeset ownership
  • Some extra info for diagnostics:
$ nvidia-smi 
Wed Jul 19 01:20:07 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660 Ti     Off | 00000000:01:00.0  On |                  N/A |
| N/A   48C    P3              23W /  80W |   1197MiB /  6144MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      4770      G   /usr/lib/Xorg                               513MiB |
|    0   N/A  N/A      4895      G   /usr/bin/ksmserver                            2MiB |
|    0   N/A  N/A      4897      G   /usr/bin/kded5                                2MiB |
|    0   N/A  N/A      4898      G   /usr/bin/kwin_x11                           138MiB |
|    0   N/A  N/A      4924      G   /usr/bin/plasmashell                         40MiB |
|    0   N/A  N/A      4945      G   ...b/polkit-kde-authentication-agent-1        2MiB |
|    0   N/A  N/A      5039      G   /usr/lib/kdeconnectd                          2MiB |
|    0   N/A  N/A      5045      G   /usr/bin/kaccess                             24MiB |
|    0   N/A  N/A      5099      G   /usr/lib/xdg-desktop-portal-kde               2MiB |
|    0   N/A  N/A      5653      G   /usr/lib/firefox/firefox                    335MiB |
|    0   N/A  N/A      6841      G   /usr/bin/keepassxc                            2MiB |
|    0   N/A  N/A      7825      G   /usr/bin/konsole                              2MiB |
|    0   N/A  N/A      8663      G   /usr/bin/dolphin                              2MiB |
|    0   N/A  N/A      8804      G   /usr/bin/kwrite                               2MiB |
|    0   N/A  N/A      8825      G   /usr/lib/thunderbird/thunderbird            108MiB |
|    0   N/A  N/A     14538      G   /usr/bin/konsole                              2MiB |
|    0   N/A  N/A     14851      G   /usr/bin/dolphin                              2MiB |
|    0   N/A  N/A     15319      G   /usr/bin/konsole                              2MiB |
+---------------------------------------------------------------------------------------+
$ xrandr 
Screen 0: minimum 8 x 8, current 5360 x 2520, maximum 32767 x 32767
HDMI-0 disconnected (normal left inverted right x axis y axis)
DP-0 connected primary 3440x1440+1920+0 (normal left inverted right x axis y axis) 1mm x 1mm
   3440x1440     99.98 + 165.00*  144.00   120.00    59.97  
   2580x1080    164.69    59.94  
   2560x1440     59.95  
   1920x1080    119.88    60.00    59.94    50.00  
   1680x1050     59.95  
   1280x1024     75.02    60.02  
   1280x960      60.00  
   1280x720      60.00    59.94    50.00  
   1152x864      75.00  
   1024x768      75.03    70.07    60.00  
   800x600       75.00    72.19    60.32    56.25  
   720x576       50.00  
   720x480       59.94  
   640x480       75.00    72.81    59.94    59.93  
DP-1 disconnected (normal left inverted right x axis y axis)
eDP-1-1 connected 1920x1080+0+1440 (normal left inverted right x axis y axis) 382mm x 215mm
   1920x1080     60.00*+  59.97    59.96    59.93  
   1680x1050     59.95    59.88  
   1400x1050     59.98  
   1600x900      59.99    59.94    59.95    59.82  
   1280x1024     60.02  
   1400x900      59.96    59.88  
   1280x960      60.00  
   1440x810      60.00    59.97  
   1368x768      59.88    59.85  
   1280x800      59.99    59.97    59.81    59.91  
   1280x720      60.00    59.99    59.86    59.74  
   1024x768      60.04    60.00  
   960x720       60.00  
   928x696       60.05  
   896x672       60.01  
   1024x576      59.95    59.96    59.90    59.82  
   960x600       59.93    60.00  
   960x540       59.96    59.99    59.63    59.82  
   800x600       60.00    60.32    56.25  
   840x525       60.01    59.88  
   864x486       59.92    59.57  
   700x525       59.98  
   800x450       59.95    59.82  
   640x512       60.02  
   700x450       59.96    59.88  
   640x480       60.00    59.94  
   720x405       59.51    58.99  
   684x384       59.88    59.85  
   640x400       59.88    59.98  
   640x360       59.86    59.83    59.84    59.32  
   512x384       60.00  
   512x288       60.00    59.92  
   480x270       59.63    59.82  
   400x300       60.32    56.34  
   432x243       59.92    59.57  
   320x240       60.05  
   360x202       59.51    59.13  
   320x180       59.84    59.32  
  1680x1050 (0x1ca) 146.250MHz -HSync +VSync
        h: width  1680 start 1784 end 1960 total 2240 skew    0 clock  65.29KHz
        v: height 1050 start 1053 end 1059 total 1089           clock  59.95Hz
  1280x1024 (0x1cc) 108.000MHz +HSync +VSync
        h: width  1280 start 1328 end 1440 total 1688 skew    0 clock  63.98KHz
        v: height 1024 start 1025 end 1028 total 1066           clock  60.02Hz
  1280x960 (0x1cd) 108.000MHz +HSync +VSync
        h: width  1280 start 1376 end 1488 total 1800 skew    0 clock  60.00KHz
        v: height  960 start  961 end  964 total 1000           clock  60.00Hz
  1024x768 (0x1d4) 65.000MHz -HSync -VSync
        h: width  1024 start 1048 end 1184 total 1344 skew    0 clock  48.36KHz
        v: height  768 start  771 end  777 total  806           clock  60.00Hz
  800x600 (0x1d7) 40.000MHz +HSync +VSync
        h: width   800 start  840 end  968 total 1056 skew    0 clock  37.88KHz
        v: height  600 start  601 end  605 total  628           clock  60.32Hz
  800x600 (0x1d8) 36.000MHz +HSync +VSync
        h: width   800 start  824 end  896 total 1024 skew    0 clock  35.16KHz
        v: height  600 start  601 end  603 total  625           clock  56.25Hz
  640x480 (0x1dd) 25.175MHz -HSync -VSync
        h: width   640 start  656 end  752 total  800 skew    0 clock  31.47KHz
        v: height  480 start  490 end  492 total  525           clock  59.94Hz
  • I’d also like to share the output of the nvidia-bug-report.sh command, but the website won’t let me upload it (there was an error uploading the file). I will try to do it in a separate message.
    Update: I can’t get the file upload function to work on this forum, sorry. I hope the info above will be sufficient.
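
Since the full bug report could not be attached, the relevant journal entries can at least be pasted inline. Below is a minimal sketch of a filter that keeps only the two kinds of NVIDIA display errors quoted above. The function name filter_nvidia_errors is made up for this example, and the patterns are taken only from the messages seen on this machine, so other driver versions may word them differently.

```shell
#!/bin/sh
# Keep only the NVIDIA display errors quoted in this report.
# NOTE: the function name is hypothetical; the patterns are based solely on
# the log lines shown above and may not cover every driver version.
filter_nvidia_errors() {
    grep -E 'nvidia-modeset: ERROR|\[nvidia-drm\].*Failed to grab modeset ownership'
}

# Typical use after rebooting from a freeze (left commented out here):
#   -k     show kernel messages only
#   -b -1  read the previous boot (needs a persistent journal)
# journalctl -k -b -1 --no-pager | filter_nvidia_errors
```

If the journal is not persistent, the previous boot is not available after a restart. In that case the same filter can be fed from an SSH session while the machine is still frozen, for example with journalctl -k --no-pager for the current boot.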

Best regards,
Kamil.

Just to let you know (if anyone is reading this…) - the issue still persists in version 535.104.05.

It seems to behave a little differently now, though.

The behavior is unchanged if I leave the device idle for some time and let the power manager turn off the monitors. Then I get the usual error:

nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57d:0 2:0:3876:3864

The difference shows up when I try to switch TTYs manually (Ctrl + Alt + Fx).
With previous driver versions, this caused the I/O devices to freeze (leaving the above error in the system logs).
With the current version, I can switch TTYs successfully and nothing bad happens.

Hello, another update.

This time, on version 545.29.06, I got a new error:

nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c57d:0:0:1164

I can’t tell under what conditions this happened, as I’m not monitoring this machine anymore. Most likely it was just idling with an external monitor plugged in. To be honest, I’m not interested in solving this anymore. For over 10 years I’ve been buying NVIDIA-based machines for myself and my customers, but now this is over. I’m not maintaining this anymore. Sorry to say this, but it’s time to say goodbye. By the way, I’m impressed by the quality of the AMD open-source drivers. I literally had to do nothing to get them working, not even a click to switch the drivers. But this is off-topic, so let’s not continue this talk.

For now the issue remains unfixed.

I am going to keep this machine running and updated. Let’s see if it ever gets fixed. If it does, I will surely leave a note. Otherwise this might be my last reply.

Thank you and goodbye.

I wanted to chime in: I have a 1080 Ti and I’m having the same issue. I first noticed the problem with the 545 driver. All monitors go black and I have to reboot the system.

I also have this problem with driver version 525, kernel 5.15.0-89-generic, Ubuntu 22.04. GPU: NVIDIA GeForce RTX 2070 Super with Max-Q Design.