Random System Freeze with NVIDIA RTX 2000 Ada Generation Laptop GPU

Dear NVIDIA Support Team,

I am writing to report a critical issue with my NVIDIA RTX 2000 Ada Generation Laptop GPU, installed in a Lenovo ThinkPad P1 Gen 6. The problem manifests as a “scheduling while atomic” kernel BUG followed by a system freeze. It occurs on my Linux system with kernel versions 6.8.1, 6.7.x, 6.6.x, and 6.1.x.

Description of the Problem
At unpredictable points during normal operation, the kernel logs a “scheduling while atomic” BUG. The error appears at seemingly random intervals and under varying system loads. Some time afterwards, the system becomes unresponsive and requires a hard reboot.

Steps to Reproduce

  1. Operate the system under normal conditions.
  2. Encounter a random kernel BUG with “scheduling while atomic”.
  3. The system freezes and becomes unresponsive at some random time afterwards.

System Information

  1. GPU: NVIDIA Corporation AD107GLM [RTX 2000 Ada Generation Laptop GPU] (rev a1) in a Lenovo ThinkPad P1 Gen 6
  2. Linux distribution: Gentoo Linux, kernel 6.8.1 (kernel versions 6.7.x, 6.6.x, and 6.1.x are also affected)
  3. NVIDIA driver: 535.161.07 (the issue occurs with every driver version I have tried)

Additional Information
The issue persists across multiple kernel versions, indicating it is not specific to a particular kernel release.
I have examined the system logs and have identified the occurrence of the “scheduling while atomic” error as the primary issue leading to the kernel panic and subsequent system freeze.
No specific system activity or workload triggers the error; it happens seemingly at random.
I have made sure the GPU drivers are up to date and have reinstalled them, but the bug persists no matter which NVIDIA driver version I install.

Attached Logs
nvidia-bug-report.log.gz (936.4 KB)

This issue severely affects the usability and stability of my system, and I kindly request your assistance in resolving it promptly. If there are any additional diagnostic steps or information required from my end, please let me know, and I will gladly provide it.

Sincerely,
Sarah Salzstein

Please upgrade to the latest 550 driver and set the kernel parameter/module option nvidia-drm.modeset=1.
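
For readers unsure how to apply that option, here is a minimal sketch of the module-option route. The file name /etc/modprobe.d/nvidia-drm.conf and the helper name write_modeset_conf are assumptions chosen for illustration; modprobe reads any *.conf file in that directory. The kernel-parameter route instead means appending nvidia-drm.modeset=1 to the kernel command line in your bootloader configuration.

```shell
#!/bin/sh
# Hypothetical helper: write the nvidia-drm module option to a modprobe.d file.
# The default path is an assumption; pass another path as the first argument.
# Writing under /etc requires root.
write_modeset_conf() {
    conf_file="${1:-/etc/modprobe.d/nvidia-drm.conf}"
    printf 'options nvidia-drm modeset=1\n' > "$conf_file"
}

# Usage (as root), followed by a reboot so the module is reloaded:
#   write_modeset_conf
```

If the nvidia modules are loaded from an initramfs on your distribution, the initramfs probably needs to be regenerated before the option takes effect.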

@generix Thanks for the suggestion. I’ve upgraded to the latest 550 driver and set the kernel parameter/module option nvidia-drm.modeset=1 as you recommended. So far, the “scheduling while atomic” error has not occurred again. However, I got the following error in the meantime:

[Fri Mar 22 00:05:57 2024] [drm:nv_drm_semsurf_fence_create_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to create sync file from fence on ctx 0x00000001

I’ll let you know once the “scheduling while atomic” bug occurs again, as it happens at random intervals.
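
For anyone tracking the same errors, a minimal sketch of how to watch for them in the kernel log. The function name count_gpu_errors is hypothetical; the two patterns are taken from the messages quoted in this thread. It reads log text on stdin, so it should work with dmesg, journalctl -k, or a saved log file.

```shell
#!/bin/sh
# Hypothetical helper: count kernel log lines that mention either of the
# two errors discussed in this thread. Reads log text from stdin.
count_gpu_errors() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at 0 when there are no matches (grep returns 1 then).
    grep -c -E 'scheduling while atomic|nv_drm_semsurf_fence_create_ioctl' || true
}

# Usage:
#   sudo dmesg -T | count_gpu_errors
```

A count that grows between checks would suggest the problem is still occurring, even if the system has not frozen yet.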

Please create a new nvidia-bug-report.log in the current state. The 535 driver had some broken thermal settings on your GPU, and I want to check whether at least that has changed.

@generix Sure. Here you go:
nvidia-bug-report.log.gz (604.0 KB)
Also, the screen flickers with the newest 550.x driver.

Unfortunately, the bug-report logs don’t contain any Xorg logs or config files, so I can only guess what config you’re running:

  • hybrid graphics intel/nvidia
  • xorg
  • nvidia set as primary gpu
  • non-compositing WM

I suspect the flicker you’re observing might be a bug in the 550 driver related to non-compositing WMs. Please make sure that

sudo cat /sys/module/nvidia_drm/parameters/modeset

returns “Y”.
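
The same check can be wrapped in a small function, sketched here under a few assumptions: the name check_modeset is hypothetical, and the optional path argument exists only so the function can be tried against an ordinary file. By default it reads the sysfs parameter quoted above, which may require root.

```shell
#!/bin/sh
# Hypothetical helper: report whether nvidia-drm kernel modesetting is enabled.
# Optional first argument: path to read instead of the sysfs parameter file.
check_modeset() {
    param_file="${1:-/sys/module/nvidia_drm/parameters/modeset}"
    if [ ! -r "$param_file" ]; then
        echo "unknown: $param_file is not readable (module not loaded, or not root)"
        return 2
    fi
    value=$(cat "$param_file")
    if [ "$value" = "Y" ]; then
        echo "modeset enabled"
    else
        echo "modeset disabled (value: $value)"
        return 1
    fi
}

# Usage: define the function in a root shell, then run
#   check_modeset
```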

@generix Thank you for your reply. I don’t use hybrid graphics; I run directly on the NVIDIA GPU and have disabled the Intel card. Yes, I am using a non-compositing window manager (dwm). I also have ‘Y’ set in /sys/module/nvidia_drm/parameters/modeset. Here is my Xorg configuration:

#Section "Device"
#    Identifier  "Intel Graphics"
#    Driver      "intel"
#    Option      "AccelMethod"     "sna"
#EndSection

Section "ServerLayout"
  Identifier  "layout"
  Screen 0    "nvidia"
  Inactive    "intel"
EndSection

Section "Device"
  Identifier  "nvidia"
  Driver      "nvidia"
  BusID       "01:00:0"
  #Option      "AccelMethod"     "glamor"
  Option      "DRI"             "3"
  Option      "Monitor-eDP-1"   "eDP-1"
  Option      "Monitor-HDMI-2"  "HDMI-2"
  Option      "ModeValidation" "NoTotalSizeCheck"
EndSection

Section "Screen"
  Identifier    "nvidia"
  Device        "nvidia"
  Option        "AllowEmptyInitialConfiguration"
  Option        "SLI" "AA"
  Option        "DPMS"
  DefaultDepth  24
  #Option        "AccelMethod"     "glamor"
EndSection

Section "Monitor"
  Identifier    "Monitor0"
  Option        "DPMS"
  #Modeline      "2560x1440_60.00"  312.25  2560 2752 3024 3488  1440 1443 1448 1493 -hsync +vsync
EndSection

Section "Device"
  Identifier  "intel"
  Driver      "modesetting"
  BusID       "00:02:0"
  Option      "AccelMethod"   "glamor"
EndSection

Section "Screen"
  Identifier  "intel"
  Device      "intel"
EndSection

Hello,
I have exactly the same laptop, running Kubuntu 23.10 (with KDE Plasma).
I had no problems with the NVIDIA drivers up until 550.x, where the system froze on startup after the driver install. I fixed that by switching to hybrid mode in the BIOS and to Prime mode using the nvidia-settings app.
Since then I have had no problems.
Also, the compositor in KDE is switched off.
As @generix suggested, I have modeset set to 1:
options nvidia-drm modeset=1
