On my ASUS ROG Strix G16 laptop running Linux Mint, the system randomly hard-freezes and the screen goes black during deep learning training on the NVIDIA GPU.
When this happens, I’m dropped to a black TTY-like screen (or nothing at all) with repeated NVIDIA-related errors. I cannot switch TTYs or recover in any way; I have to force a power-off by holding the power button.
This only happens under heavy CUDA load (deep learning training). Normal desktop usage is fine.
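For context, the training job is launched on the dGPU as an ordinary CUDA run, roughly like this (the script name is just a placeholder for my actual training code):

$ CUDA_VISIBLE_DEVICES=0 python train.py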
Hardware:
- Laptop: ASUS ROG Strix G16
- dGPU: NVIDIA RTX 5070 Ti Laptop GPU (PCI ID: 10de:2f58)
- iGPU: AMD Raphael (amdgpu)
- CPU: AMD Ryzen 9 8940HX
- Hybrid graphics: AMD iGPU + NVIDIA dGPU (no external GPU)
Software:
- OS: Linux Mint 22.2 (Zara)
- Kernel: 6.14.0-29-generic

NVIDIA drivers tested:
- nvidia-driver-580-open (recommended by Driver Manager)
- nvidia-driver-570-open
Display stack:
- amdgpu for the iGPU
- nvidia (open kernel modules) for the dGPU
During a deep learning training run on the dGPU, the laptop suddenly goes black. When it drops to a text console, I see errors like:
Nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c67d:0 2:0:4048:4044
snd_hda_intel 0000:01:00.1: unable to change power state from D3cold to D0, device is inaccessible
From dmesg I see repeated messages like:
nvidia 0000:01:00.0: Enabling HDA controller
NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:2f58)
NVRM: installed in this system requires use of the NVIDIA open kernel modules.
[drm:nv_drm_dev_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
The GPU temperature during training stays around 80°C (according to nvidia-smi) right up to the crash, with no thermal throttling or other signs of overheating.
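For reference, I was watching the GPU with a simple nvidia-smi polling loop along these lines:

$ nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu --format=csv -l 5

The readings look normal (around 80°C, no throttling) right up until the screen goes black.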
After the crash, the system is completely unresponsive. The only way out is a hard power-off.
Current driver state:
$ lsmod | grep -E 'nvidia|nouveau'
nvidia_uvm 2076672 0
nvidia_drm 135168 0
nvidia_modeset 1638400 1 nvidia_drm
nvidia 104071168 2 nvidia_uvm,nvidia_modeset
nvidia_wmi_ec_backlight 12288 0
drm_ttm_helper 16384 2 amdgpu,nvidia_drm
video 77824 5 nvidia_wmi_ec_backlight,asus_wmi,amdgpu,asus_nb_wmi,nvidia_modeset
wmi 28672 5 video,nvidia_wmi_ec_backlight,asus_wmi,wmi_bmof,mfd_aaeon
$ lspci -k | grep -A 3 -E "VGA|3D"
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2f58 (rev a1)
Subsystem: ASUSTeK Computer Inc. Device 30f9
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
--
69:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raphael (rev d8)
Subsystem: ASUSTeK Computer Inc. Raphael
Kernel driver in use: amdgpu
Kernel modules: amdgpu
$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Nov 28 11:08 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Nov 28 11:08 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Nov 28 11:08 /dev/nvidia-modeset
crw-rw-rw- 1 root root 507, 0 Nov 28 11:08 /dev/nvidia-uvm
crw-rw-rw- 1 root root 507, 1 Nov 28 11:08 /dev/nvidia-uvm-tools
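For completeness, the loaded kernel module flavor (open vs. proprietary) can be double-checked with:

$ cat /proc/driver/nvidia/version
$ modinfo nvidia | grep -i license

As far as I can tell, the open modules are what is installed and loaded, which is what the NVRM message above says this GPU requires.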
Is this a known issue with RTX 50-series laptop GPUs (PCI ID 10de:2f58) and the open kernel modules on Linux?
Are there recommended kernel parameters, driver versions, or power-management settings for the ROG Strix G16 + RTX 5070 Ti Laptop on Linux to avoid the following? (A sketch of the kind of power-management settings I mean follows this list.)
- Nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress
- snd_hda_intel 0000:01:00.1: unable to change power state from D3cold to D0
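To clarify what I mean by power-management settings: these are the kinds of changes I have seen suggested for runtime-PM problems on hybrid laptops. I have not verified that they are correct or safe for this particular GPU, so please treat them only as a sketch of the direction I'm asking about:

# Keep the dGPU's HDA audio function (0000:01:00.1) out of runtime suspend
$ echo on | sudo tee /sys/bus/pci/devices/0000:01:00.1/power/control

# Disable the NVIDIA driver's dynamic power management via a modprobe option
$ echo 'options nvidia NVreg_DynamicPowerManagement=0x00' | sudo tee /etc/modprobe.d/nvidia-pm.conf
$ sudo update-initramfs -u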
I have attached the nvidia-bug-report.sh output.
nvidia-bug-report.log.gz (346.2 KB)
Any guidance on how to stabilize this GPU under heavy CUDA workloads on Linux would be very much appreciated.