Driver version: nvidia-open-dkms 565.57.01-2
OS: Arch Linux with CachyOS repos
Kernel version: Linux fezzedone-MSI 6.11.6-2-cachyos #1 SMP PREEMPT_DYNAMIC Fri, 01 Nov 2024 17:52:22 +0000 x86_64 GNU/Linux
Kernel parameters: loglevel=3 quiet nowatchdog zswap.enabled=1 systemd.zram=1 nvme_load=YES mitigations=off nvidia-drm.modeset=1 nvidia.NVreg_EnableGpuFirmware=0 ibt=off gamemode=1 preempt=full irqbalance=128 intel_idle.max_cstate_enforced=0 kvm.ignore_msrs=1 intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7 split_lock_detect=off numa_balancing=off pcie_aspm=off mem_sleep_default=deep vm.swappiness=1
GPU: NVIDIA GeForce RTX 3050 Laptop GPU
Issue description: The GPU sometimes hangs when a game or anything else running on it crashes or is closed. The crashed game process becomes «zombified» on the GPU and can’t be killed even with SIGKILL. When the GPU is hung like this, nvidia-smi also hangs and can’t be killed with SIGKILL either. Relevant system logs from the last time the GPU hang happened:
Nov 05 16:05:30 fezzedone-MSI kernel: NVRM: failed to allocate vmap() page descriptor table!
Nov 05 16:05:30 fezzedone-MSI kernel: NVRM: osMapSystemMemory: failed to create system memory kernel mapping!
Nov 05 16:05:30 fezzedone-MSI kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from memdescMap(*ppMemdescRadix3, 0, allocSize, NV_TRUE, NV_PROTECT_WRITEABLE, &pVaKernel, &pPrivKernel) @ kernel_gsp.c:4213
Nov 05 16:05:30 fezzedone-MSI kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspCreateRadix3(pGpu, pKernelGsp, &pKernelGsp->pSRRadix3Descriptor, NULL, NULL, gspfwSRMeta.sizeOfSuspendResumeData) @ kernel_gsp_tu102.c:1215
Nov 05 16:05:30 fezzedone-MSI kernel: nvidia 0000:01:00.0: can't suspend (nv_pmops_runtime_suspend [nvidia] returned -5)
I tried rebooting my laptop after this, but the shutdown process got stuck waiting on the NVIDIA power management service and the «zombified» game process. I ended up doing a hard reboot to reset the GPU.
NVIDIA log from after reboot:
nvidia-bug-report.log.gz (723.2 KB)
Note: I apparently can’t reproduce the hang anymore (even by triggering a SIGSEGV in a game), and I have rebooted my laptop since the GPU hang happened.
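If the hang happens again, a small script like the one below could help confirm what state the stuck processes are in. This is only a sketch: it assumes the unkillable game process and nvidia-smi are sitting in uninterruptible sleep (state D) or are zombies (state Z), which I have not verified on the hung system. The function names `parse_stat` and `stuck_processes` are made up for this example; the only thing it relies on is the standard Linux `/proc/<pid>/stat` layout.

```python
import os


def parse_stat(text):
    """Split the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so use the first '(' and the last ')'.
    start = text.index("(")
    end = text.rindex(")")
    pid = int(text[:start].strip())
    comm = text[start + 1:end]
    state = text[end + 1:].split()[0]
    return pid, comm, state


def stuck_processes(proc_root="/proc", states=("D", "Z")):
    """Return (pid, comm, state) for every process in one of `states`."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file was unreadable
        if state in states:
            found.append((pid, comm, state))
    return found


if __name__ == "__main__":
    for pid, comm, state in stuck_processes():
        print(pid, state, comm)
```

A process shown with state D is blocked inside the kernel and will not act on SIGKILL until the call it is waiting on returns, which would match the behaviour described above if the assumption holds.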