565.57.01 driver hangs when game crashes

Driver version: nvidia-open-dkms 565.57.01-2
OS: Arch Linux with CachyOS repos
Kernel version: Linux fezzedone-MSI 6.11.6-2-cachyos #1 SMP PREEMPT_DYNAMIC Fri, 01 Nov 2024 17:52:22 +0000 x86_64 GNU/Linux
Kernel parameters: loglevel=3 quiet nowatchdog zswap.enabled=1 systemd.zram=1 nvme_load=YES mitigations=off nvidia-drm.modeset=1 nvidia.NVreg_EnableGpuFirmware=0 ibt=off gamemode=1 preempt=full irqbalance=128 intel_idle.max_cstate_enforced=0 kvm.ignore_msrs=1 intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7 split_lock_detect=off numa_balancing=off pcie_aspm=off mem_sleep_default=deep vm.swappiness=1
GPU: NVIDIA GeForce RTX 3050 Laptop GPU
Issue description: The GPU sometimes hangs when a game (or any other process running on it) crashes or is closed. The crashed game process becomes «zombified» on the GPU and can't be killed, even with SIGKILL. While the GPU is hung like this, nvidia-smi also hangs and can't be killed with SIGKILL either. Relevant system logs from the last time the GPU hang happened:

Nov 05 16:05:30 fezzedone-MSI kernel: NVRM: failed to allocate vmap() page descriptor table!
Nov 05 16:05:30 fezzedone-MSI kernel: NVRM: osMapSystemMemory: failed to create system memory kernel mapping!
Nov 05 16:05:30 fezzedone-MSI kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from memdescMap(*ppMemdescRadix3, 0, allocSize, NV_TRUE, NV_PROTECT_WRITEABLE, &pVaKernel, &pPrivKernel) @ kernel_gsp.c:4213
Nov 05 16:05:30 fezzedone-MSI kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspCreateRadix3(pGpu, pKernelGsp, &pKernelGsp->pSRRadix3Descriptor, NULL, NULL, gspfwSRMeta.sizeOfSuspendResumeData) @ kernel_gsp_tu102.c:1215
Nov 05 16:05:30 fezzedone-MSI kernel: nvidia 0000:01:00.0: can't suspend (nv_pmops_runtime_suspend [nvidia] returned -5)
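
In case it helps anyone collecting the same kind of excerpt, here is a rough Python sketch of how lines like the ones above can be pulled out of the kernel log. It assumes a persistent systemd journal (otherwise `journalctl -b -1` has no previous boot to show), and the function names and the two marker strings are my own choices, not anything provided by the driver.

```python
import subprocess


def nvidia_kernel_lines(journal_text):
    """Keep only kernel log lines that come from the NVIDIA driver.

    The markers are an assumption based on the log lines quoted above:
    'NVRM:' for resource manager messages, 'nvidia ' for PCI device messages.
    """
    markers = ("kernel: NVRM:", "kernel: nvidia ")
    return [
        line
        for line in journal_text.splitlines()
        if any(marker in line for marker in markers)
    ]


def previous_boot_kernel_log():
    """Return the kernel messages of the previous boot as one string."""
    # -k: kernel messages only, -b -1: previous boot, --no-pager: plain output
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    for line in nvidia_kernel_lines(previous_boot_kernel_log()):
        print(line)
```

Reading the previous boot is deliberate here: after a hard reboot, the messages from the hang are no longer in the current boot's log.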

I tried rebooting my laptop after this, but I noticed the shutdown process was left hanging on the NVIDIA power management service and the «zombified» game process. I ended up doing a hard reboot to reset the GPU.
NVIDIA log from after reboot:
nvidia-bug-report.log.gz (723.2 KB)
Note: I can't seem to reproduce the hang anymore, even when I deliberately trigger a SIGSEGV in a game. I had also already rebooted my laptop after the hang, so the log above was not captured while the GPU was hung.
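
For anyone who hits the same hang: a process that survives SIGKILL is usually not a zombie in the strict sense (state `Z`, already exited and waiting to be reaped), but stuck in uninterruptible sleep (state `D`) inside a kernel call. Below is a minimal sketch for checking which one it is by reading `/proc/<pid>/status`. I did not run this during the hang described above, so the `D` state is an assumption, and the helper names are made up for illustration.

```python
import os
import re
from pathlib import Path

# Short explanations for the two states that matter for an unkillable process.
STATE_HINTS = {
    "D": "uninterruptible sleep: blocked inside the kernel, SIGKILL is deferred",
    "Z": "zombie: already exited, waiting for its parent to reap it",
}


def parse_state(status_text):
    """Return the one-letter state code from the text of /proc/<pid>/status."""
    match = re.search(r"^State:\s+(\S)", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'State:' line found")
    return match.group(1)


def describe_state(state):
    """Map a state code to a short hint; other states are not 'stuck'."""
    return STATE_HINTS.get(state, "not in D or Z state")


def process_state(pid):
    """Read the current state code of a running process (Linux only)."""
    return parse_state(Path(f"/proc/{pid}/status").read_text())


if __name__ == "__main__":
    pid = os.getpid()  # placeholder: replace with the PID of the stuck game
    state = process_state(pid)
    print(pid, state, describe_state(state))
```

If the stuck game or nvidia-smi shows up as `D`, that would point at the driver blocking inside the kernel rather than at the game itself.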

My GPU crashed again, although at least there’s no GPU hang this time. nvidia-smi returned the following after the GPU crash:

 > nvidia-smi
Unable to determine the device handle for GPU0000:01:00.0: Unknown Error

… and when I started a game with prime-run, the game screen was black.

NVIDIA log:
nvidia-bug-report.log.gz (565.5 KB)

This also happens to me, though only occasionally, when I suspend.
It makes games laggy and makes nvidia-smi report:
“Unable to determine the device handle for GPU0000:01:00.0: Unknown Error”

NVIDIA log after it happened:
nvidia-bug-report.log.gz (252.4 KB)