After running under full load for 82 hours, the GNOME desktop crashes and does not recover automatically, leaving the monitor with no display output. nvidia-smi then reports abnormal values. So far, the issue occurs sporadically on a single machine, with no discernible pattern.
The following GPU driver error appears in dmesg:
[294947.067674] NVRM: nvAssertOkFailedNoLog: Assertion failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from pRmApi->Control( pRmApi, hClient, hDevice, NV0080_CTRL_CMD_INTERNAL_MEMSYS_SET_ZBC_REFERENCED, &params, sizeof(params)) @ mem_mgr_gm107.c:283
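To check whether the same assertion appears more than once, the kernel log can be filtered for NVRM assertion lines. This is only a minimal sketch: the helper name nvrm_assertions is made up for illustration, and the pattern is based solely on the single log line quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print only NVRM assertion lines from kernel log text on stdin.
# The pattern is an assumption derived from the one log line in this report.
nvrm_assertions() {
    grep -E 'NVRM: .*Assertion failed' || true
}

# Filter the current kernel ring buffer if dmesg is available and readable.
if command -v dmesg >/dev/null 2>&1; then
    dmesg 2>/dev/null | nvrm_assertions
fi
```

Reading the ring buffer may require root, depending on the kernel.dmesg_restrict setting.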
After the stress test, nvidia-smi reports abnormal power and utilization values: power draw is 18 W and GPU utilization is 96%. (On a normal idle machine, power draw is 4 W and utilization is 0%.)
After executing the following two commands:
systemctl isolate multi-user.target
systemctl isolate graphical.target
The nvidia-smi output returns to normal and the monitor displays correctly again.
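To catch the abnormal state when it next occurs, something like the following could be run periodically (for example from cron) while the machine is supposed to be idle. It is a rough sketch, not a verified diagnostic: the function name check_sample and both thresholds are assumptions chosen to sit between the idle values (4 W, 0%) and the abnormal values (18 W, 96%) observed here.

```shell
#!/bin/sh
# Assumed thresholds, picked between the idle and abnormal readings in this report.
IDLE_POWER_MAX_W=10
IDLE_UTIL_MAX_PCT=10

# Hypothetical helper: takes one "POWER, UTIL" CSV sample and prints "abnormal" or "ok".
check_sample() {
    echo "$1" | awk -F', *' -v p="$IDLE_POWER_MAX_W" -v u="$IDLE_UTIL_MAX_PCT" \
        '{ if ($1 + 0 > p || $2 + 0 > u) print "abnormal"; else print "ok" }'
}

# Query the first GPU only if nvidia-smi is installed, then log a timestamped verdict.
if command -v nvidia-smi >/dev/null 2>&1; then
    sample=$(nvidia-smi --query-gpu=power.draw,utilization.gpu \
        --format=csv,noheader,nounits 2>/dev/null | head -n 1)
    echo "$(date '+%F %T') $sample $(check_sample "$sample")"
fi
```

A logged "abnormal" line on an idle machine could then be compared against the dmesg timestamps to see whether the NVRM assertion precedes the stuck state.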
Please help analyze the cause of this issue.