OS: Ubuntu 22.04
Driver: 530.30.02 / 575.57.08
GPU: RTX 4090
I’m training an LLM. GPU memory usage stays around 70%, but GPU utilization reaches 100%. After several hours of training, the errors shown below appear.
I’ve already disabled GSP by executing the following command:
echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia.conf
I then rebooted the system and verified that GSP is disabled with the following command:
nvidia-smi -q | grep -i gsp
It returns:
GSP Firmware Version : N/A
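As a possible cross-check, the value the loaded kernel module actually picked up can be read back from the driver. This is only a sketch: it assumes the driver exposes its parameters at /proc/driver/nvidia/params in `Name: value` form, and `gsp_param` is a hypothetical helper name, not part of any NVIDIA tool.

```shell
# Print the EnableGpuFirmware value reported by the loaded driver.
# Assumption: parameters are listed as "Name: value" lines in the params file.
# Pass a different path as $1 to read another file.
gsp_param() {
    params_file="${1:-/proc/driver/nvidia/params}"
    # Exact match on the name, so EnableGpuFirmwareLogs is not picked up.
    awk -F': *' '$1 == "EnableGpuFirmware" { print $2 }' "$params_file"
}

# Usage: gsp_param
```

If this prints 0, the option from /etc/modprobe.d/nvidia.conf reached the module. Any other value would suggest the option was not applied at load time.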
Despite this, I still encounter the following error messages:
[ 7783.715385] NVRM: _kgspLogXid119: ********************************************************************************
[ 7783.715391] NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76!
[ 7783.715397] NVRM: subdeviceCtrlCmdMcServiceInterrupts_IMPL: NVRM_RPC: NV2080_CTRL_CMD_MC_SERVICE_INTERRUPTS failed with error 0x65
[ 7784.029575] NVRM: gpuWaitForGfwBootComplete_TU102: failed to wait for GFW_BOOT: (progress 0x3)
[ 7784.029587] NVRM: kgspWaitForGfwBootOk_TU102: failed to wait for GFW boot complete: 0x55 VBIOS version 95.02.3C.C0.7B
[ 7784.029588] NVRM: kgspWaitForGfwBootOk_TU102: (the GPU may be in a bad state and may need to be reset)
[ 7784.029592] NVRM: nvCheckOkFailedNoLog: Check failed: Generic Error: Not ready [NV_ERR_NOT_READY] (0x00000055) returned from kgspWaitForGfwBootOk_HAL(pGpu, pKernelGsp) @ kernel_gsp.c:3669
[ 7784.029627] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[ 7784.031804] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x62:0x55:1860)
[ 7784.032954] NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0
[ 7829.035135] NVRM: Xid (PCI:0000:83:00): 119, pid=4997, name=nvidia-smi, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x78).
[ 7829.035163] NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 103!
[ 7829.035169] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d0002d; hParent=0x00000000; hObject=0x00000000; hClass=0x00000000; paramsSize=0x00000078; paramsStatus=0x00000000; status=0x00000065
[ 7874.035837] NVRM: Xid (PCI:0000:83:00): 119, pid=4997, name=nvidia-smi, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[ 7874.035861] NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 103!
[ 7874.035867] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d0002d; hParent=0xc1d0002d; hObject=0xa55a0020; hClass=0x00000080; paramsSize=0x00000038; paramsStatus=0x00000000; status=0x00000065
[ 7919.036507] NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:83:00 (printing 1 of every 30). The GPU likely needs to be reset.
[ 7919.036516] NVRM: subdeviceCtrlCmdMcServiceInterrupts_IMPL: NVRM_RPC: NV2080_CTRL_CMD_MC_SERVICE_INTERRUPTS failed with error 0x65
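To make it easier to see which device reports which Xid, the log above can be condensed with a small filter. This is a rough sketch that assumes the `Xid (PCI:...): N` line format shown in the messages above; `summarize_xids` is a hypothetical helper name.

```shell
# Count Xid events per PCI device from kernel log text read on stdin.
# Assumption: Xid lines look like "NVRM: Xid (PCI:0000:83:00): 119, ..."
summarize_xids() {
    grep -oE 'Xid \(PCI:[0-9a-fA-F:.]+\): [0-9]+' \
        | sed -E 's/Xid \((PCI:[^)]+)\): ([0-9]+)/\1 xid=\2/' \
        | sort | uniq -c
}

# Usage: sudo dmesg | summarize_xids
```

In the excerpt above, the Xid 119 lines are reported for PCI:0000:83:00, while the RmInitAdapter failure is reported for 0000:02:00.0.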
I cannot provide nvidia-bug-report.log.gz because nvidia-bug-report.sh hangs.
How can I resolve this issue? Thank you for your help.