nvidia-bug-report.sh and nvidia-smi hang without returning any output on an RTX 4090

OS: Ubuntu 22.04
Driver: 530.30.02 / 575.57.08
GPU: RTX 4090

I’m training an LLM. GPU memory usage stays around 70%, but GPU utilization reaches 100%. After several hours of running, errors appear in the kernel log.

I’ve already disabled GSP by executing the following command:

echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia.conf

I then rebooted my system, and verified GSP is disabled with the following command:

nvidia-smi -q | grep -i gsp

It returns:

GSP Firmware Version                  : N/A
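
As an additional cross-check, the value the loaded kernel module actually uses can be read back. This is only a minimal sketch: it assumes the driver exposes its registry keys as "Key: value" lines in /proc/driver/nvidia/params, and gsp_param is a made-up helper name, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Print the EnableGpuFirmware value found in a driver params file.
# Assumption: the file contains "Key: value" lines, as
# /proc/driver/nvidia/params is expected to on recent drivers.
# gsp_param is a hypothetical helper name.
gsp_param() {
    params_file="${1:-/proc/driver/nvidia/params}"
    # Match the key exactly so EnableGpuFirmwareLogs is not picked up.
    awk -F': *' '$1 == "EnableGpuFirmware" { print $2; found = 1 }
                 END { exit !found }' "$params_file"
}

# Usage: gsp_param    # reads /proc/driver/nvidia/params by default
```

A printed value of 0 would suggest that the option from /etc/modprobe.d/nvidia.conf was picked up by the module. A non-zero exit status means the key was not found in the file.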

Despite this, I still encounter the following error messages:

[ 7783.715385] NVRM: _kgspLogXid119: ********************************************************************************
[ 7783.715391] NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76!
[ 7783.715397] NVRM: subdeviceCtrlCmdMcServiceInterrupts_IMPL: NVRM_RPC: NV2080_CTRL_CMD_MC_SERVICE_INTERRUPTS failed with error 0x65
[ 7784.029575] NVRM: gpuWaitForGfwBootComplete_TU102: failed to wait for GFW_BOOT: (progress 0x3)
[ 7784.029587] NVRM: kgspWaitForGfwBootOk_TU102: failed to wait for GFW boot complete: 0x55 VBIOS version 95.02.3C.C0.7B
[ 7784.029588] NVRM: kgspWaitForGfwBootOk_TU102: (the GPU may be in a bad state and may need to be reset)
[ 7784.029592] NVRM: nvCheckOkFailedNoLog: Check failed: Generic Error: Not ready [NV_ERR_NOT_READY] (0x00000055) returned from kgspWaitForGfwBootOk_HAL(pGpu, pKernelGsp) @ kernel_gsp.c:3669
[ 7784.029627] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[ 7784.031804] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x62:0x55:1860)
[ 7784.032954] NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0
[ 7829.035135] NVRM: Xid (PCI:0000:83:00): 119, pid=4997, name=nvidia-smi, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x78).
[ 7829.035163] NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 103!
[ 7829.035169] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d0002d; hParent=0x00000000; hObject=0x00000000; hClass=0x00000000; paramsSize=0x00000078; paramsStatus=0x00000000; status=0x00000065
[ 7874.035837] NVRM: Xid (PCI:0000:83:00): 119, pid=4997, name=nvidia-smi, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[ 7874.035861] NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 103!
[ 7874.035867] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d0002d; hParent=0xc1d0002d; hObject=0xa55a0020; hClass=0x00000080; paramsSize=0x00000038; paramsStatus=0x00000000; status=0x00000065
[ 7919.036507] NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:83:00 (printing 1 of every 30).  The GPU likely needs to be reset.
[ 7919.036516] NVRM: subdeviceCtrlCmdMcServiceInterrupts_IMPL: NVRM_RPC: NV2080_CTRL_CMD_MC_SERVICE_INTERRUPTS failed with error 0x65

I cannot get the nvidia-bug-report.log.gz file because nvidia-bug-report.sh hangs.
How can I resolve this issue? Thank you for your help.
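
Since the bug report script hangs, the kernel log is the only record available so far. A rough sketch for summarizing it, assuming the Xid lines keep the "NVRM: Xid (PCI:...): <code>," shape quoted above; xid_codes is a hypothetical helper name.

```shell
#!/bin/sh
# List the distinct Xid codes found in kernel log text read from stdin.
# The pattern follows the "NVRM: Xid (PCI:...): <code>," lines quoted above.
# xid_codes is a hypothetical helper name.
xid_codes() {
    sed -n 's/.*NVRM: Xid (PCI:[^)]*): \([0-9][0-9]*\),.*/\1/p' | sort -nu
}

# Usage (reading the kernel log may need root):
#   dmesg | grep -E 'NVRM|Xid' > nvrm-dmesg.txt
#   xid_codes < nvrm-dmesg.txt
```

Reading dmesg does not go through the NVIDIA driver, so it should still work while nvidia-smi is stuck, and the saved text file can be attached to a report in place of the missing log archive.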

Hi,

I have the same issue on my setup: when my eGPU dies under Linux, nvidia-smi and nvidia-bug-report.sh hang as well.

I haven’t tried this myself, but the help output lists a safe mode that might work:

 nvidia-bug-report.sh -h
 nvidia-bug-report.sh --safe-mode
    Disable certain queries that might hang the system. Useful if you
    experience freezes, high CPU usage, or suspect problematic kernel modules.

Thank you for your response.
I tried running nvidia-bug-report.sh --safe-mode, but the command also hangs and fails to generate the log.
I also saw Xid 62 and Xid 45 errors in the dmesg logs.
I suspect the system freeze might have been caused by GPU overheating, so I have now reduced the GPU load to 70% to see whether the issue still occurs.
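
To test the overheating theory, the temperature could be logged during the run. A minimal sketch, assuming timeout(1) from GNU coreutils is installed; run_bounded and log_gpu_temp are hypothetical helper names. The time limit only helps while the hung process still reacts to signals. A process stuck in uninterruptible sleep inside the driver cannot be killed this way.

```shell
#!/bin/sh
# Run a command but give up after a number of seconds, so a stuck tool
# does not block the calling shell indefinitely. Sends SIGTERM at the
# limit and SIGKILL 5 seconds later. Uses timeout(1) from GNU coreutils.
# run_bounded and log_gpu_temp are hypothetical helper names.
run_bounded() {
    secs="$1"; shift
    timeout -k 5 "$secs" "$@"
}

# Append one CSV sample (time, temperature, utilization, power draw)
# to a log file, defaulting to gpu-temp.csv in the current directory.
log_gpu_temp() {
    run_bounded 10 nvidia-smi \
        --query-gpu=timestamp,temperature.gpu,utilization.gpu,power.draw \
        --format=csv,noheader >> "${1:-gpu-temp.csv}"
}

# Usage: while log_gpu_temp; do sleep 30; done
```

The loop stops as soon as nvidia-smi fails or times out, so the last lines of the CSV would show the temperature shortly before the GPU stopped responding.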