Greetings, this is my first time posting a topic on the forums, so I hope this issue hasn’t already been covered here (I couldn’t find any existing posts about this specific problem).
After upgrading to the 575.xx driver series, the system eventually loses access to the dedicated GPU entirely; the GPU appears to power off at some point and never recovers. Rebooting the system makes the GPU usable again until the issue recurs. This happens regardless of whether the GSP firmware is enabled, although in my experience it takes longer to occur on average with the GSP firmware enabled.
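To check whether the card has actually been runtime-suspended when this happens, something like the following should work (a minimal sketch, assuming the dGPU sits at PCI address 0000:01:00.0 as in the dmesg output below, and that the standard kernel runtime-PM sysfs attributes are present):

```python
#!/usr/bin/env python3
# Minimal sketch: print the runtime-PM state of the dGPU via standard kernel
# sysfs attributes. Assumes the dGPU is the device at 0000:01:00.0 (the PCI
# address shown in the dmesg output below).
from pathlib import Path

power = Path("/sys/bus/pci/devices/0000:01:00.0/power")

for attr in ("runtime_status", "control", "runtime_active_time", "runtime_suspended_time"):
    node = power / attr
    value = node.read_text().strip() if node.exists() else "n/a"
    print(f"{attr}: {value}")
```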
On 575.64 I get the following dmesg output when the issue happens, indicating some sort of out-of-memory error:
[ 1137.596055] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1353
[ 1137.596061] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from status @ kernel_gsp.c:4615
[ 1137.596075] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from kgspCreateRadix3(pGpu, pKernelGsp, &pKernelGsp->pSRRadix3Descriptor, NULL, NULL, gspfwSRMeta.sizeOfSuspendResumeData) @ kernel_gsp_tu102.c:1303
[ 1137.604703] nvidia 0000:01:00.0: can't suspend (nv_pmops_runtime_suspend [nvidia] returned -5)
[ 1569.669102] show_signal_msg: 38 callbacks suppressed
[ 1569.669104] steam[15009]: segfault at 0 ip 00000000f7cdcdc3 sp 00000000ffc87f48 error 4 in libc.so.6[89dc3,f7c53000+15f000] likely on CPU 7 (core 12, socket 0)
[ 1569.669111] Code: c9 0f 84 fa 00 00 00 40 a8 03 74 1e 8a 08 38 ca 0f 84 16 01 00 00 84 c9 0f 84 e3 00 00 00 40 eb 09 8d b6 00 00 00 00 83 c0 10 <8b> 08 31 d1 bf ff fe fe fe 01 cf 0f 83 d3 00 00 00 31 cf 81 cf ff
[ 1574.639813] NVRM: Error in service of callback
Previously, on driver 575.57 with GSP firmware enabled, I got this log instead:
[ 8089.608012] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1353
[ 8089.608025] NVRM: faultbufCtrlCmdMmuFaultBufferRegisterNonReplayBuf_IMPL: Error allocating client shadow fault buffer for non-replayable faults
[ 8089.704996] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1353
[ 8089.705010] NVRM: faultbufCtrlCmdMmuFaultBufferRegisterNonReplayBuf_IMPL: Error allocating client shadow fault buffer for non-replayable faults
It seems some assertions were added (or maybe they’re just enabled in testing releases), so perhaps this issue is already known?
Regardless, I also had this log output on 575.57 with GSP firmware disabled:
[ 99.260465] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1353
[ 99.260470] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from status @ kernel_gsp.c:4615
[ 99.260483] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from kgspCreateRadix3(pGpu, pKernelGsp, &pKernelGsp->pSRRadix3Descriptor, NULL, NULL, gspfwSRMeta.sizeOfSuspendResumeData) @ kernel_gsp_tu102.c:1303
[ 99.267540] nvidia 0000:01:00.0: can't suspend (nv_pmops_runtime_suspend [nvidia] returned -5)
[ 123.410734] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 103!
[ 123.410738] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d0002f; hParent=0x00000000; hObject=0x00000000; hClass=0x00000000; paramsSize=0x00000078; paramsStatus=0x00000000; status=0x00000011
[ 123.410740] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 103!
[ 123.410741] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d0002f; hParent=0xc1d0002f; hObject=0xcaf00000; hClass=0x00000080; paramsSize=0x00000038; paramsStatus=0x00000000; status=0x00000011
[ 123.410750] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 103!
[ 123.410751] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d0002f; hParent=0xcaf00000; hObject=0xcaf00001; hClass=0x00002080; paramsSize=0x00000004; paramsStatus=0x00000000; status=0x00000011
[ 123.410765] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 10!
[ 123.410766] NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d0002f; hObject=0xcaf00001; paramsStatus=0x00000000; status=0x00000011
[ 123.410772] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 10!
[ 123.410773] NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d0002f; hObject=0xcaf00000; paramsStatus=0x00000000; status=0x00000011
[ 123.553780] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 103!
[ 123.553785] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d00034; hParent=0x00000000; hObject=0x00000000; hClass=0x00000000; paramsSize=0x00000078; paramsStatus=0x00000000; status=0x00000011
[ 123.553787] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 103!
[ 123.553788] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d00034; hParent=0xc1d00034; hObject=0xcaf00000; hClass=0x00000080; paramsSize=0x00000038; paramsStatus=0x00000000; status=0x00000011
[ 123.553799] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 103!
[ 123.553800] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d00034; hParent=0xcaf00000; hObject=0xcaf00001; hClass=0x00002080; paramsSize=0x00000004; paramsStatus=0x00000000; status=0x00000011
[ 123.553819] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 10!
[ 123.553820] NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d00034; hObject=0xcaf00001; paramsStatus=0x00000000; status=0x00000011
[ 123.553827] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 10!
[ 123.553828] NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d00034; hObject=0xcaf00000; paramsStatus=0x00000000; status=0x00000011
[ 128.409842] NVRM: Error in service of callback
[ 306.996381] NVRM: rm_power_source_change_event: rm_power_source_change_event: Failed to handle Power Source change event, status=0x11
I have seen a number of posts about 575 breaking suspend on many systems, so this may be related: presumably the GPU powers off when idle and then never returns to operation when needed. I hope this issue can be resolved, as it essentially prevents using the dGPU on laptops with GPU power management enabled. For the time being I will revert to driver 570, where this worked, but I am willing to upgrade again and test fixes or workarounds as needed.
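To help correlate the failure with runtime power management, a small polling sketch like this could timestamp the dGPU’s runtime-PM state changes so the “suspended and never comes back” moment can be matched against the NVRM errors in dmesg (same assumption about the PCI address; standard kernel sysfs only, nothing driver-specific):

```python
#!/usr/bin/env python3
# Minimal polling sketch: log every runtime-PM state transition of the dGPU
# with a timestamp. Assumes the dGPU is at 0000:01:00.0.
import time
from pathlib import Path

status = Path("/sys/bus/pci/devices/0000:01:00.0/power/runtime_status")

last = None
while True:
    state = status.read_text().strip()
    if state != last:
        print(time.strftime("%H:%M:%S"), "runtime_status:", state)
        last = state
    time.sleep(1)
```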
Thank you.
OS: Fedora 42
GPU: RTX 4070 Mobile
Host: ASUS TUF F15 (2023)
Kernel Version: 6.14.11-300.fc42.x86_64