[575.64] NVRM “Out of memory” error leaves dGPU unusable after some time

Greetings, this is my first time posting a topic on the forums, so I hope this issue hasn’t already been covered here (I couldn’t find any posts about this specific issue).

After upgrading to the 575.xx series of drivers, the system eventually loses access to the dedicated GPU entirely; the GPU appears to power off at some point and cannot be recovered without a reboot. Rebooting restores the GPU until the issue occurs again. This happens regardless of whether the GSP firmware is used, although in my experience it takes longer on average to appear with GSP firmware enabled.
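
In case it’s relevant, here is roughly how I toggle GSP firmware between tests, using the driver’s NVreg_EnableGpuFirmware module parameter (this only applies to the proprietary kernel modules, since the open modules always run GSP; the modprobe.d path and dracut step are for Fedora):

# set to 1 to enable GSP firmware, 0 to disable it (proprietary modules only)
echo "options nvidia NVreg_EnableGpuFirmware=1" | sudo tee /etc/modprobe.d/nvidia-gsp.conf
sudo dracut --force   # rebuild the initramfs so the option applies at boot
# after rebooting, check the GSP state; "N/A" means GSP is disabled
nvidia-smi -q | grep "GSP Firmware"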

On 575.64 I get the following dmesg output when the issue happens, indicating some sort of out-of-memory error:

[ 1137.596055] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1353
[ 1137.596061] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from status @ kernel_gsp.c:4615
[ 1137.596075] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from kgspCreateRadix3(pGpu, pKernelGsp, &pKernelGsp->pSRRadix3Descriptor, NULL, NULL, gspfwSRMeta.sizeOfSuspendResumeData) @ kernel_gsp_tu102.c:1303
[ 1137.604703] nvidia 0000:01:00.0: can't suspend (nv_pmops_runtime_suspend [nvidia] returned -5)
[ 1569.669102] show_signal_msg: 38 callbacks suppressed
[ 1569.669104] steam[15009]: segfault at 0 ip 00000000f7cdcdc3 sp 00000000ffc87f48 error 4 in libc.so.6[89dc3,f7c53000+15f000] likely on CPU 7 (core 12, socket 0)
[ 1569.669111] Code: c9 0f 84 fa 00 00 00 40 a8 03 74 1e 8a 08 38 ca 0f 84 16 01 00 00 84 c9 0f 84 e3 00 00 00 40 eb 09 8d b6 00 00 00 00 83 c0 10 <8b> 08 31 d1 bf ff fe fe fe 01 cf 0f 83 d3 00 00 00 31 cf 81 cf ff
[ 1574.639813] NVRM: Error in service of callback

Previously, on driver version 575.57 with GSP firmware enabled, I got this log instead:

[ 8089.608012] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1353
[ 8089.608025] NVRM: faultbufCtrlCmdMmuFaultBufferRegisterNonReplayBuf_IMPL: Error allocating client shadow fault buffer for non-replayable faults
[ 8089.704996] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1353
[ 8089.705010] NVRM: faultbufCtrlCmdMmuFaultBufferRegisterNonReplayBuf_IMPL: Error allocating client shadow fault buffer for non-replayable faults

It seems some assertions were added between these releases (or maybe they are just enabled on testing releases), so perhaps this issue is already known?

For completeness, I also had this log output on 575.57 with GSP firmware disabled:

[ 99.260465] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1353
[ 99.260470] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from status @ kernel_gsp.c:4615
[ 99.260483] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from kgspCreateRadix3(pGpu, pKernelGsp, &pKernelGsp->pSRRadix3Descriptor, NULL, NULL, gspfwSRMeta.sizeOfSuspendResumeData) @ kernel_gsp_tu102.c:1303
[ 99.267540] nvidia 0000:01:00.0: can't suspend (nv_pmops_runtime_suspend [nvidia] returned -5)
[ 123.410734] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 103!
[ 123.410738] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d0002f; hParent=0x00000000; hObject=0x00000000; hClass=0x00000000; paramsSize=0x00000078; paramsStatus=0x00000000; status=0x00000011
[ 123.410740] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 103!
[ 123.410741] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d0002f; hParent=0xc1d0002f; hObject=0xcaf00000; hClass=0x00000080; paramsSize=0x00000038; paramsStatus=0x00000000; status=0x00000011
[ 123.410750] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 103!
[ 123.410751] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d0002f; hParent=0xcaf00000; hObject=0xcaf00001; hClass=0x00002080; paramsSize=0x00000004; paramsStatus=0x00000000; status=0x00000011
[ 123.410765] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 10!
[ 123.410766] NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d0002f; hObject=0xcaf00001; paramsStatus=0x00000000; status=0x00000011
[ 123.410772] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 10!
[ 123.410773] NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d0002f; hObject=0xcaf00000; paramsStatus=0x00000000; status=0x00000011
[ 123.553780] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 103!
[ 123.553785] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d00034; hParent=0x00000000; hObject=0x00000000; hClass=0x00000000; paramsSize=0x00000078; paramsStatus=0x00000000; status=0x00000011
[ 123.553787] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 103!
[ 123.553788] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d00034; hParent=0xc1d00034; hObject=0xcaf00000; hClass=0x00000080; paramsSize=0x00000038; paramsStatus=0x00000000; status=0x00000011
[ 123.553799] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 103!
[ 123.553800] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d00034; hParent=0xcaf00000; hObject=0xcaf00001; hClass=0x00002080; paramsSize=0x00000004; paramsStatus=0x00000000; status=0x00000011
[ 123.553819] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 10!
[ 123.553820] NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d00034; hObject=0xcaf00001; paramsStatus=0x00000000; status=0x00000011
[ 123.553827] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x00000011 for fn 10!
[ 123.553828] NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d00034; hObject=0xcaf00000; paramsStatus=0x00000000; status=0x00000011
[ 128.409842] NVRM: Error in service of callback
[ 306.996381] NVRM: rm_power_source_change_event: rm_power_source_change_event: Failed to handle Power Source change event, status=0x11

I have seen a number of posts about 575 breaking suspend on many systems, so this may be related: presumably the GPU powers off when idle and then never returns to operation when needed. Hopefully this can be resolved, as it effectively prevents using the dGPU on laptops with runtime power management enabled for the GPU. For the time being I will revert to driver 570, where this worked, but I am happy to upgrade again and test fixes or workarounds as needed.
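
For anyone wanting to confirm the same behaviour, the dGPU’s runtime power state is visible in sysfs (the PCI address below is my dGPU’s, taken from the dmesg output above):

# "auto" means runtime power management is enabled for the device
cat /sys/bus/pci/devices/0000:01:00.0/power/control
# "suspended" means runtime PM has powered the GPU off, "active" means it is on
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status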

Thank you.

OS: Fedora 42
GPU: RTX 4070 Mobile
Host: ASUS TUF F15 (2023)
Kernel Version: 6.14.11-300.fc42.x86_64


Hi @gimzie, could you try applying @aplattner’s patch: suspend/resume fixes · NVIDIA/open-gpu-kernel-modules@c7e7213 · GitHub? This fix will be present in the next driver release.
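
Roughly, for a source build of the open kernel modules, the steps would be along these lines (the tag name is assumed to match your installed driver version; if you use packaged modules such as Fedora’s akmods, the patch would need to be applied to the package sources instead):

# clone the open modules at the matching release and cherry-pick the fix
git clone --branch 575.64 https://github.com/NVIDIA/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules
git cherry-pick c7e7213              # the suspend/resume fix linked above
make modules -j$(nproc)
sudo make modules_install
sudo depmod && sudo dracut --force   # then reboot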

I applied the patch and tested suspend/resume; it seems to work! The dmesg log still reports an “out of memory” error, but the suspend no longer fails and the GPU remains accessible as normal. I also confirmed that the open kernel modules are in use, so it appears the patch solves the problem.
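
(In case it’s useful to others, I checked which module flavour is loaded via the module license string, which is “Dual MIT/GPL” for the open kernel modules versus “NVIDIA” for the proprietary ones:)

modinfo -F license nvidia   # prints "Dual MIT/GPL" when the open modules are in use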

I will report back if the issue occurs again with this patch active. Otherwise, I’m looking forward to the next driver release. 😀


Update: Unfortunately, while general GPU use seems fine with this patch and software can render on the GPU (e.g. vkcube or glxgears), CUDA functionality seems to disappear intermittently. As a result, NVENC encoding becomes unavailable in applications such as ffmpeg.

I managed to get NVENC support to return once after running an ffmpeg test command, but I have not been able to reproduce that, and I currently cannot get it back without a reboot. dmesg is being flooded with “out of memory” errors as before.
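
For reference, the kind of ffmpeg test I ran was roughly the following NVENC smoke test (testsrc is a synthetic input, so no input file is needed; the encode errors out during encoder initialization when CUDA/NVENC has dropped out):

# encode a short synthetic clip with NVENC and discard the output
ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -f null -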

I have attached my NVIDIA bug report log below.
nvidia-bug-report.log.gz (803.5 KB)

Edit: I have since updated to the 575.64.1 driver package, running on the 6.14 Linux kernel, and everything appears to be working as expected, including suspend. The driver does not yet function on the 6.15 kernel for me, but I understand that should be resolved soon.
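
(Until 6.15 support arrives, I am keeping the machine on the working kernel; on Fedora the default boot entry can be pinned with grubby, something like:)

# keep booting the known-good 6.14 kernel by default (version string from my system)
sudo grubby --set-default /boot/vmlinuz-6.14.11-300.fc42.x86_64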


I have the same problem. :(

Possibly related to this? Non-existent shared VRAM on NVIDIA Linux drivers

Not sure; every time it reported “out of memory” errors, the system had plenty of both VRAM and RAM available, which in theory means shared VRAM wouldn’t be needed to resume?
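
(I was checking with something along these lines, and both showed plenty of headroom each time the errors appeared:)

# dGPU VRAM usage versus total, plus system RAM
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
free -h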
