When loading a large model like gpt-oss-120b, the system freezes and becomes unresponsive about 80% of the way through the load. The mouse cursor won’t move, and any animations running in Firefox in the background also stall. After 15-20 seconds, loading continues. (This is with a 65K context window in LM Studio.) Everything works fine after the stall.
If I don’t clear the cache, running the multi-agent chatbot will also crash the system (it does not recover and requires a reboot).
I assume this has something to do with running out of memory and how quickly cache and virtual memory can be managed, but I wanted to be sure I didn’t have bad LPDDR5x.
Hi Alan, can you share the exact steps and frameworks you are using to load and run inference on gpt-oss-120b? We want to see if we can reproduce this locally and fix it if necessary.
context length 65536
GPU offload 36/36
CPU Threads 18
Offload KV Cache to GPU Memory (yes)
Keep Model in Memory (no)
Seed 94121
Custom Fields (High Reasoning Effort)
Runtime: CUDA 13 Llama.cpp 1.53.1
Not sure if any of the settings matter – that is just what I’ve changed from default.
—
2. The crash occurs with the playbook instructions for dgx-spark-playbooks/nvidia/multi-agent-chatbot at main · NVIDIA/dgx-spark-playbooks · GitHub if I have a lot of Firefox windows open and do not clear the cache before starting the Docker containers that run everything.
While we work on reproducing your exact scenario, I wanted to share a best practice we recommend: clearing the cache before starting any Docker container. For example:
NOTE
DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
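To see how much memory the flush actually returns, one option is to compare MemAvailable in /proc/meminfo before and after running it. The following is only a sketch: the function names mem_available_kib and drop_caches_report are hypothetical helpers, not part of any NVIDIA tooling, and it assumes MemAvailable is a reasonable proxy for reclaimable memory on a UMA system.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of any NVIDIA tooling.

# Print the MemAvailable value (in kiB) from a meminfo-style file.
# Defaults to /proc/meminfo when no file is given.
mem_available_kib() {
    awk '/^MemAvailable:/ {print $2}' "${1:-/proc/meminfo}"
}

# Flush the buffer cache and report how MemAvailable changed.
# Needs sudo. Nothing runs until this function is called explicitly.
drop_caches_report() {
    local before after
    before=$(mem_available_kib)
    sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
    after=$(mem_available_kib)
    echo "MemAvailable: ${before} kiB -> ${after} kiB (change: $(( (after - before) / 1024 )) MiB)"
}
```

Source the file and call drop_caches_report right before starting the containers or loading the model.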
Meanwhile, if you capture dmesg logs when you see this behavior, please do share here.
[13786.069340] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
[13869.129129] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
[13877.074194] sh (220893): drop_caches: 3
[13924.293678] sh (221373): drop_caches: 3
[13997.970504] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
It does produce this error, but the whole system doesn’t crash. If I clear the cache AFTER LM Studio loads the model, memory usage is about 100GB. If I eject the model, usage is about 14GB at baseline.
When you watch it on DGX Dashboard, usage seems to jump from 72.5GB all the way to 127GB, and the dashboard even pauses. It doesn’t always lock up for the full amount of time, though. If I clear the cache right before loading, the mouse cursor stays responsive, although you can see the dashboard freeze. Somehow the usage seems to double.
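Since the dashboard itself pauses during the jump, it might help to log memory from a terminal while the model loads. A rough sketch, assuming /proc/meminfo reflects the GPU-side allocations on this unified-memory system (I have not confirmed that); mem_snapshot is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: summarize a meminfo-style file in MiB.
# "used" is computed as MemTotal - MemAvailable; "cached" is the page cache.
mem_snapshot() {
    awk '
        /^MemTotal:/     {total = $2}
        /^MemAvailable:/ {avail = $2}
        /^Cached:/       {cached = $2}
        END {
            printf "used=%dMiB cached=%dMiB available=%dMiB\n",
                   (total - avail) / 1024, cached / 1024, avail / 1024
        }' "${1:-/proc/meminfo}"
}

# Example: print one timestamped sample per second while the model loads.
# Stop with Ctrl-C.
#   while sleep 1; do echo "$(date +%T) $(mem_snapshot)"; done
```

If "cached" is large just before the jump and "available" collapses at the same moment, that would point at page cache not being released fast enough rather than at bad memory.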
Is there some sort of kernel patch that you can do where a NV_ERR_NO_MEMORY automatically drops the caches?
I believe the kernel log level is set to quiet, so you are not seeing the OOM killer messages in the log.
Can you increase the log level using the command below and then share the log?
dmesg -n 8
Or, after reproducing the issue, run dmesg and share the log.
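If the full log is too noisy to share, it can be narrowed to memory-related lines first. This is a minimal sketch: filter_mem_events is a hypothetical helper, and the pattern list is an assumption about which messages are relevant here, not an exhaustive set.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep only kernel log lines that look memory-related.
# Reads log text on stdin. The patterns are an assumed, non-exhaustive list:
#   NV_ERR_NO_MEMORY  NVIDIA driver allocation failures
#   out of memory     driver and OOM killer messages
#   oom-kill          OOM killer invocation and summary lines
#   drop_caches       manual cache flushes
filter_mem_events() {
    grep -Ei 'NV_ERR_NO_MEMORY|out of memory|oom-kill|drop_caches'
}

# Example: save the filtered log with human-readable timestamps.
#   sudo dmesg -T | filter_mem_events > spark-mem-events.log
```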
Another thing:
In your log we see “sh (220893): drop_caches: 3”. That means drop_caches was run when the issue occurred, but it was not run before launching the model? Did you add that yourself, or was it enabled as part of the instructions you followed?
Last night, I got two OOM errors with the pause in responsiveness, cleared the cache (which you see) and reloaded, and got no pause in responsiveness other than a stutter, and it generated the OOM error.
TODAY, I did sudo dmesg -n 8.
Dropped the cache first, and then loaded gpt-oss-120b. Didn’t lock but got more errors. Then unloaded the model, loaded something smaller, then reloaded the model. This time it DID lock for ~5 seconds but there was only one error in dmesg.
[44735.549239] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
[44997.271913] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
[44997.273761] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from status @ kernel_graphics_context.c:1178
[44997.275550] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from kgrctxAllocMainCtxBuffer(pGpu, pKernelGraphicsContext, pKernelGraphics, pKernelChannel) @ kernel_graphics_context.c:1387
[44997.278153] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from kgrctxAllocCtxBuffers(pGpu, pKernelGraphicsObject->pKernelGraphicsContext, pKernelGraphics, pKernelGraphicsObject) @ kernel_graphics_object.c:214
[44997.442466] NVRM: rpcRmApiAlloc_GSP: GspRmAlloc failed: hClient=0xc1d00e18; hParent=0xbeed0100; hObject=0xbeed3901; hClass=0x0000cd40; paramsSize=0x00000000; paramsStatus=0x00000057; status=0x00000057
There are actually two different behaviors.
It will always stall the system in terms of the blinking cursor in LM Studio or even the polling interval of DGX Dashboard. At the point where I expect the freeze (~80% of the model load), the steady linear increase in RAM usage halts and then jumps up to 127GB.
Sometimes the mouse cursor will ALSO lock. When the mouse cursor is unresponsive, you don’t get extra dmesg messages.
I can always generate the single line, even if I drop caches:
[45822.801325] sh (597128): drop_caches: 3
[45859.583277] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
The more complex errors don’t seem to correlate with the locking up. Maybe it was able to avoid locking up due to more graceful error handling?