Dear community of developers,
I am but a simple doctor who has been enthusiastically using a DGX cluster of 2 to run Qwen 397b as an inference model for my clinical notes. Believe it or not, it’s the first and only local model I have found that effortlessly generates a medical note from my conversation, a system prompt, and a couple of small search tools. This has been a huge boon to my patient care, since I can just have a conversation, human being to human being, and let a robot take care of 90% of modern medicine’s woes, aka note writing and filling in billing codes.
Now, all was well…until I decided to update the docker repository that @eugr_nv has kindly created. As luck would have it, the update broke the memory allocation for this model, and I’m unable to run it.
I am wondering how I can roll back to an earlier version of the repository and vLLM build?
By the way, I know there is awesome software engineering talent here, fantastic CAD wizards….but I’ll nominate myself to be the team doc :D!
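On the rollback question above: a minimal sketch, assuming the local copy of the repository is an ordinary git clone. The helper name `checkout_before`, the default branch name `main`, and the example path and date are all hypothetical. How the image and vLLM build are then regenerated depends on the repository's own build script, which is not shown in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check out the newest commit made on or before a cutoff date.
# Assumes an ordinary git clone; the branch name defaults to "main" (an assumption).
checkout_before() {
    local repo_dir="$1" cutoff="$2" branch="${3:-main}"
    local commit
    # Newest commit on the branch whose commit date is not after the cutoff.
    commit="$(git -C "$repo_dir" rev-list -n 1 --before="$cutoff" "$branch")" || return 1
    if [ -z "$commit" ]; then
        echo "no commit on $branch before $cutoff" >&2
        return 1
    fi
    # Leaves the clone in detached-HEAD state at that commit.
    git -C "$repo_dir" checkout --quiet "$commit" || return 1
    echo "$commit"
}

# Example (path and date are placeholders):
# checkout_before ~/spark-vllm-docker "2026-05-01"
```

The function prints the commit it checked out, which is worth noting down alongside the image tag so the working combination can be reproduced later. Checking out the branch again (for example `git -C <dir> checkout main`) returns to the latest version.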
@jc2375 Fantastic use of the Sparks. We may be in similar boats. I’ve been unable to run vLLM since I installed NVIDIA Sync updates today. (It hangs both Sparks brutally.) In my case, when I check system logs:
journalctl -b -1 -k | grep -iE 'memory'
I see this error:
NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
I think it worked. I had to go back and change to qwen3_coder in the recipe instead of the xml; tool calls are more reliable for me with that template. But it is working again!!
Yes, there seems to be an issue with the most recent vLLM build and how it calculates available memory. I’ll see if I can find the cause if it’s not fixed in the next nightly build.
Following up on what may be the root cause of @jc2375’s issue, regarding the NV_ERR_NO_MEMORY kernel log entry (I apologise for the AI-generated phrasing, but I’ve reviewed it for accuracy):
I’ve observed it on a docker-pinned vLLM build (0.19.1rc1.dev1315+g102aaddf8.d20260519, image built three days ago via eugr/spark-vllm-docker), so it’s not unique to the most recent vLLM nightly.
Today’s occurrence was non-fatal: a single dmesg entry at the exact moment the MTP drafter model finished loading and shared-weight remapping began (mtp.py:484 + llm_base_proposer.py:1392/1448). The run proceeded to serve traffic normally and is still serving. dmesg shows only the bare allocation failure (_memdescAllocInternal @ mem_desc.c:1359), with no os_acquire_rwlock_read cascade like last night.
So I think there are two distinct phenomena: (1) _memdescAllocInternal returns NO_MEMORY more often than vLLM expects on Spark UMA, which is mostly cosmetic, and (2) sometimes that failure triggers an RM lock cascade (matching open-gpu-kernel-modules#968), which is fatal. Today I hit (1) only; yesterday I hit (1)→(2). Whether vLLM’s memory accounting is the cause of (1) or only a contributing factor, the recoverability of (1) might also differ across vLLM versions.
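The distinction between (1) and (2) can be checked mechanically against a kernel log. This is a minimal sketch, assuming the two marker strings quoted in this thread (`_memdescAllocInternal` and `os_acquire_rwlock_read`) are enough to tell the cases apart; the function name `classify_nvrm_log` and its output labels are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a kernel log read from stdin.
# Assumption: the marker strings below are the ones quoted in this thread.
classify_nvrm_log() {
    local log
    log="$(cat)"
    if ! grep -q '_memdescAllocInternal' <<<"$log"; then
        # No allocation failure logged at all.
        echo "none"
    elif grep -q 'os_acquire_rwlock_read' <<<"$log"; then
        # Case (1) followed by (2): allocation failure plus lock messages.
        echo "alloc-failure-with-lock-cascade"
    else
        # Case (1) only: bare allocation failure.
        echo "alloc-failure-only"
    fi
}

# Example: classify the previous boot's kernel messages.
# journalctl -b -1 -k | classify_nvrm_log
```

It only looks for the presence of the strings, so it cannot tell whether the lock messages were actually caused by the allocation failure; it is a quick triage aid, not a diagnosis.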
Happy to grab any additional logs that would help.