My DGX Spark Hangs ... is this normal?

Hello, I need some advice or help if possible.

I have installed a local LLM serving stack for a development team of up to 8 concurrent users, running on a single NVIDIA DGX Spark (128 GB unified memory). It exposes an OpenAI-compatible API via vLLM and a web interface via Open WebUI, both backed by the Qwen3-Coder-Next-FP8 model with FP8 KV cache and chunked prefill for better throughput.

This is what I have installed :

  • vLLM (NGC container) : High-throughput LLM inference server with OpenAI-compatible API.
  • Open WebUI (Docker) : Browser-based chat interface, with PostgreSQL + Redis for 8 concurrent sessions
  • LiteLLM Proxy (Docker) : Per-user API key management, usage tracking, and analytics
  • Qwen3-Coder-Next-FP8 : 80B MoE model (3B active params) in FP8 quantization, optimized for code generation

LLM SETTINGS :
=============

Image : nvcr.io/nvidia/vllm:26.01-py3
Model : unsloth/Qwen3-Coder-Next-FP8
Max context : 65536 tokens
Max concurrent : 8 requests
GPU mem util : 0.93
KV cache dtype : fp8_e4m3
Tool calling : true
Tool call parser : qwen3_xml
Prefix caching : true
Chunked prefill : true

PROBLEM: I connected a single VS Code IDE to it and ran a medium-difficulty task analyzing a personal project. The whole DGX Spark machine hangs completely: no mouse or keyboard response, nothing. There are no visible errors on screen and no errors in the logs. I have to reboot it manually to get it working again. The machine has the latest updates installed too.

Is this normal? Is this a hardware issue or a software issue? Has anyone else had the same problem?

After a freeze and reboot, can you view the previous boot's logs? You may be running into an OOM error, which freezes the Spark:
journalctl -k -b -1 -e

When DGX Spark experiences memory shortages leading to swap usage, it suffers from extreme response latency or even appears to ‘freeze.’ This memory shortage occurs despite the model being small enough to run on a single node, primarily due to high GPU memory utilization. If gpu_mem_util is set to 0.93, there may not be enough memory left for essential system operations, triggering swap usage.

When using vLLM, it is generally not recommended to set this value above 0.9. Personally, I expect that Qwen3-Coder-Next-FP8 should perform well even with a gpu_mem_util setting between 0.85 and 0.88. However, if eight users access it simultaneously, the speed might become sluggish. In such cases, trying quantized models like NVFP4 or INT4 could be a good alternative.
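To put numbers on the advice above, here is a rough sketch of how much memory is left outside vLLM's budget at different settings. It assumes that vLLM's budget is simply gpu_mem_util times the total, and it uses the nominal 128 GB figure; the total that CUDA actually reports on a Spark is somewhat lower, so treat the results as upper bounds. The function name `headroom_gb` is made up for this illustration.

```python
# Rough headroom estimate for a unified-memory machine.
# Assumption: vLLM's budget is gpu_mem_util * total memory, and whatever is
# outside that budget is all the OS, desktop, and other containers get.

TOTAL_GB = 128.0  # nominal DGX Spark unified memory (usable total is lower)


def headroom_gb(gpu_mem_util: float, total_gb: float = TOTAL_GB) -> float:
    """Memory left outside vLLM's budget, in GB."""
    if not 0.0 < gpu_mem_util <= 1.0:
        raise ValueError("gpu_mem_util must be in (0, 1]")
    return total_gb * (1.0 - gpu_mem_util)


for util in (0.93, 0.90, 0.88, 0.85):
    print(f"gpu_mem_util={util:.2f} -> {headroom_gb(util):5.1f} GB left for the system")
```

Under these assumptions, 0.93 leaves about 9 GB for everything else on the box (including Open WebUI, PostgreSQL, Redis, and LiteLLM), while 0.85 leaves about 19 GB.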

Running journalctl -k -b -1 -e, I get:

Apr 13 12:33:55 spark kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
Apr 13 12:34:40 spark kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
Apr 13 12:35:07 spark kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
Apr 13 12:35:37 spark kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
Apr 13 12:35:38 spark kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
Apr 13 12:36:20 spark kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
Apr 13 12:37:02 spark kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
Apr 13 12:37:57 spark kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
Apr 13 12:38:06 spark systemd-journald[632]: Under memory pressure, flushing caches.
Apr 13 12:38:08 spark systemd-journald[632]: Under memory pressure, flushing caches.
Apr 13 12:38:25 spark systemd-journald[632]: Under memory pressure, flushing caches.
Apr 13 12:38:43 spark systemd-journald[632]: Under memory pressure, flushing caches.
Apr 13 12:38:44 spark systemd-journald[632]: Under memory pressure, flushing caches.
Apr 13 12:38:55 spark systemd-journald[632]: Under memory pressure, flushing caches.
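For anyone comparing their own logs against the output above, here is a minimal sketch that counts the two kinds of lines and records when each first appeared. The regular expressions are written only against the lines quoted in this thread, so other driver or systemd versions may word them differently; the function name `summarize` and the label names are made up for this illustration.

```python
import re
from collections import Counter

# Patterns written against the log lines quoted in this thread (an assumption:
# other driver/systemd versions may phrase these messages differently).
PATTERNS = {
    "nvrm_oom": re.compile(r"NVRM: .*NV_ERR_NO_MEMORY"),
    "journald_pressure": re.compile(r"systemd-journald\[\d+\]: Under memory pressure"),
}


def summarize(lines):
    """Count matching lines per label and remember the first timestamp seen."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                # Default journalctl timestamps ("Apr 13 12:33:55") are 15 chars.
                first_seen.setdefault(label, line[:15])
    return counts, first_seen


# Two lines copied from the output above, used as a small demo input.
SAMPLE = [
    "Apr 13 12:33:55 spark kernel: NVRM: nvCheckOkFailedNoLog: Check failed: "
    "Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from "
    "_memdescAllocInternal(pMemDesc) @ mem_desc.c:1359",
    "Apr 13 12:38:06 spark systemd-journald[632]: Under memory pressure, flushing caches.",
]

counts, first_seen = summarize(SAMPLE)
for label in PATTERNS:
    print(label, counts[label], first_seen.get(label))
```

To run it on a real log, replace SAMPLE with the lines of saved journalctl output. In the log above, the driver's out-of-memory messages start roughly four minutes before journald reports memory pressure.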

I had the same issue today. I tried to run RedHatAI/Qwen3.5-122B-A10B-NVFP4 with --gpu-memory-utilization 0.85, but somehow (with nothing else running, and having run the command to drop caches first) it still consumed all the memory and locked up the box. Initially it just became incredibly slow over SSH, but before I could stop the container it stopped responding entirely.

Pressing the power button briefly to try to get it to shut down did not work, and eventually I ended up having to hold it down to force a power cut.

I wonder if --gpu-memory-utilization 0.85 doesn’t apply to some memory required as part of loading the model, and that’s what caused it to consume the whole lot?
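Since the box slowed down before it locked up, one stopgap is a small watchdog that stops the container once available memory gets low. The following is only a sketch under assumptions: it assumes MemAvailable in /proc/meminfo reflects the pressure on the Spark's unified memory, the 8 GiB threshold is an arbitrary illustrative value, and the container name is taken from the docker run command below. There is no guarantee the watchdog itself gets scheduled in time once the machine is already thrashing.

```python
import subprocess
import time

MEMINFO_PATH = "/proc/meminfo"
CONTAINER = "qwen35"                  # --name used in the docker run command
MIN_AVAILABLE_KIB = 8 * 1024 * 1024   # hypothetical threshold: 8 GiB
POLL_SECONDS = 2.0


def parse_meminfo(text):
    """Return {field: value in kiB} from /proc/meminfo-style text."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def should_stop(fields, min_available_kib=MIN_AVAILABLE_KIB):
    """True when MemAvailable is reported and is below the threshold."""
    available = fields.get("MemAvailable")
    return available is not None and available < min_available_kib


def watch():
    """Poll /proc/meminfo and stop the container once memory runs low."""
    while True:
        with open(MEMINFO_PATH) as f:
            fields = parse_meminfo(f.read())
        if should_stop(fields):
            # `docker stop` sends SIGTERM, then SIGKILL after a grace period.
            subprocess.run(["docker", "stop", CONTAINER], check=False)
            break
        time.sleep(POLL_SECONDS)
```

Calling `watch()` from a separate terminal before starting the model would run the loop. Because the container uses --restart unless-stopped, a container stopped this way is not restarted automatically.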

The exact command I ran was (note: non-standard cache paths):

docker run \
  --name qwen35 \
  -d \
  --gpus all \
  --restart unless-stopped \
  --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=64gb \
  -p 8111:8000 \
  -v ~/ext/cache/huggingface:/root/.cache/huggingface \
  -v ~/ext/cache/vllm:/root/.cache/vllm \
  -e VLLM_NO_USAGE_STATS=1 \
  vllm/vllm-openai:gemma4-cu130 \
  RedHatAI/Qwen3.5-122B-A10B-NVFP4 \
    --port 8000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.85 \
    --served-model-name qwen35 \
    --max-model-len 256k \
    --kv-cache-dtype fp8 \
    --language-model-only \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --enable-chunked-prefill \
    --max-num-batched-tokens 65536 \
    --max-num-seqs 10 \
    --enable-prefix-caching \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
    --trust-remote-code \
    --moe-backend flashinfer_cutlass

The logs after rebooting contain this:

Apr 13 17:05:42 toad kernel: docker0: port 2(vethc22a762) entered blocking state
Apr 13 17:05:42 toad kernel: docker0: port 2(vethc22a762) entered forwarding state
Apr 13 17:38:32 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:00 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:03 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:04 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:06 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:07 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:08 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:09 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:10 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:12 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:13 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:14 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:15 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:16 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:18 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:19 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:20 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:41:54 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:42:06 toad kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDe>
Apr 13 17:45:42 toad systemd-journald[636]: Under memory pressure, flushing caches.
Apr 13 17:45:47 toad systemd-journald[636]: Under memory pressure, flushing caches.
Apr 13 17:45:49 toad systemd-journald[636]: Under memory pressure, flushing caches.