Spark-vllm-docker runs out of memory loading Qwen3.5-397B-A17B-int4-AutoRound

I’ve been trying to troubleshoot a peculiar memory problem over the past 3 days where loading Qwen3.5-397B-A17B-int4-AutoRound across two Sparks runs out of memory and locks up both machines. And I really mean locks up: both require a power cycle to come back. This was working perfectly before a recent firmware update, so I’m not sure if that has something to do with it.

Launching eugr’s docker image, everything works fine until here:

Loading safetensors using InstantTensor loader:  78% Completed | 220542/280989 [00:20<00:05, 11066.88it/s]
Loading safetensors using InstantTensor loader:  83% Completed | 231864/280989 [00:22<00:04, 10626.39it/s]
Loading safetensors using InstantTensor loader:  87% Completed | 243272/280989 [00:23<00:03, 10843.67it/s]
Loading safetensors using InstantTensor loader:  90% Completed | 254274/280989 [00:24<00:02, 10697.51it/s]
Loading safetensors using InstantTensor loader:  95% Completed | 265719/280989 [00:25<00:01, 10911.06it/s]
Loading safetensors using InstantTensor loader:  98% Completed | 276718/280989 [00:26<00:00, 9799.99it/s]
Loading safetensors using InstantTensor loader: 100% Completed | 280989/280989 [00:27<00:00, 10264.98it/s]
(Worker_TP0 pid=208) 
(Worker_TP0 pid=208) INFO 05-05 12:04:41 [default_loader.py:391] Loading weights took 29.38 seconds

And then it stops. Curiously, htop shows a spike in memory usage from about 109GB (while loading) to 121GB right after loading. I guess this would be around the point where the CUDA graphs would be built. So, something is happening after loading the model that exhausts all memory on both nodes.
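Since htop is gone the moment the machine locks up, it can help to log memory to disk while the model loads. Here is a minimal sketch that samples MemAvailable and swap usage from /proc/meminfo once per second. The function names and the default log path are made up for illustration, and the sync after every write is only there so the last samples have a chance of surviving a power cycle.

```shell
# Print "<MemAvailable MiB> <swap used MiB>" from a meminfo-style file.
# Defaults to /proc/meminfo; the optional path argument exists so the
# parser can be tried against a saved copy.
mem_snapshot() {
    awk '
        /^MemAvailable:/ { avail = $2 }
        /^SwapTotal:/    { total = $2 }
        /^SwapFree:/     { free  = $2 }
        END { printf "%d %d\n", avail / 1024, (total - free) / 1024 }
    ' "${1:-/proc/meminfo}"
}

# Append one timestamped sample per second until interrupted.
# $1 is the log file (placeholder default: mem.log in the current dir).
log_memory() {
    while :; do
        printf '%s %s\n' "$(date +%T)" "$(mem_snapshot)" >> "${1:-mem.log}"
        sync   # flush to disk so a hard lockup does not lose the tail
        sleep 1
    done
}
```

Start it in a second terminal on each node with something like `log_memory /var/tmp/vllm-mem.log` before launching the recipe, then read the tail of the file after the reboot.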

I am using the qwen3.5-397b-int4-autoround.yaml recipe. Other (smaller) models seem to work okay; it’s just this one that won’t load, even though it previously worked fine.

Also, I confirmed NCCL comms are working fine between both systems. A recent ASUS GX10 firmware update fixed the 16GB/sec cap (version 0x3000006), so now I’m back to 24GB/sec between systems.

Any ideas? I’d love to get back up and running.

You’ve also tried load-format safetensors with multiple drop_caches, right?

Unfortunately, using the slower ‘safetensors’ loader doesn’t make a difference, but I did see something weird. The 2nd node (GX10) spiked to 121 GB of memory usage after only 5% of the model had loaded. Of course, things halted at this point since the 2nd node became unresponsive, but eventually things kept going until the model loaded to 100%, and the same 121GB memory issue happened, locking up both nodes. You can see here the strange delay around 5% loading:

Loading safetensors checkpoint shards:   0% Completed | 0/40 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   2% Completed | 1/40 [00:11<07:10, 11.05s/it]
Loading safetensors checkpoint shards:   5% Completed | 2/40 [00:22<07:14, 11.44s/it]
Loading safetensors checkpoint shards:   8% Completed | 3/40 [07:22<2:01:56, 197.75s/it]
Loading safetensors checkpoint shards:  10% Completed | 4/40 [07:34<1:14:45, 124.59s/it]
Loading safetensors checkpoint shards:  12% Completed | 5/40 [07:45<48:42, 83.50s/it]
Loading safetensors checkpoint shards:  15% Completed | 6/40 [07:55<33:10, 58.54s/it]
Loading safetensors checkpoint shards:  18% Completed | 7/40 [08:05<23:26, 42.62s/it]
Loading safetensors checkpoint shards:  20% Completed | 8/40 [08:18<17:43, 33.23s/it]
Loading safetensors checkpoint shards:  22% Completed | 9/40 [08:29<13:33, 26.25s/it]
Loading safetensors checkpoint shards:  25% Completed | 10/40 [08:40<10:45, 21.52s/it]
Loading safetensors checkpoint shards:  28% Completed | 11/40 [08:51<08:51, 18.33s/it]
Loading safetensors checkpoint shards:  30% Completed | 12/40 [09:03<07:40, 16.43s/it]
Loading safetensors checkpoint shards:  32% Completed | 13/40 [09:13<06:28, 14.37s/it]
Loading safetensors checkpoint shards:  35% Completed | 14/40 [09:22<05:36, 12.96s/it]
Loading safetensors checkpoint shards:  38% Completed | 15/40 [09:32<04:58, 11.94s/it]
Loading safetensors checkpoint shards:  40% Completed | 16/40 [09:44<04:48, 12.00s/it]
Loading safetensors checkpoint shards:  42% Completed | 17/40 [09:56<04:35, 11.97s/it]
Loading safetensors checkpoint shards:  45% Completed | 18/40 [10:08<04:23, 11.97s/it]
Loading safetensors checkpoint shards:  48% Completed | 19/40 [10:19<04:04, 11.65s/it]
Loading safetensors checkpoint shards:  50% Completed | 20/40 [10:30<03:50, 11.51s/it]
Loading safetensors checkpoint shards:  52% Completed | 21/40 [10:41<03:36, 11.40s/it]
Loading safetensors checkpoint shards:  55% Completed | 22/40 [10:53<03:27, 11.52s/it]
Loading safetensors checkpoint shards:  57% Completed | 23/40 [11:05<03:17, 11.62s/it]
Loading safetensors checkpoint shards:  60% Completed | 24/40 [11:16<03:03, 11.47s/it]
Loading safetensors checkpoint shards:  62% Completed | 25/40 [11:28<02:54, 11.64s/it]
Loading safetensors checkpoint shards:  65% Completed | 26/40 [11:40<02:44, 11.76s/it]
Loading safetensors checkpoint shards:  68% Completed | 27/40 [11:52<02:33, 11.77s/it]
Loading safetensors checkpoint shards:  70% Completed | 28/40 [12:03<02:18, 11.54s/it]
Loading safetensors checkpoint shards:  72% Completed | 29/40 [12:14<02:05, 11.40s/it]
Loading safetensors checkpoint shards:  75% Completed | 30/40 [12:25<01:55, 11.51s/it]
Loading safetensors checkpoint shards:  78% Completed | 31/40 [12:37<01:42, 11.43s/it]
Loading safetensors checkpoint shards:  80% Completed | 32/40 [12:49<01:32, 11.56s/it]
Loading safetensors checkpoint shards:  82% Completed | 33/40 [13:00<01:20, 11.53s/it]
Loading safetensors checkpoint shards:  85% Completed | 34/40 [13:11<01:07, 11.29s/it]
Loading safetensors checkpoint shards:  88% Completed | 35/40 [13:22<00:55, 11.17s/it]
Loading safetensors checkpoint shards:  90% Completed | 36/40 [13:33<00:45, 11.27s/it]
Loading safetensors checkpoint shards:  92% Completed | 37/40 [13:45<00:34, 11.43s/it]
Loading safetensors checkpoint shards:  95% Completed | 38/40 [13:56<00:22, 11.38s/it]
Loading safetensors checkpoint shards:  98% Completed | 39/40 [14:04<00:10, 10.22s/it]
Loading safetensors checkpoint shards: 100% Completed | 40/40 [14:16<00:00, 10.84s/it]
Loading safetensors checkpoint shards: 100% Completed | 40/40 [14:16<00:00, 21.41s/it]

And yes, I drop the cache before each run. Here’s the journalctl output from the last boot, right before the hard reset:

May 05 06:45:04 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:04 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:04 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 05 06:45:05 spark-9555 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359

Something definitely seems off here.
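For anyone who wants to compare against their own nodes, here is a rough sketch for pulling the same NVRM lines out of the previous boot's kernel log. It assumes the journal is persistent (otherwise `-b -1` has nothing to show), and the helper name is made up.

```shell
# Count NVRM out-of-memory lines in kernel log text read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable under "set -e".
count_nvrm_oom() {
    grep -c 'NVRM: .*NV_ERR_NO_MEMORY' || true
}

# Print only the first matching line, which carries the timestamp of
# the moment the driver first failed to allocate.
first_nvrm_oom() {
    grep 'NVRM: .*NV_ERR_NO_MEMORY' | head -n 1
}
```

Typical use would be `journalctl -k -b -1 --no-pager | count_nvrm_oom` and `journalctl -k -b -1 --no-pager | first_nvrm_oom`. Matching the first timestamp against a memory log from the same run should show whether the driver errors start before or after the spike.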

I’m experiencing something similar. While loading the Hybrid 122b-A10B on my single GB10, after the model loads (somewhere around the torch compile step, I think) RAM spikes to 100% and the whole system locks up. I was able to recover twice by hitting Ctrl+C right as RAM started to fill up (I was monitoring from another terminal), but nothing in my system or recipe had changed.

Something is odd on my system. I thought it was something I’d done myself, but it could easily be related if it worked for you in the past and stopped working now.

BTW, adjusting the GPU memory availability had no effect in my case.

Oh it was definitely working before I updated. I was running the 397B version across the same two nodes for a few months nonstop, using it for tons of agentic work, and it was running great.

Found something:

I was trying to remember what changed on my system besides a ‘sudo apt upgrade’ that I ran a couple of days back (that one is harder to undo…), and I remembered that I turned down swappiness HEAVILY on my machine, going from 60 (the default) to 10 and finally to 0. This was because I sometimes saw slowdowns and swap being used when there was still plenty of RAM available.

To get my 122b-A10B model to load, I tightened all the parameters (context, KV params, GPU max memory allocation, etc.), and I was still getting hangs.

Right here, memory would go to 100% and freeze. If I hit Ctrl+C quickly when memory overshot, I could recover. Leave it for 15 seconds and it was done; it wouldn’t recover.
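That manual "Ctrl+C before it is too late" step could in principle be automated with a small watchdog. The following is only a sketch under several assumptions: the function names are invented, the 4096 MiB floor and the 'vllm serve' process pattern in the usage line are guesses, and whether a SIGINT sent from the host is handled the same way as Ctrl+C in the launching terminal is untested.

```shell
# Succeed when MemAvailable (in MiB) is below the floor given in $1.
# $2 optionally names a meminfo-style file; defaults to /proc/meminfo.
# If no MemAvailable line is found, this reports "not below".
mem_below() {
    awk -v floor="$1" '
        /^MemAvailable:/ { low = ($2 / 1024 < floor) }
        END { exit !low }
    ' "${2:-/proc/meminfo}"
}

# Poll once per second; as soon as available memory drops below the
# floor ($1, MiB), send SIGINT to every process whose command line
# matches the pattern in $2.
watchdog() {
    until mem_below "$1"; do
        sleep 1
    done
    pkill -INT -f "$2"
}
```

From a root shell on the host, `watchdog 4096 'vllm serve'` would wait in a second terminal and fire once less than about 4 GiB remains. A one-second poll may still be too slow if the spike is abrupt.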

So today I remembered swappiness and changed it back to defaults:

sudo sysctl vm.swappiness=60

Launched the 122b-A10B again and… It loaded! But more interestingly, look what it does in that particular moment (same as where it was dying before):

Memory temporarily overshoots maximum RAM, swap goes up by 3.5GB.

After a successful load, this is what it looked like. So that “initial profiling/warmup” was the cause of my issues with such an aggressive swappiness value.

I went back to a swappiness value of 10 and it loaded successfully again. I will do more tests later to see what’s the lowest value I can get away with, but it looks like this model (and just this one for me) needs to swap ~3.5GB of RAM no matter how I set up my recipe, and with a swappiness value of 0, it will just try to stay in RAM and crash.

Check your current value @Phaserblast (cat /proc/sys/vm/swappiness) and if it’s too low, try increasing it and see how it goes.

Finally, after my model has loaded, if I clear the swap with sudo swapoff -a && sudo swapon -a, it just removes everything from swap and works perfectly.
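The swappiness check suggested above can be wrapped in a small pre-launch guard. This is just a sketch: the function name is made up, and the default minimum of 10 is simply the lowest value reported to work in this post, not a recommended setting.

```shell
# Succeed when the swappiness value is at least the minimum in $1
# (default 10). $2 optionally names the file to read; it defaults to
# the live kernel setting.
swappiness_at_least() {
    min="${1:-10}"
    current=$(cat "${2:-/proc/sys/vm/swappiness}")
    [ "$current" -ge "$min" ]
}
```

Before starting the recipe, `swappiness_at_least 10 || sudo sysctl vm.swappiness=10` would raise the value only when it is too low. A value set with sysctl this way does not survive a reboot.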

I tried that, increasing swappiness to 100, but it didn’t help. Swap usage never went over 2.6GB, despite the available swap being 16GB.

When it starts to load the tensors, run this command in a separate terminal

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

Works for me; hopefully it will for you too!

Eh, good idea—but unfortunately, no change. The system still explodes after the model loads.

It works most of the time when I’m using a slightly tighter memory utilization with --gpu-memory-utilization-gb 106. I’m running Sunshine, and that eats up ~2GB by default for me:

./launch-cluster.sh -t vllm-node:latest \
   -e PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" \
   -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
   --apply-mod mods/fix-qwen3.5-autoround \
   --apply-mod mods/fix-qwen3.5-chat-template \
   --apply-mod mods/gpu-mem-util-gb \
   -j 8 \
   exec vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound \
     --max-model-len 131072 \
     --max-num-seqs 2 \
     --gpu-memory-utilization-gb 106 \
     --port 8000   --host 0.0.0.0 \
     --enable-prefix-caching \
     --enable-auto-tool-choice \
     --tool-call-parser qwen3_coder \
     --reasoning-parser qwen3 \
     --trust-remote-code \
     --chat-template unsloth.jinja \
     -tp 2 \
     --max-num-batched-tokens 4176 \
     --kv-cache-dtype fp8 \
     --distributed-executor-backend ray

I’m experiencing the exact same issue. The 397B model used to work fine (112.5GB), but now it doesn’t (110GB), and the symptoms are identical.

I’ve tried both old and new vLLM images, but the situation remains the same. Is it possible that something changed after the latest firmware update?

This graph shows the memory spike that occurs right after the model finishes loading. Everything is fine until something causes the huge memory spike that brings everything to a screeching halt:

The wrong response, I know, because what once worked should still work, but until you get this sorted out 3.6-27b is honestly so good that I’ve not needed to go back to 3.5-397b for anything.

Yeah, if you’re on a single Spark, use the 122B.

The core of the issue is that “something has changed,” and what previously worked flawlessly is now failing.

It appears to be either a software or a firmware update issue, as even the older vLLM versions—which used to be stable—are now exhibiting the same behavior. This is not isolated to the original poster; others are experiencing this as well.

Essentially, any configurations or “recipes” that push the hardware to its limits are now performing worse.

With all due respect, suggesting to “use a smaller model” isn’t a viable solution. The Spark systems were specifically acquired to handle large-scale models, not smaller ones.

Reverted to vllm 0.19.1 and it works. 🏆 v0.20.1 works, too. So stick to these releases.

Memory spike is gone, and getting about 30-31 t/s according to llama-benchy.

I modified the Dockerfile to clone the ‘v0.19.1’ branch of vllm. In addition, I had to check out only the ‘requirements’ dir from 0.20.0, otherwise vllm wouldn’t build. One more thing: I also needed to disable mods/gpu-mem-util-gb/gpu_mem.patch, as it wouldn’t apply cleanly to the v0.19.1 code.

I’ll do some more testing to see if it’s actually vllm or the patches that are causing the problem. UPDATE: yep, it seems like an issue with the vllm main branch.

Reread my statement. I didn’t suggest “swap models and forget about this one” like you’re implying. We all know these devices are a never-ending box of changing configs and reliability, so I said “try this other as a temp band-aid.” I wrongly assumed OP had tried earlier builds, which would have pointed towards a FW/update-related issue with model loading (one we couldn’t directly affect now).

I have the same problem. I had to rebuild the Docker image with this command to pin the version, and now it’s working again (for me):

./build-and-copy.sh --tf5 --vllm-ref releases/v0.20.1 --flashinfer-ref release-v0.6.9 -c spark-02

My guess is that the latest vllm from the main branch has some problems. But I’m too lazy to test and pinpoint the root cause, since building an image takes a very looooong time.

Yep, you’re right. I just tested it, and v0.20.1 works too. So they definitely messed something up between then and now.

This is one of the models that I don’t have in the build pipeline (yet) due to its resource requirements, but I’ve noticed that issue too.

I’ll look into it.

There are also some regressions in vLLM/Flashinfer that caused builds to fail for the past 3 days, but if you are not rebuilding from source, the latest “stable” builds are fine.