Spark-vllm-docker runs out of memory loading Qwen3.5-397B-A17B-int4-AutoRound

Thanks, man. And congrats on the new job.

For some reason, this model can now be loaded with kv_cache_dtype bfloat16. Not sure if it’s because of vllm 0.20.1 or the latest firmware update from a few days ago.

(APIServer pid=86) INFO 05-07 09:49:56 [utils.py:233] non-default args: {'model_tag': 'Intel/Qwen3.5-397B-A17B-int4-AutoRound', 'chat_template': 'unsloth.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_xml', 'host': '0.0.0.0', 'port': 5803, 'model': 'Intel/Qwen3.5-397B-A17B-int4-AutoRound', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 262144, 'override_generation_config': {'temperature': 0.6, 'min_p': 0.01, 'top_p': 0.95, 'top_k': 20, 'presence_penalty': 0.0, 'repetition_penalty': 1.0}, 'load_format': 'instanttensor', 'attention_backend': 'FLASHINFER', 'reasoning_parser': 'qwen3', 'master_addr': '169.254.129.240', 'nnodes': 2, 'tensor_parallel_size': 2, 'gpu_memory_utilization_gb': 112.0, 'kv_cache_dtype': 'bfloat16', 'enable_prefix_caching': True, 'max_num_batched_tokens': 4176, 'max_num_seqs': 2}
// ...
(EngineCore pid=161) INFO 05-07 09:53:47 [kv_cache_utils.py:1709] GPU KV cache size: 393,216 tokens
(EngineCore pid=161) INFO 05-07 09:53:47 [kv_cache_utils.py:1710] Maximum concurrency for 262,144 tokens per request: 1.50x
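
As a quick sanity check on those two log lines, the reported concurrency is simply the KV cache size in tokens divided by the per-request context length. This is only a sketch of the arithmetic using the numbers from the log; vLLM's internal calculation may account for more than this, and the variable names are mine.

```python
# Figures copied from the vLLM log lines above.
kv_cache_tokens = 393_216   # "GPU KV cache size"
max_model_len = 262_144     # 'max_model_len' in the non-default args

# Number of full-length requests that fit in the KV cache at once.
max_concurrency = kv_cache_tokens / max_model_len
print(f"Maximum concurrency: {max_concurrency:.2f}x")  # prints "Maximum concurrency: 1.50x"
```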

Just for info: 122B with the same start parameters:

on 20.1: Available KV cache memory: 39.46 GiB

on 20.2 dev: Available KV cache memory: 26.97 GiB
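
To put the two figures side by side, here is a small sketch of the difference. It uses only the two numbers quoted above; the variable names are illustrative.

```python
# "Available KV cache memory" values quoted above, in GiB.
kv_20_1 = 39.46
kv_20_2_dev = 26.97

lost_gib = kv_20_1 - kv_20_2_dev
lost_pct = 100 * lost_gib / kv_20_1

# Roughly 12.5 GiB less, i.e. close to a third of the KV cache budget.
print(f"{lost_gib:.2f} GiB less KV cache ({lost_pct:.1f}% drop)")
```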

That’s crazy! No wonder 397B wasn’t working.

So, the culprit is this environment variable: PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True" - once you remove it, everything works normally, no OOM.
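
If you launch things from your own script rather than the recipe, a minimal sketch of dropping that option could look like the following. The helper name `strip_expandable_segments` is hypothetical, and I'm assuming the variable uses PyTorch's usual comma-separated `option:value` format; how the recipe itself sets the variable isn't shown here.

```python
import os


def strip_expandable_segments(env: dict) -> dict:
    """Return a copy of env with expandable_segments removed from PYTORCH_CUDA_ALLOC_CONF."""
    cleaned = dict(env)
    value = cleaned.get("PYTORCH_CUDA_ALLOC_CONF")
    if value is None:
        return cleaned
    # Keep every option except expandable_segments, whatever its value.
    kept = [
        opt
        for opt in value.split(",")
        if opt.strip() and not opt.strip().startswith("expandable_segments:")
    ]
    if kept:
        cleaned["PYTORCH_CUDA_ALLOC_CONF"] = ",".join(kept)
    else:
        # Nothing else was configured, so drop the variable entirely.
        del cleaned["PYTORCH_CUDA_ALLOC_CONF"]
    return cleaned


# Environment to hand to a child process (e.g. via subprocess.run(..., env=clean_env)).
clean_env = strip_expandable_segments(dict(os.environ))
```

Any other allocator options that were set alongside it are left untouched.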

I pushed an updated version of the recipe where I revert to GPU memory utilization in % and lower it to 0.9 - looks like new builds are actually more memory efficient, and fit more KV cache.

Thanks for taking the time to look into it! I’ll give it another shot with the latest vllm pull.

I tested with the latest vllm pull and it’s working well now. Thank you eugr!

Yea, congrats on becoming Nvidia

I may try reverting to 0.19.1, but what’s your total VRAM? I have the Gigabyte Atom, which still reports 119GB, 2GB short compared with the DGX Spark. My only other approach is to use another model (gptq-int4) and run it with sglang.

I have both the gold NVIDIA-branded one and an ASUS GX10. Each reports about 121GB. But, I stripped them down, running only the essential stuff. You can purge or mask the display manager, desktop environment and all the non-essential stuff included with Ubuntu. I just uninstalled it all since I run these headless, and every last byte matters when running Qwen3.5-397B. Removing unneeded bloat is even more important on my GX10, since it’s the cheapest model with only 1TB of storage.
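
For a rough feel of what a 0.9 utilization setting means on the two totals mentioned in this thread, here is a sketch. It assumes the fraction is applied to the total memory the device reports, and it takes the 119GB and 121GB figures at face value; `memory_budget_gb` is a made-up helper, not part of vLLM.

```python
def memory_budget_gb(total_gb: float, utilization: float) -> float:
    """Memory budget for a given utilization fraction of the reported total."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return total_gb * utilization


# Totals quoted in this thread; 0.9 is the value from the updated recipe.
for name, total_gb in [("119GB box", 119.0), ("121GB box", 121.0)]:
    budget = memory_budget_gb(total_gb, 0.9)
    print(f"{name}: {budget:.1f} GB for vLLM, {total_gb - budget:.1f} GB left for everything else")
```

On these assumptions the 2GB gap between the boxes shrinks to under 2GB of usable budget, which is why trimming the OS footprint matters about as much as the hardware difference.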

Try the latest builds, I haven’t noticed any issues with RAM usage.

Seems to be working okay for me. The only issue I ran into was rebuilding the Docker image failed with any version of flashinfer newer than v0.6.9. But A-OK with latest vllm. Even with KV cache set to bfloat16, I’m able to nearly max out the 260K context according to vllm, with about 6GB of KV cache left over after loading the model. 🥳

That’s an amazing achievement to use bf16! I think you’re safe to run Claude Code locally.

Will there be any MTP space left?

How big is the context window it will support on a single Spark and how fast is it with large contexts? Is caching working here?

Unlikely you’ll be able to run 397B on a single Spark in any effective way. The max context length is 262144 tokens:

"max_position_embeddings": 262144

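If you want to check that limit yourself, a small sketch that pulls the value out of a parsed `config.json` might look like this. The fallback to a nested `text_config` block is an assumption on my part (some multimodal configs nest it there), and `max_context_from_config` is a hypothetical helper.

```python
import json
from typing import Optional


def max_context_from_config(config: dict) -> Optional[int]:
    """Return max_position_embeddings from a parsed config.json, if present."""
    if "max_position_embeddings" in config:
        return config["max_position_embeddings"]
    # Assumption: some configs keep the text model settings under "text_config".
    return config.get("text_config", {}).get("max_position_embeddings")


# Stand-in for json.load(open("config.json")) on the downloaded model.
config = json.loads('{"max_position_embeddings": 262144}')
print(max_context_from_config(config))  # prints 262144
```
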
You can turn MTP on, but it doesn’t really help much: