Qwen3-Coder-Next-FP8: Double memory allocation on DGX Spark when loading from cache
Just got my Spark today, and I'm trying to run Qwen3-Coder-Next-FP8 with HF `AutoModelForCausalLM.from_pretrained()` and seeing some weird behavior.
I set up a Docker container from nvcr.io/nvidia/pytorch:26.01-py3 and installed transformers 5.2.0 to get the Qwen3-Next architecture (I also wanted to test this flow of a new HF library on top of the PyTorch container on the Spark).
The first time I ran the sample code from the HF repo, the model loaded just fine. The download took a while, then system RAM went from about 2 GB to 85 GB used, exactly as I'd expect, with the HF tqdm bar showing the model loading. However, I got a random CUDA error on inference, so I re-ran with `os.environ["CUDA_LAUNCH_BLOCKING"] = "1"`.
Every run after that first one, even without the CUDA env var set, and even after restarting the kernel and the container, shows double the memory utilization: before the loader's tqdm bar appears, system memory jumps by ~80 GB; then the progress bar shows up and usage climbs toward 160 GB before OOMing.
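For reference, my load path is basically the sample from the model card; a rough sketch of what I'm running (exact prompt/args may differ slightly):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Coder-Next-FP8"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" should stream/shard the weights so peak host RAM stays
# near one copy of the model -- which is what happened on the very first run
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

inputs = tokenizer("Write a quicksort in Python.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```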
I've seen similar issues posted for the Spark relating to safetensors mmap double allocation in ComfyUI (#10896), and I'm wondering if anyone knows how to address this, i.e. how to keep HF from double-allocating memory.
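In case it helps anyone narrow this down: since safetensors loading goes through mmap, one way to tell whether the extra ~80 GB is page cache (the mapped shard files) or real anonymous allocations is to watch /proc/meminfo from a second shell while from_pretrained() runs. A minimal sketch (standard Linux fields, nothing Spark-specific):

```python
import time

def meminfo_gb():
    # Parse /proc/meminfo; values are reported in kB, convert to GiB.
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            fields[key] = int(rest.split()[0]) / (1024 ** 2)
    return fields

# Poll while the model loads in another process.
while True:
    m = meminfo_gb()
    print(f"anon={m['AnonPages']:.1f} GiB  "
          f"cached={m['Cached']:.1f} GiB  "
          f"available={m['MemAvailable']:.1f} GiB")
    time.sleep(2)
```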
What I’ve Tried (Nothing Works)
- `disable_mmap=True` → `TypeError: Qwen3NextForCausalLM.__init__() got an unexpected keyword argument 'disable_mmap'` (exists in diffusers but not in transformers for custom model classes)
- `low_cpu_mem_usage=True` → no effect (`device_map="auto"` already implies this)
- `use_safetensors=True` → no effect (the model only ships in safetensors format)
- `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'` → no effect on the double allocation
- `CUDA_LAUNCH_BLOCKING=1` → no effect, or somehow triggered this entire episode??
- `device_map="cuda"` → no effect; same physical memory usage either way (rough sketches of these calls below)
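For concreteness, here is roughly how those options were passed (a sketch of the failing variants, not a working recipe; none of them changed the double allocation):

```python
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen3-Coder-Next-FP8"

# 1) disable_mmap -> TypeError: the kwarg exists in diffusers, not in
#    transformers' from_pretrained for this model class
# model = AutoModelForCausalLM.from_pretrained(model_id, disable_mmap=True)

# 2) explicit low_cpu_mem_usage / use_safetensors -> no change
#    (device_map="auto" already implies low_cpu_mem_usage)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# 3) forcing everything onto the CUDA side of unified memory -> same
#    physical memory usage either way
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda")
```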
Questions
- Is this a known issue when loading Hugging Face models? Is there a reason the model loaded fine the first time but now ALWAYS uses double the memory?
- Any workarounds for this and other models?
Hardware: DGX Spark
Environment: Docker container based on nvcr.io/nvidia/pytorch:26.01-py3; transformers 5.2.0 (installed via pip for Qwen3-Next architecture support); PyTorch 2.10.0 (from the NVIDIA container)
Model: Qwen/Qwen3-Coder-Next-FP8