Double memory use in Huggingface Qwen3 coder next

Qwen3-Coder-Next-FP8: Double memory allocation on DGX Spark when loading from cache

Just got my Spark today and I'm trying to run Qwen3-Coder-Next-FP8 with HF AutoModelForCausalLM.from_pretrained(), and I'm seeing weird behavior.

I set up a Docker container from nvcr.io/nvidia/pytorch:26.01-py3 and installed transformers 5.2.0 to get Qwen3-Next support (I also wanted to test this flow of a new HF library on top of the PyTorch container on the Spark).

The first time I ran the test code from the HF repo, the model actually loaded just fine. It took a while for the model to download, then I saw the machine's RAM usage go from about 2 GB to 85 GB, exactly as I'd expect, with the HF tqdm progress bar showing the model was loading. However, I got a random CUDA error on inference, so I tried to run again with os.environ["CUDA_LAUNCH_BLOCKING"] = "1".

Every run after that first run, even if I don’t have the cuda env var set, even after restarting the kernel and the container, I see double memory utilization. That is, prior to the model loader tqdm showing up, the system memory usage jumps by 80GB, then the progress bar shows up and the memory grows towards 160GB before OOMing.

I’ve seen similar issues posted for the Spark relating to safetensors mmap double-allocation in ComfyUI #10896, and am wondering if anyone knows how to address this, i.e. how I can get HF to not double-allocate memory.

What I’ve Tried (Nothing Works)

  • disable_mmap=True → TypeError: Qwen3NextForCausalLM.__init__() got an unexpected keyword argument 'disable_mmap' (exists in diffusers but not in transformers for custom model classes)
  • low_cpu_mem_usage=True → No effect (device_map="auto" already implies this)
  • use_safetensors=True → No effect (model only ships in safetensors format)
  • sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' → No effect on the double-allocation
  • CUDA_LAUNCH_BLOCKING=1 → No effect or somehow triggered this entire episode??
  • device_map="cuda" → No effect; same physical memory either way
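
For reference, the load these flags were being applied to is essentially the following (a minimal sketch; the actual test script from the HF repo may differ slightly):

from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen3-Coder-Next-FP8"

# device_map="auto" already implies low_cpu_mem_usage=True; on the Spark the
# "CPU" and "GPU" placements land in the same unified physical memory either way
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)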

Questions

  1. Is this a known issue with Huggingface models? Is there a reason why the model loaded the first time and now ALWAYS uses double memory?
  2. Any workarounds for this and other models?

Hardware: DGX Spark
Environment: Docker container based on nvcr.io/nvidia/pytorch:26.01-py3, transformers 5.2.0 (installed via pip for Qwen3Next architecture support), PyTorch 2.10.0 (from nvidia container)
Model: Qwen/Qwen3-Coder-Next-FP8

You may prefer to build your own stack, but I would recommend starting with the community Docker image:

It’s super simple to launch a model such as Qwen3-Coder-Next-FP8 and you’ll be chatting with opencode in no time.

@AoE Thanks for the reply. Yep, totally aware of vLLM and Llama.cpp as other options for inference, but I am interested in fine-tuning. As a side note, loading models from HF is a core thing ML engineers do in this day and age, so I just want to make sure it works on the Spark, since training models was the key differentiator for me in buying a Spark over a Mac or AMD device.

1 Like

gentle bump

This behavior is consistent with how safetensors mmap loading interacts with unified memory on the Spark. You’re not seeing a leak — you’re seeing overlapping memory pressure from file-backed mmap and tensor allocation in the same physical memory pool.

On the first run, the model streams from disk and materializes into tensors (~80–85GB). On subsequent runs, the model weights are already resident in the page cache, so mmap maps that cached data into memory before the tensor allocation begins. Then the loader allocates the actual tensors on top of it.

On discrete GPU systems, file-backed cache (host RAM) and model tensors (VRAM) are separate, so this overlap is usually harmless. On Spark’s unified memory, both compete for the same physical pool, so cached weights directly reduce available capacity for CUDA allocations. That can effectively double the observed memory pressure and push the system into OOM.

In addition, mmap on the Spark can be page-fault heavy and relatively slow for large model loads. Each access triggers page faults within the unified memory system, which adds overhead during initialization. So the issue is not just memory usage — mmap-based loading interacts poorly with unified memory both in terms of latency (page faults) and capacity (shared pool with CUDA allocations).

You can confirm this by checking cached memory before and after runs:

cat /proc/meminfo | grep -i Cached

If cached memory roughly matches the model size after the first run, that’s the page cache holding the mmap’d weights.

This behavior has also been discussed here:

For your specific setup (HuggingFace from_pretrained()):

1. Drop page cache immediately before loading

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

This needs to be done immediately before the from_pretrained() call. If the model files get re-cached between the drop and load (e.g. during file scanning), the behavior returns.
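
If loading happens in a Python script running as root inside the container (the usual case for the NGC image), the drop can be wired in right before the load so nothing re-caches the files in between. A sketch of that ordering, not a guaranteed fix:

import subprocess
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen3-Coder-Next-FP8"

# Flush dirty pages, then drop the page cache so the cached safetensors shards
# stop occupying unified memory alongside the tensors about to be allocated.
# Writing to drop_caches requires root.
subprocess.run(["sync"], check=True)
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")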

2. Increase NVMe read-ahead

sudo bash -c "echo 8192 > /sys/block/nvme0n1/queue/read_ahead_kb"

This has been reported to reduce mmap load time by improving prefetch behavior and reducing page fault overhead.

3. Alternative loading paths

In some cases, loading weights manually (e.g. via safetensors.torch.load_file) can avoid mmap, but this depends on the model and may require additional handling. It’s not always a drop-in replacement for from_pretrained().
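
A rough sketch of that idea using safetensors' safe_open (a close relative of load_file). It is untested here, the shard path is a placeholder, and a quantized FP8 checkpoint may still need the quantizer-aware handling that from_pretrained performs, so treat it only as the shape of an alternative loader:

import glob
import torch
from safetensors import safe_open

state_dict = {}
# Read each shard and materialize every tensor directly on the GPU device
for shard in sorted(glob.glob("/path/to/model/snapshot/*.safetensors")):
    with safe_open(shard, framework="pt", device="cuda:0") as f:
        for name in f.keys():
            state_dict[name] = f.get_tensor(name)

# The model skeleton must already exist (e.g. built on the meta device);
# assign=True adopts the loaded tensors instead of copying into new ones
# model.load_state_dict(state_dict, assign=True)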

4. Monitor unified memory during loading

On Spark, application-level memory APIs (e.g. cudaMemGetInfo) do not reflect page cache usage. To observe actual pressure during initialization:

watch -n1 free -h

If available memory drops toward zero during model load, the page cache and CUDA allocations are competing within the same unified memory pool.
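
If you'd rather see this from inside the loading script than in a second terminal, a small polling thread over /proc/meminfo gives the same picture (a sketch):

import threading
import time

def log_unified_mem(stop, interval=1.0):
    # Print MemAvailable and Cached (in GB) once per interval; on the Spark,
    # watching these two move against each other during a load is the
    # signature of page cache and tensor allocations sharing one pool
    while not stop.is_set():
        fields = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, rest = line.split(":", 1)
                fields[key] = int(rest.strip().split()[0])  # values are in kB
        print(f"MemAvailable={fields['MemAvailable'] / 1e6:6.1f} GB  "
              f"Cached={fields['Cached'] / 1e6:6.1f} GB")
        time.sleep(interval)

stop = threading.Event()
threading.Thread(target=log_unified_mem, args=(stop,), daemon=True).start()
# ... run from_pretrained() here, then stop.set() once loading is done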

This isn’t specific to Qwen3 — any large safetensors model loaded via mmap on a unified memory system will show similar behavior once the weights are cached.

2 Likes

This is the same class of bug I hit loading Qwen 3.5 122B NVFP4 on a 128GB DGX Spark. Peak memory hit 127.5GB deterministically at 93% of tensors loaded. Same double-counting pattern. The underlying issue is that from_pretrained’s WeightConverter pipeline retains tensor references even after placement, and on unified memory those “ghost tensors” compete with the GPU-resident weights.

I wrote a fix (NVFP4PlaceOp) that uses untyped_storage().resize_(0) to free source bytes immediately while preserving tensor object lifetime, plus a custom streaming loader that skips the HF loader ceremony. Peak dropped from 127.5GB to 103GB on the same workload.
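
NVFP4PlaceOp isn't public as far as I know, but the storage-release trick it describes is easy to illustrate on its own: copy the weight to its destination, then shrink the source tensor's storage to zero bytes so any lingering references in the converter pipeline no longer pin the original bytes. A sketch of the idea, not the poster's actual code:

import torch

def place_and_release(src: torch.Tensor, device="cuda:0") -> torch.Tensor:
    # Materialize the weight on its target device first
    placed = src.to(device, copy=True)
    # Free the source bytes while keeping the tensor object alive: code that
    # still holds a reference to `src` now sees a zero-byte storage instead
    # of a full-size "ghost" copy sitting in unified memory
    src.untyped_storage().resize_(0)
    return placed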

The drop_caches workaround works but treats the symptom. The real fix needs to live inside the quantizer’s conversion path.

1 Like