Qwen3.5-122B-A10B NVFP4 Quantized for DGX Spark — 234GB → 75GB, Runs on 128GB

Hi everyone,

I’ve quantized Qwen3.5-122B-A10B (Alibaba’s latest multimodal MoE model) from BF16 to NVFP4 so it fits on a single DGX Spark. The original model is ~234 GB, which exceeds the Spark’s 128 GB of unified memory. This quantized version is 75.6 GB, leaving ~52 GB of headroom for KV cache and vLLM overhead.

Sharing it here so other Spark owners can use it directly without going through the quantization process.

HuggingFace: alpertor/Qwen3.5-122B-A10B-NVFP4

About the model

Qwen3.5-122B-A10B is a multimodal Mixture-of-Experts model with:

- 122B total parameters, ~10B active per token

- 256 experts per layer (8 active), 48 layers

- Hybrid attention: DeltaNet (linear) + standard full attention

- Supports text, image, and video understanding

- Think/no-think mode for reasoning tasks

Quantization details

- Format: NVFP4 (4-bit floating-point weights, FP8 per-group scales, group size 16)

- Original size: 234 GB (BF16)

- Quantized size: 75.6 GB

- Compression ratio: ~3.1x

- Tool: vllm-project/llm-compressor + compressed-tensors

- Calibration: 512 samples from ultrachat_200k, 2048 max sequence length

- All MoE experts calibrated: yes (moe_calibrate_all_experts=True)

- Hardware: 4x NVIDIA H100 80GB (Vast.ai, ~1.5 hours)
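As a back-of-envelope check of those numbers: with group size 16, each NVFP4 weight costs 4 bits plus 8/16 bits of amortized FP8 scale, i.e. 4.5 bits/weight. This sketch (my arithmetic, not from the release) shows the gap to the actual 75.6 GB is roughly the tensors kept at BF16:

```python
# Rough size model for the NVFP4 checkpoint. Each quantized weight costs
# 4 bits plus an amortized FP8 group scale: 4 + 8/16 = 4.5 bits/weight.
params = 122e9                  # total parameters
bits_per_weight = 4 + 8 / 16    # FP4 weight + amortized per-group scale
all_fp4_gb = params * bits_per_weight / 8 / 1e9

print(round(all_fp4_gb, 1))     # ~68.6 GB if every tensor were NVFP4
print(round(234 / 75.6, 1))     # reported compression ratio, ~3.1x
```

The several-GB difference between ~68.6 GB and the actual 75.6 GB is consistent with the router, vision encoder, embeddings, and lm_head staying at BF16.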

Quantized to FP4 (reduced precision):

- MoE expert weights — gate_proj, up_proj, down_proj across all 256 experts × 48 layers

- Full attention projection weights (self_attn Q/K/V/O)

- Shared expert weights

Kept at BF16 (full precision):

- lm_head (output generation layer)

- MoE router/gate networks (expert selection)

- Shared expert gate

- DeltaNet / linear attention layers

- Vision encoder (all visual processing)

- Layer norms and embeddings

This means all routing decisions, vision processing, and output generation run at full precision. Only the bulk computation (expert FFN and attention projections) is quantized.
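For illustration, this precision split can be expressed as an ignore list matched against module names. This is a sketch only; the module path patterns below are my assumptions about the Qwen3.5 implementation, not the recipe’s actual patterns:

```python
import re

# Modules kept at BF16 per the split above. Path patterns are illustrative;
# actual names depend on the Qwen3.5 model implementation.
IGNORE_PATTERNS = [
    r"lm_head",                  # output generation layer
    r".*\.mlp\.gate$",           # MoE router/gate networks
    r".*shared_expert_gate$",    # shared expert gate
    r".*linear_attn.*",          # DeltaNet / linear attention layers
    r"visual\..*",               # vision encoder
]

def is_quantized(module_name: str) -> bool:
    """True if the module falls through the ignore list and gets NVFP4."""
    return not any(re.fullmatch(p, module_name) for p in IGNORE_PATTERNS)

print(is_quantized("model.layers.0.mlp.experts.0.gate_proj"))  # True
print(is_quantized("model.layers.0.mlp.gate"))                 # False
```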

How to use on DGX Spark

Step 1: Download the model (~75.6GB)

```bash
pip install "huggingface_hub[cli]"
huggingface-cli download alpertor/Qwen3.5-122B-A10B-NVFP4 \
  --local-dir /models/Qwen3.5-122B-A10B-NVFP4
```

Step 2: Serve with eugr’s spark-vllm-docker

I’m using @eugr’s spark-vllm-docker, which is specifically optimized for DGX Spark. If you haven’t set it up yet, check out eugr’s repo — it handles all the vLLM configuration for Spark’s unified memory architecture.

Example serving configuration:

```
--model /models/Qwen3.5-122B-A10B-NVFP4
--quantization compressed-tensors
--trust-remote-code
--max-model-len 4096
```

I’ll update this thread with actual serving results, throughput numbers, and any configuration tweaks once I have it fully running.

Notes and caveats

- Compatibility: This model requires transformers >= 5.1.0 for the qwen3_5_moe model type. The current llm-compressor (v0.9.1) officially supports transformers <= 4.57.6, so some patching was required during quantization. The saved model itself should load fine with vLLM.

- Expert weight packing: llm-compressor correctly quantized and packed the shared expert weights but left the MoE expert weights in BF16 during save (appears to be a known issue with MoE models). I post-processed the shards to manually pack the expert weights to NVFP4 format (uint8 packed + FP8 scales). The calibration data was used to determine optimal per-group scales before packing.

- Quality: I expect typical FP4 quantization accuracy (~1-3% benchmark degradation vs BF16). If anyone runs evals or notices quality issues, please share your findings.

- Vision: The vision encoder is preserved at full BF16 precision. Multimodal capabilities (image/video understanding) should work as expected, though I haven’t extensively tested this yet.

Quantization process for the curious

For anyone who wants to reproduce or quantize other large models for Spark:

1. Rented 4x H100 80GB on Vast.ai (~$6.40/hr)

2. Used llm-compressor’s oneshot() with QuantizationModifier(scheme="NVFP4") and calibration on ultrachat_200k

3. Needed transformers 5.2.0 (patched two compatibility issues with llm-compressor)

4. Post-processed safetensors shards on CPU to pack MoE expert weights from BF16 to uint8 (FP4 packed)

5. Uploaded to HuggingFace

Total cost was under $15 including failed attempts and model downloads.
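Step 4 (packing BF16 expert weights down to FP4 codes) can be sketched in plain Python. The E2M1 magnitude grid below is the standard FP4 value set; real NVFP4 packing also applies the calibrated FP8 per-group scales before encoding, which this sketch omits:

```python
# Illustrative sketch of FP4 (E2M1) encoding and uint8 packing. Real NVFP4
# packing divides each value by its calibrated FP8 group scale first
# (group size 16); scales are omitted here for brevity.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive magnitudes

def fp4_encode(x: float) -> int:
    """4-bit code: sign bit (bit 3) plus index of the nearest E2M1 magnitude."""
    sign = 8 if x < 0 else 0
    mag = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - abs(x)))
    return sign | mag

def pack_pair(a: float, b: float) -> int:
    """Two FP4 codes packed into one byte (first value in the low nibble)."""
    return fp4_encode(a) | (fp4_encode(b) << 4)

codes = pack_pair(1.4, -3.2)
print(f"{codes:08b}")  # 11010011
```

Doing this per shard on CPU is slow but memory-cheap, which is why post-processing the safetensors shards after the GPU run is feasible.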

If you try this model on your Spark, please share your experience — especially serving configs that work well, throughput numbers, and any quality observations. Happy to answer questions about the quantization process.


Hi @alper.tor — great work on this, really appreciate the effort putting it together for DGX Spark.

I’ve been trying to get it running and hitting a consistent issue where every inference response is just repeated exclamation marks (token ID 0) regardless of prompt. Wanted to share what I’ve found in case it helps and to ask if you’ve seen this.

Setup:

  • DGX Spark, NVIDIA GB10 (compute capability 12.1, 128GB unified memory)
  • eugr’s spark-vllm-docker (vLLM 0.16.1rc1.dev14)
  • serving with --quantization compressed-tensors --language-model-only --enforce-eager

What I’ve tried:

  • VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass (default) — server loads, inference returns 500x !
  • VLLM_NVFP4_GEMM_BACKEND=cutlass — same result
  • --enforce-eager to skip CUDA graph compilation — same result
  • Verified tokenizer is working: tok.encode('Hello') returns [9419] correctly
  • Token ID 0 decodes as ! in this model’s vocabulary

What I’m seeing in the server logs during inference:

```
UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (16).
This may indicate the inputs were passed in head-first format [B, H, T, ...]
Please verify your input tensor format matches the expected shape [B, T, H, ...]
```

This warning comes from the FLA ops library handling the DeltaNet/linear attention layers. My suspicion is the linear attention layers (kept at BF16 as per your recipe) are receiving tensors in the wrong format in this vLLM build, producing garbage hidden states that propagate to lm_head, causing token 0 to dominate every output position.
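The heuristic in that warning can be illustrated with the dims from my log (batch 1, seq_len 12, 16 heads; the head dim of 64 is made up):

```python
# Layout ambiguity from the FLA warning above: a tensor shaped [B, H, T, D]
# (head-first) vs the expected [B, T, H, D]. Dims taken from the log.
B, T, H, D = 1, 12, 16, 64

def looks_head_first(shape: tuple) -> bool:
    """Heuristic from the warning: dim 1 larger than dim 2 suggests the
    tensor arrived as [B, H, T, ...] instead of [B, T, H, ...]."""
    _, d1, d2 = shape[0], shape[1], shape[2]
    return d2 < d1

print(looks_head_first((B, H, T, D)))  # True  -> transpose dims 1 and 2
print(looks_head_first((B, T, H, D)))  # False
```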

Questions:

  1. Have you been able to confirm inference is working on your end, and if so which exact vLLM version/commit?
  2. Did you see the tensor format mismatch warning during your testing?
  3. Any specific serve flags beyond the ones in your readme that were needed to get clean output?

Happy to share the full patch list I’ve applied to get the server loading — there were several missing attribute errors in this vLLM build that needed fixing before it would even start. Might be useful for others trying this.

Thanks

Please rebuild the container today using --rebuild-vllm flag. There was an issue in vLLM that broke most quantized models, but it was promptly fixed yesterday.

Hi @eugr,

Trying to serve alpertor/Qwen3.5-122B-A10B-NVFP4 on DGX Spark using spark-vllm-docker. Fresh build today after docker builder prune -af to ensure clean cache.

Environment:

  • spark-vllm-docker: dbd3d21 (latest main)

  • vLLM: 0.16.1rc1.dev30+g0f2f24c8b.d20260226

  • Build command: ./build-and-copy.sh --rebuild-vllm --tf5

Serve command (from alpertor’s model card):

```bash

./launch-cluster.sh --solo exec vllm serve \
  /root/.cache/huggingface/hub/alpertor/Qwen3.5-122B-A10B-NVFP4 \
  --quantization compressed-tensors \
  --trust-remote-code \
  --max-model-len 4096 \
  --port 8000 \
  --host 0.0.0.0 \
  --reasoning-parser qwen3
```

**Error:**

```
TypeError: Invalid type of HuggingFace config.
Expected type: <class 'vllm.transformers_utils.configs.qwen3_5_moe.Qwen3_5MoeConfig'>
but found type: <class 'transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeTextConfig'>
```

The full traceback leads to `get_hf_config(Qwen3_5MoeConfig)` at `qwen3_5.py:119`, reached via `context.py:139`.

The model uses model_type: qwen3_5_moe_text (text-only variant) but vLLM’s internal config class expects Qwen3_5MoeConfig (the multimodal variant). The --language-model-only flag doesn’t bypass this path.

Any guidance appreciated — happy to test patches.

Something seems off because per the original post, this model was supposed to be quantized with multimodal capabilities intact.

@jwarner agreed — we’ve updated our approach. Latest test is without --language-model-only, using alpertor’s exact serve command from the model card. Still crashes immediately with:

```
TypeError: Invalid type of HuggingFace config.
Expected: Qwen3_5MoeConfig
Found: Qwen3_5MoeTextConfig
```

vLLM 0.16.1rc1.dev30+g0f2f24c8b.d20260226, spark-vllm-docker dbd3d21. Fresh build today. The config type mismatch seems to be the root issue — vLLM has no handler for qwen3_5_moe_text model type.

I would ask the quant maker on HF in the model card discussion tab. Why this particular quant though? It’s not even listed in quantizations. There are other NVFP4 quants on the model page. And AWQ quants as well.

**Update — Working configuration with txn545’s ModelOpt quantization**

Thanks everyone for the feedback and testing. I owe you an honest update: I was so excited to get the quantization completed that I uploaded the model without properly testing it myself.

**What went wrong with my quantization**

The issues you’ve been hitting — garbage output (repeated `!`), `Qwen3_5MoeTextConfig` type mismatch, missing vision — are all real. My llm-compressor quantization had fundamental problems:

1. **Expert calibration was broken.** All MoE expert `global_scale` values were 1.0 (uncalibrated). This is the root cause of the garbage token-0 output. Qwen3.5 uses fused 3D expert tensors (`[num_experts, hidden, intermediate]`) that llm-compressor doesn’t know how to unfuse for per-expert calibration. There’s a fix in llm-compressor PR #2383 (“feat: add Qwen3.5 MoE calibration module”, adding `CalibrationQwen3_5MoeSparseMoeBlock`), but it wasn’t merged when I quantized.

2. **Vision encoder was stripped.** llm-compressor’s `oneshot()` saved only the language model, changing `model_type` from `qwen3_5_moe` to `qwen3_5_moe_text`. This caused the config type mismatch you saw.

3. **My manual expert weight post-processing was incorrect.** The packed uint8 weights without proper calibrated scales produced garbage.

I attempted to re-quantize using the PR #2383 branch but couldn’t get a clean result in time. I’ll remove the broken model from HuggingFace to avoid further confusion.

**Working solution: txn545’s ModelOpt quantization**

I switched to txn545/Qwen3.5-122B-A10B-NVFP4 on Hugging Face, quantized with **NVIDIA ModelOpt v0.42.0**. This model has proper expert calibration, an intact vision encoder, and the correct `qwen3_5_moe` model type. 77.8 GB, 2 safetensors shards.

However, it doesn’t work out-of-the-box on DGX Spark. Here’s what I had to fix:

**Fix 1: Docker image — Avarok’s dgx-vllm-qwen35**

I’m using Avarok’s `dgx-vllm-qwen35` image (not eugr’s spark-vllm-docker) because it has Qwen3.5 MoE support built in. I built a patched version tagged `dgx-vllm-qwen35:v1-gate-fix`.

**Fix 2: MoE gate must stay BF16**

The MoE router gate (`mlp.gate`) is a small linear layer that selects which experts to activate. When quantized to FP4, the routing decisions become noisy and the model produces degraded output. Fix in `qwen3_next.py` line 256:

```python
# Before (broken): the router gate is quantized along with everything else
self.gate = ReplicatedLinear(…, quant_config=quant_config)

# After (fixed): the router gate stays at BF16
self.gate = ReplicatedLinear(…, quant_config=None)
```

This forces the gate to stay at BF16 precision regardless of the quantization config. Rebuilt the Docker image after this change.

**Fix 3: MARLIN MoE backend (critical for SM121/Blackwell)**

The default `FLASHINFER_CUTLASS` MoE backend generates PTX code using `cvt with .e2m1x2` instructions (MX microscaling format conversion) that **SM121 (GB10) does not support**. This causes:

```
ptxas fatal: Ptx assembly aborted due to errors
Instruction 'cvt with .e2m1x2' not supported on .target 'sm_121'
```

This crashes during CUDA graph capture AND during profiling warmup; `--enforce-eager` doesn’t help.

The fix is to force the **MARLIN** MoE backend via environment variables:

```
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_TEST_FORCE_FP8_MARLIN=1
```

MARLIN actually turns out to be more memory-efficient — it leaves more room for KV cache.

**Fix 4: System tuning for unified memory**

On DGX Spark, GPU and system RAM are the same 128 GB. During model loading (~77 GB weights), the kernel aggressively swaps application pages to keep file cache. These sysctl settings help:

```

vm.swappiness=1

vm.dirty_bytes=268435456

vm.dirty_background_bytes=134217728

```

---

**Working serve configuration**

```bash
docker run -d --name qwen35-modelopt \
  --network host --gpus all --ipc=host \
  -v /path/to/txn545-Qwen3.5-122B-A10B-NVFP4:/models/qwen35:ro \
  -v ~/.cache/vllm:/root/.cache/vllm \
  -e MODEL=/models/qwen35 \
  -e PORT=8080 \
  -e GPU_MEMORY_UTIL=0.76 \
  -e MAX_MODEL_LEN=262144 \
  -e MAX_NUM_SEQS=16 \
  -e DTYPE=auto \
  -e TASK=generate \
  -e TRUST_REMOTE_CODE=true \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  --restart unless-stopped \
  dgx-vllm-qwen35:v1-gate-fix serve
```

**Results on DGX Spark**

| Metric | Value |
|---|---|
| Model size on disk | 77.8 GB |
| GPU memory utilization | 0.76 |
| Context window | 262,144 tokens |
| Max concurrent requests | 16 |
| KV cache | 405,072 tokens (~17 GiB) |
| Startup time | ~11 minutes (weight loading + CUDA graph capture) |

**Inference throughput (single request):**

| Mode | Tokens/sec |
|---|---|
| Nothink (fast) | 8-15 tok/s |
| Think (reasoning) | ~10 tok/s |
| Vision (image input) | ~9 tok/s |
| Time to first token | ~162 ms |

**Multimodal:**
Vision is working — I tested with contract document images and it correctly extracts parties, values, dates, and key terms.

**Think/nothink:**
Both modes work via `chat_template_kwargs: {"enable_thinking": true/false}` on each request. Single model instance serves both.
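The per-request toggle is just a field in the standard chat-completions body; a minimal sketch (the model path and endpoint are assumptions from my serve config):

```python
import json

# Request body toggling think/no-think per request via chat_template_kwargs.
# Model path and endpoint are assumptions based on the serve config above.
payload = {
    "model": "/models/qwen35",
    "messages": [{"role": "user", "content": "Summarize the key dates in this contract."}],
    "chat_template_kwargs": {"enable_thinking": False},  # True for think mode
}
body = json.dumps(payload)
# POST `body` to http://localhost:8080/v1/chat/completions with any HTTP client.
print(json.loads(body)["chat_template_kwargs"]["enable_thinking"])  # False
```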

---

**Startup note for DGX Spark**

Model loading temporarily spikes to nearly all 128 GB of unified memory. If you run other GPU services (OCR, etc.), stop them before starting vLLM and restart them after the model is loaded. I use a systemd oneshot service that polls the health endpoint and starts other services once vLLM is ready.
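The systemd oneshot gate I mentioned boils down to polling vLLM’s standard `/health` route until it answers. A minimal sketch (URL and timings are assumptions; adjust to your port):

```python
import time
import urllib.error
import urllib.request

def wait_for_vllm(url: str = "http://localhost:8080/health",
                  timeout_s: float = 900, poll_s: float = 10) -> bool:
    """Poll the vLLM health endpoint until it returns 200 or time runs out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(poll_s)
    return False

# if wait_for_vllm(): start the other GPU services here
```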

Apologies for the broken quantization. The combination of llm-compressor’s MoE limitations and Qwen3.5’s unusual fused expert tensor format was the root cause. txn545’s ModelOpt quantization is the way to go for now. Hope the SM121 fixes save others some debugging time.

**Why this model matters for my use case**
I’m running a document management system where AI analyzes documents — up to 262K tokens, extracting dates, values, parties, obligations, deadlines, and clauses. The workload needs three things simultaneously: long context (documents are big), vision (scanned documents as images), and both fast responses (chat UI) and deep reasoning (background analysis). Qwen3.5-122B-A10B in NVFP4 is the only combination that fits all three on DGX Spark — MoE keeps inference fast (only 10B active parameters), FP4 leaves enough memory for the 262K KV cache, and the intact vision encoder handles document images without a separate model. Smaller models sacrifice context or quality; FP8 doesn’t leave room for concurrent users during long analysis jobs. For anyone building document-heavy AI workflows on Spark, this is currently the sweet spot.


Followed the same steps. Enabling MTP does not enhance performance; other than that, it’s fairly usable.

Default:

```
model test t/s peak t/s ttfr (ms) est_ppt (ms) e2e_ttft (ms)
txn545/Qwen3.5-122B-A10B-NVFP4 pp1024 1553.58 ± 172.15 673.95 ± 81.26 668.75 ± 81.26 674.00 ± 81.26
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 14.63 ± 0.03 15.00 ± 0.00
txn545/Qwen3.5-122B-A10B-NVFP4 pp4096 2014.01 ± 11.43 2039.52 ± 11.57 2034.32 ± 11.57 2039.57 ± 11.56
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 14.55 ± 0.05 15.00 ± 0.00
txn545/Qwen3.5-122B-A10B-NVFP4 pp1024 @ d4096 2007.79 ± 11.74 2555.96 ± 14.80 2550.76 ± 14.80 2556.01 ± 14.80
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d4096 14.55 ± 0.01 15.00 ± 0.00
txn545/Qwen3.5-122B-A10B-NVFP4 pp4096 @ d4096 1757.46 ± 496.26 5414.74 ± 2648.72 5409.54 ± 2648.72 5414.79 ± 2648.72
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d4096 14.46 ± 0.10 15.00 ± 0.00
txn545/Qwen3.5-122B-A10B-NVFP4 pp1024 @ d8192 1988.64 ± 19.95 4640.39 ± 46.98 4635.19 ± 46.98 4640.44 ± 46.97
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d8192 14.49 ± 0.02 15.00 ± 0.00
txn545/Qwen3.5-122B-A10B-NVFP4 pp4096 @ d8192 1979.91 ± 10.74 6212.12 ± 33.76 6206.92 ± 33.76 6212.17 ± 33.76
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d8192 14.44 ± 0.01 15.00 ± 0.00
txn545/Qwen3.5-122B-A10B-NVFP4 pp1024 @ d16384 1891.47 ± 9.66 9209.18 ± 47.18 9203.97 ± 47.18 9209.23 ± 47.18
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d16384 14.36 ± 0.10 15.00 ± 0.00
txn545/Qwen3.5-122B-A10B-NVFP4 pp4096 @ d16384 1867.27 ± 4.81 10973.72 ± 28.43 10968.52 ± 28.43 10973.77 ± 28.43
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d16384 14.35 ± 0.02 15.00 ± 0.00
txn545/Qwen3.5-122B-A10B-NVFP4 pp1024 @ d32768 1703.94 ± 3.71 19837.48 ± 43.36 19832.28 ± 43.36 19837.54 ± 43.36
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d32768 14.19 ± 0.05 15.00 ± 0.00
txn545/Qwen3.5-122B-A10B-NVFP4 pp4096 @ d32768 1672.33 ± 2.19 22049.46 ± 28.87 22044.26 ± 28.87 22049.51 ± 28.87
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d32768 14.19 ± 0.02 15.20 ± 0.40
txn545/Qwen3.5-122B-A10B-NVFP4 pp1024 @ d65536 1399.52 ± 2.46 47565.49 ± 83.73 47560.29 ± 83.73 47565.54 ± 83.73
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d65536 13.87 ± 0.02 15.00 ± 0.00
txn545/Qwen3.5-122B-A10B-NVFP4 pp4096 @ d65536 1376.48 ± 2.13 50593.20 ± 78.52 50588.00 ± 78.52 50593.25 ± 78.52
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d65536 13.85 ± 0.03 15.00 ± 0.00
```

MTP enabled:

```
model test t/s peak t/s ttfr (ms) est_ppt (ms) e2e_ttft (ms)
txn545/Qwen3.5-122B-A10B-NVFP4 pp1024 1193.00 ± 374.03 1061.27 ± 623.97 1054.95 ± 623.97 1061.31 ± 623.97
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 9.78 ± 0.08 10.20 ± 0.40
txn545/Qwen3.5-122B-A10B-NVFP4 pp4096 1712.39 ± 177.78 2429.36 ± 295.01 2423.05 ± 295.01 2429.42 ± 295.01
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 10.17 ± 0.30 11.40 ± 0.49
txn545/Qwen3.5-122B-A10B-NVFP4 pp1024 @ d4096 1826.61 ± 7.12 2810.12 ± 11.02 2803.81 ± 11.02 2810.18 ± 11.02
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d4096 10.36 ± 0.08 11.40 ± 0.49
txn545/Qwen3.5-122B-A10B-NVFP4 pp4096 @ d4096 1649.11 ± 449.40 5687.53 ± 2616.55 5681.22 ± 2616.55 5687.58 ± 2616.55
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d4096 10.33 ± 0.21 11.80 ± 0.40
txn545/Qwen3.5-122B-A10B-NVFP4 pp1024 @ d8192 1840.47 ± 5.52 5014.86 ± 14.86 5008.54 ± 14.86 5014.92 ± 14.86
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d8192 10.31 ± 0.08 11.20 ± 0.40
txn545/Qwen3.5-122B-A10B-NVFP4 pp4096 @ d8192 1827.52 ± 4.96 6730.77 ± 18.26 6724.46 ± 18.26 6730.83 ± 18.26
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d8192 10.32 ± 0.08 11.40 ± 0.49
txn545/Qwen3.5-122B-A10B-NVFP4 pp1024 @ d16384 1766.94 ± 2.26 9858.62 ± 12.82 9852.30 ± 12.82 9858.67 ± 12.82
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d16384 10.33 ± 0.09 11.40 ± 0.49
txn545/Qwen3.5-122B-A10B-NVFP4 pp4096 @ d16384 1738.18 ± 9.84 11789.48 ± 67.12 11783.17 ± 67.12 11789.53 ± 67.12
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d16384 10.24 ± 0.09 11.20 ± 0.40
txn545/Qwen3.5-122B-A10B-NVFP4 pp1024 @ d32768 1587.07 ± 7.52 21299.58 ± 101.52 21293.26 ± 101.52 21299.63 ± 101.52
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d32768 10.20 ± 0.10 11.00 ± 0.00
txn545/Qwen3.5-122B-A10B-NVFP4 pp4096 @ d32768 1557.21 ± 2.07 23680.09 ± 31.51 23673.77 ± 31.51 23680.14 ± 31.52
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d32768 10.26 ± 0.09 11.00 ± 0.00
txn545/Qwen3.5-122B-A10B-NVFP4 pp1024 @ d65536 1311.19 ± 4.15 50770.52 ± 161.32 50764.21 ± 161.32 50770.57 ± 161.32
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d65536 10.01 ± 0.09 11.00 ± 0.00
txn545/Qwen3.5-122B-A10B-NVFP4 pp4096 @ d65536 1291.77 ± 0.98 53911.32 ± 41.07 53905.01 ± 41.07 53911.37 ± 41.07
txn545/Qwen3.5-122B-A10B-NVFP4 tg128 @ d65536 9.96 ± 0.02 11.00 ± 0.00
```

I think you’d get better performance from an INT4 + AutoRound quant, which also seems to offer slightly better accuracy than NVFP4.


Thank you for the excellent work and this informative thread. I pulled the repository this morning to get my local folder up to date, cleared my build cache, and rebuilt the vLLM image. Now I’m able to run this model with this launch command:
```bash
./launch-cluster.sh -t vllm-node-tf5-qwen35 --solo \
  --apply-mod mods/fix-qwen3.5-autoround \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
  --max-model-len 200000 \
  --max-num-batched-tokens 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 \
  --port 8000 \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm
```

Is it normal for these models to take several minutes to load? I’ve had the same issue loading other models like gpt-oss-120b and Qwen3-Coder-Next-FP8. I can’t imagine model swapping if loading takes this long…

```
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:39<08:35, 39.65s/it]
Loading safetensors checkpoint shards: 14% Completed | 2/14 [01:20<08:03, 40.26s/it]
Loading safetensors checkpoint shards: 21% Completed | 3/14 [02:01<07:26, 40.55s/it]
Loading safetensors checkpoint shards: 29% Completed | 4/14 [02:40<06:41, 40.17s/it]
Loading safetensors checkpoint shards: 36% Completed | 5/14 [03:19<05:56, 39.62s/it]
Loading safetensors checkpoint shards: 43% Completed | 6/14 [03:59<05:18, 39.83s/it]
Loading safetensors checkpoint shards: 50% Completed | 7/14 [04:40<04:41, 40.17s/it]
Loading safetensors checkpoint shards: 57% Completed | 8/14 [05:20<04:01, 40.25s/it]
Loading safetensors checkpoint shards: 64% Completed | 9/14 [05:59<03:18, 39.67s/it]
Loading safetensors checkpoint shards: 71% Completed | 10/14 [06:40<02:40, 40.00s/it]
Loading safetensors checkpoint shards: 79% Completed | 11/14 [07:19<01:59, 39.85s/it]
Loading safetensors checkpoint shards: 86% Completed | 12/14 [07:59<01:19, 39.75s/it]
Loading safetensors checkpoint shards: 93% Completed | 13/14 [08:27<00:36, 36.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [08:43<00:00, 30.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [08:43<00:00, 37.40s/it]
(EngineCore_DPO pid=323) INFO 03-08 21:41:55 [default_loader.py:293] Loading weights took 523.79 seconds
(EngineCore_DPO pid=323) INFO 03-08 21:41:57 [gpu_model_runner.py:4342] Model loading took 62.65 GiB memory and 540.323175 seconds
(EngineCore_DPO pid=323) INFO 03-08 21:41:57 [gpu_model_runner.py:5258] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DPO pid=323) INFO 03-08 21:42:18 [backends.py:913] Using cache directory: /root/.cache/vllm/torch_compile_cache/f44c80d73/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DPO pid=323) INFO 03-08 21:42:18 [backends.py:973] Dynamo bytecode transform time: 5.57 s
(EngineCore_DPO pid=323) INFO 03-08 21:42:20 [backends.py:283] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 0.943 s
(EngineCore_DPO pid=323) INFO 03-08 21:42:56 [monitor.py:35] torch.compile and initial profiling run took 43.85 s in total
```

Yes, it takes around 10 minutes for me to get it working. I am using the FP4 version, as I mentioned. There may be minor differences during loading, though.

I just timed startup using eugr’s vllm-node-tf5 with effectively the same recipe (via a custom script, because I point to a locally downloaded cache), and total load time from running the script to server availability is under 5 minutes, including CUDA graph capture.

Just the safetensors load took 69.25 seconds. This is on the first-party NVIDIA DGX Spark, which has a Gen5 SSD, though I’d be quite surprised if that alone explained an almost order-of-magnitude difference.

Edit: You definitely aren’t using fastsafetensors loading. If you were, the log would say “Loading safetensors using Fastsafetensor loader”. Enable it with --load-format fastsafetensors.

Thanks for the tip @joshua.dale.warner — 69s load time is impressive! I tested --load-format fastsafetensors on my Spark with txn545’s ModelOpt NVFP4 quantization and wanted to share a caveat for others.

TL;DR: fastsafetensors + high gpu-memory-utilization = system freeze on unified memory.

I built a new image layer with pip install fastsafetensors on top of Avarok’s dgx-vllm (vLLM 0.16.0rc2) and launched with my production config:

```
--gpu-memory-utilization 0.84
--max-model-len 262144
--max-num-seqs 16
--load-format fastsafetensors
```

The machine immediately went into swap thrash — SSH unresponsive, required a power cycle to recover.

Root cause: At 0.84 utilization, vLLM pre-allocates ~107 GB out of 128 GB unified memory. fastsafetensors does bulk GPU-direct loading which needs a temporary buffer for the model weights (~70 GB) during transfer. On a discrete GPU system this comes from separate system RAM, but on Spark’s unified memory it competes for the same 128 GB pool. 107 GB + loading buffer > 128 GB → OOM → thrash.

Your examples use --gpu-memory-utilization 0.7 (~90 GB), which leaves ~38 GB headroom — enough for the loading buffer. That’s why it works for you.

For my use case (262K tokens, 16 concurrent sessions), the KV cache from 0.84 utilization is more valuable than faster startup. ~11 min startup is acceptable since restarts are rare.

If you want to use fastsafetensors, keep --gpu-memory-utilization at 0.76 or below. At 0.84+ with a ~70 GB model, unified memory runs out during loading.
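The arithmetic behind that cutoff is simple; a sketch using the 128 GB pool and the utilization values from this thread:

```python
# Unified-memory headroom left outside vLLM's pre-allocated pool during
# loading, on the DGX Spark's 128 GB shared pool.
TOTAL_GB = 128.0

def headroom_gb(gpu_util: float, total: float = TOTAL_GB) -> float:
    """Memory left over after vLLM pre-allocates gpu_util * total."""
    return total - gpu_util * total

print(round(headroom_gb(0.84), 1))  # 20.5 -> too little for a bulk load buffer
print(round(headroom_gb(0.70), 1))  # 38.4 -> enough, per the report above
```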

I can’t get this to run reliably. Using pi coding agent, it just stops calling tools immediately. Using Eugr’s container. How are you running it?

@agustinr — As I mentioned in my post above, we’re running Avarok’s dgx-vllm image (vLLM 0.16.0rc2) with a MoE gate fix, not Eugr’s stock container. We have 49 AI tools working reliably. Here’s what’s likely causing your issue:

  1. Tool call parser flags — These are critical and easy to miss:
    VLLM_EXTRA_ARGS=--enable-auto-tool-choice --tool-call-parser qwen3_xml --default-chat-template-kwargs '{"enable_thinking":false}'
    Without --enable-auto-tool-choice --tool-call-parser qwen3_xml, vLLM won’t parse or emit tool calls at all.

  2. Nothink mode for tools — Think mode wraps output in tags that break tool call parsing in most agents. Make sure thinking is disabled for tool-calling requests.

  3. MoE gate fix — In qwen3_next.py:256, the MoE router gate must stay BF16:
    self.gate = ReplicatedLinear(…, quant_config=None) # NOT quant_config=quant_config
    Without this, routing is subtly broken — tool calls break first because they need precise structured JSON.

  4. Required env vars for SM121:
    VLLM_NVFP4_GEMM_BACKEND=marlin
    VLLM_TEST_FORCE_FP8_MARLIN=1
    VLLM_USE_FLASHINFER_MOE_FP4=0

  5. Chat template bug — Unsloth found a bug in all Qwen3.5 chat templates affecting tool calling. Check if your agent framework is using the corrected template.

What does “stops calling tools” look like exactly? Does the model generate text but not in tool call format, or does it produce empty/truncated output?
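For reference, a minimal tool-calling request body matching the flags in point 1 might look like this (the model path and the tool schema are illustrative, not from the thread):

```python
import json

# Sketch of a tool-calling request for the OpenAI-compatible endpoint,
# with thinking disabled per point 2. Model path and tool are illustrative.
payload = {
    "model": "/models/qwen35",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate a basic arithmetic expression",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }],
    "tool_choice": "auto",
    "chat_template_kwargs": {"enable_thinking": False},
}
print(payload["tools"][0]["function"]["name"])  # calculator
```

If the server is configured correctly, the response should contain a structured `tool_calls` entry rather than free text mimicking one.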

They addressed it recently, I believe.

I tried a couple of times but could not make it work. Let me try tomorrow morning with a fresh boot.

I believe it produces a broken tool call, likely related to the thinking tags. The Unsloth chat template helps to some extent but doesn’t fix it completely. I guess disabling thinking should help with that.

BTW, Intel AutoRound quants will give you better performance than NVFP4 on Marlin (~27 t/s on a single Spark) and comparable accuracy.

If you run our community Docker, the recipe is included.

I am looking forward to testing it this week. I had a few presentations and did not want to ‘break’ anything. The last one is tomorrow. I will test it then!

Thank you by the way!