Hi everyone,
I’ve quantized Qwen3.5-122B-A10B (Alibaba’s latest multimodal MoE model) from BF16 to NVFP4 so it fits on a single DGX
Spark. The original model is ~234 GB, which exceeds the Spark’s 128 GB of unified memory. The quantized version is 75.6 GB,
leaving ~52 GB of headroom for KV cache and vLLM overhead.
Sharing it here so other Spark owners can use it directly without going through the quantization process.
HuggingFace: alpertor/Qwen3.5-122B-A10B-NVFP4
—
About the model
Qwen3.5-122B-A10B is a multimodal Mixture-of-Experts model with:
- 122B total parameters, ~10B active per token
- 256 experts per layer (8 active), 48 layers
- Hybrid attention: DeltaNet (linear) + standard full attention
- Supports text, image, and video understanding
- Think/no-think mode for reasoning tasks
—
Quantization details
Format: NVFP4 (4-bit floating point weights, FP8 per-group scales, group size 16)
Original size: 234 GB (BF16)
Quantized size: 75.6 GB
Compression ratio: ~3.1x
Tool: vllm-project/llm-compressor + compressed-tensors
Calibration: 512 samples from ultrachat_200k, 2048 max sequence length
All MoE experts calibrated: Yes (moe_calibrate_all_experts=True)
Hardware: 4x NVIDIA H100 80GB (Vast.ai, ~1.5 hours)
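As a sanity check on the numbers above: NVFP4 costs 4 bits per weight plus one 8-bit scale per group of 16, i.e. 4.5 bits/weight, so fully quantized tensors shrink ~3.56x. The observed overall ratio is lower because some tensors stay in BF16 (see below). A quick back-of-the-envelope calculation, treating the checkpoint as pure weight bytes (the implied "fraction quantized" is an estimate, not a measured number):

```python
# Back-of-the-envelope check on the ~3.1x compression ratio.
BF16_BITS = 16
# NVFP4: 4-bit weights plus one FP8 scale shared by each group of 16
NVFP4_BITS = 4 + 8 / 16              # = 4.5 bits per weight
ideal_ratio = BF16_BITS / NVFP4_BITS
print(f"ideal ratio for quantized tensors: {ideal_ratio:.2f}x")  # ~3.56x

orig_gb, quant_gb = 234.0, 75.6
overall = orig_gb / quant_gb
print(f"overall ratio: {overall:.2f}x")  # ~3.10x

# Solve orig*(1-f) + orig*f/ideal_ratio = quant for f, the fraction
# of bytes actually quantized (the remainder stayed in BF16).
f = (1 - quant_gb / orig_gb) / (1 - 1 / ideal_ratio)
print(f"implied fraction quantized: {f:.0%}")
```

The implied fraction comes out around 94%, which is consistent with only the router, lm_head, DeltaNet layers, vision encoder, and norms staying in BF16.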
Quantized to FP4 (reduced precision):
- MoE expert weights — gate_proj, up_proj, down_proj across all 256 experts × 48 layers
- Full attention projection weights (self_attn Q/K/V/O)
- Shared expert weights
Kept at BF16 (full precision):
- lm_head (output generation layer)
- MoE router/gate networks (expert selection)
- Shared expert gate
- DeltaNet / linear attention layers
- Vision encoder (all visual processing)
- Layer norms and embeddings
This means all routing decisions, vision processing, and output generation run at full precision. Only the bulk computation
(expert FFN and attention projections) is quantized.
—
How to use on DGX Spark
Step 1: Download the model (~75.6GB)
pip install "huggingface_hub[cli]"
huggingface-cli download alpertor/Qwen3.5-122B-A10B-NVFP4 \
--local-dir /models/Qwen3.5-122B-A10B-NVFP4
Step 2: Serve with eugr’s spark-vllm-docker
I’m using @eugr’s spark-vllm-docker which is specifically optimized for DGX Spark. If you haven’t set it up yet, check out
eugr’s repo — it handles all the vLLM configuration for Spark’s unified memory architecture.
Example serving configuration:
--model /models/Qwen3.5-122B-A10B-NVFP4
--quantization compressed-tensors
--trust-remote-code
--max-model-len 4096
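Once the server is up, you can hit vLLM’s OpenAI-compatible endpoint. A minimal sketch using only the standard library (port 8000 is vLLM’s default; the port and model name are assumptions, adjust to whatever your spark-vllm-docker setup actually exposes):

```python
# Minimal client sketch for vLLM's OpenAI-compatible chat endpoint.
import json
import urllib.request

payload = {
    "model": "/models/Qwen3.5-122B-A10B-NVFP4",   # must match --model
    "messages": [{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```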
I’ll update this thread with actual serving results, throughput numbers, and any configuration tweaks once I have it fully
running.
—
Notes and caveats
- Compatibility: This model requires transformers >= 5.1.0 for the qwen3_5_moe model type. The current llm-compressor
(v0.9.1) officially supports transformers <= 4.57.6, so some patching was required during quantization. The saved model
itself should load fine with vLLM.
- Expert weight packing: llm-compressor correctly quantized and packed the shared expert weights but left the MoE expert
weights in BF16 during save (appears to be a known issue with MoE models). I post-processed the shards to manually pack the
expert weights to NVFP4 format (uint8 packed + FP8 scales). The calibration data was used to determine optimal per-group
scales before packing.
- Quality: I expect typical FP4 quantization accuracy (~1-3% benchmark degradation vs BF16). If anyone runs evals or notices
quality issues, please share your findings.
- Vision: The vision encoder is preserved at full BF16 precision. Multimodal capabilities (image/video understanding) should
work as expected, though I haven’t extensively tested this yet.
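For anyone curious what the manual expert-weight packing mentioned above actually involves: two 4-bit codes go into each uint8. The nibble layout below (sign bit plus 3-bit magnitude index, low nibble first) is illustrative only; the real packer must follow the compressed-tensors layout exactly.

```python
# Illustrative sketch of packing FP4 (E2M1) codes two-per-byte into uint8.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def to_nibble(value):
    """Map a signed E2M1 value to a 4-bit code: sign bit + 3-bit index."""
    idx = E2M1.index(abs(value))
    sign = 8 if value < 0 else 0
    return sign | idx

def pack_fp4(values):
    """Pack an even-length list of E2M1 values into bytes, two per uint8."""
    assert len(values) % 2 == 0
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append(to_nibble(lo) | (to_nibble(hi) << 4))
    return bytes(out)

packed = pack_fp4([1.5, -6.0, 0.0, 4.0])
print(packed.hex())  # → f360
```

This halving (two weights per byte) plus the FP8 per-group scales is where the 4.5 bits/weight figure comes from.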
—
Quantization process for the curious
For anyone who wants to reproduce or quantize other large models for Spark:
1. Rented 4x H100 80GB on Vast.ai (~$6.40/hr)
2. Used llm-compressor’s oneshot() with QuantizationModifier(scheme="NVFP4") and calibration on ultrachat_200k
3. Needed transformers 5.2.0 (patched two compatibility issues with llm-compressor)
4. Post-processed safetensors shards on CPU to pack MoE expert weights from BF16 to uint8 (FP4 packed)
5. Uploaded to HuggingFace
Total cost was under $15 including failed attempts and model downloads.
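The steps above can be sketched as a recipe/config fragment like the following. Treat it as a starting point only: the base-model repo id and the ignore patterns are my guesses (match them against the actual module names in the checkpoint), and argument names vary between llm-compressor versions.

```python
# Sketch of the oneshot quantization call (step 2 above), not a
# drop-in script — needs the GPUs and patched versions described above.
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-122B-A10B",        # assumed hub id, check the actual repo
    torch_dtype="auto", device_map="auto",
)
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    # Keep routing, output head, linear attention and vision in BF16.
    # Illustrative patterns — verify against the real module names.
    ignore=["lm_head", "re:.*gate$", "re:.*linear_attn.*", "re:.*visual.*"],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Qwen3.5-122B-A10B-NVFP4",
)
```

Remember the caveat from the notes above: on this model the saved MoE expert weights still needed post-processing into packed NVFP4 after oneshot() finished.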
—
If you try this model on your Spark, please share your experience — especially serving configs that work well, throughput
numbers, and any quality observations. Happy to answer questions about the quantization process.