Flux.2 Klein 9B on DGX Spark: 2.5x Faster Inference and 59% Lower VRAM with Vitoom Nunchaku

tonera · June 25, 2026, 3:30am

The motivation was simple: large image generation models in their original, unquantized form can be painfully slow on DGX Spark.

vitoom-nunchaku is optimized specifically for DGX Spark, with the goal of accelerating image inference and reducing VRAM usage.

Flux.2 Klein 9B Inference Benchmark

Environment: DGX Spark
Model: Flux.2 Klein 9B
Steps: 8

Configuration	Load Time	Inference Speed	Inference Time (8 steps)	Peak VRAM	Transformer VRAM	Text Encoder VRAM
fp16, no pretouch	249.748s	1.25s/it (0.80it/s)	10s	37.14GB	16.91GB	15.26GB
fp16, pretouch	15.999s	1.25s/it (0.80it/s)	10s	37.14GB	16.91GB	15.26GB
Nunchaku quantized transformer, pretouch	15.999s	1.82it/s (0.55s/it)	4s	25.61GB	5.40GB	15.26GB
Nunchaku quantized transformer + text encoder, pretouch	15.999s	1.83it/s (0.55s/it)	4s	15.21GB	5.40GB	4.86GB

Summary

Enabling pretouch significantly improves model loading time on DGX Spark. For the fp16 model, load time drops from 249.748s to 15.999s, which is about a 15.6x speedup. Pretouch does not change inference speed or VRAM usage.

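Using the Nunchaku quantized transformer improves inference performance substantially. End-to-end inference time for 8 steps drops from 10s to 4s, giving a 2.5x total inference speedup. Note that these times are rounded to whole seconds: comparing the per-step rates directly (1.25s/it is 0.80it/s, versus 1.82it/s) gives a speedup of about 2.3x. Peak VRAM decreases from 37.14GB to 25.61GB, a reduction of 11.53GB, or about 31%.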

Adding the Nunchaku quantized text encoder further reduces memory usage. Peak VRAM drops to 15.21GB, which is 21.93GB less than fp16, or about a 59% reduction. Inference speed remains roughly the same as transformer-only quantization, at around 1.83it/s.
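
For anyone who wants to re-check the ratios quoted in this summary, here is a minimal Python sketch that recomputes them from the table values. The variable and function names are my own for illustration; only the numbers come from the benchmark above.

```python
# Recompute the summary ratios from the benchmark table.
# All names here are illustrative; only the numbers come from the table.

STEPS = 8

# Load time (seconds), fp16 without and with pretouch.
LOAD_NO_PRETOUCH_S = 249.748
LOAD_PRETOUCH_S = 15.999

# Inference speed as reported: fp16 in s/it, quantized in it/s.
FP16_SECONDS_PER_STEP = 1.25
QUANT_STEPS_PER_SECOND = 1.82

# Peak VRAM (GB) for the three pretouch configurations.
PEAK_FP16_GB = 37.14
PEAK_QUANT_TRANSFORMER_GB = 25.61
PEAK_QUANT_BOTH_GB = 15.21


def percent_reduction(before, after):
    """Return the reduction from `before` to `after` as a percentage of `before`."""
    return 100.0 * (before - after) / before


load_speedup = LOAD_NO_PRETOUCH_S / LOAD_PRETOUCH_S

# Convert both speeds to it/s before comparing them.
fp16_steps_per_second = 1.0 / FP16_SECONDS_PER_STEP
per_step_speedup = QUANT_STEPS_PER_SECOND / fp16_steps_per_second

# Unrounded denoising-loop time implied by the per-step rates.
fp16_loop_s = STEPS * FP16_SECONDS_PER_STEP
quant_loop_s = STEPS / QUANT_STEPS_PER_SECOND

if __name__ == "__main__":
    print(f"load speedup:            {load_speedup:.1f}x")
    print(f"per-step speedup:        {per_step_speedup:.2f}x")
    print(f"8-step loop, fp16:       {fp16_loop_s:.1f}s")
    print(f"8-step loop, quantized:  {quant_loop_s:.1f}s")
    print(f"peak VRAM, transformer:  "
          f"-{percent_reduction(PEAK_FP16_GB, PEAK_QUANT_TRANSFORMER_GB):.0f}%")
    print(f"peak VRAM, both:         "
          f"-{percent_reduction(PEAK_FP16_GB, PEAK_QUANT_BOTH_GB):.0f}%")
```

The 2.5x headline figure uses the whole-second inference times (10s and 4s), while the per-step rates give a slightly lower ratio, so both numbers are consistent with the same measurements.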

Nunchaku wheel:

Hugging Face repo: tonera/vitoom-nunchaku
File: nunchaku-1.3.0.dev20260622+cu13.0torch2.11-cp311-cp311-linux_aarch64.whl
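
The wheel filename encodes which interpreter and platform it targets (CPython 3.11 on linux_aarch64), so it can be worth checking your environment before installing. Below is a rough Python sketch of such a check, plus an optional download helper. The helper functions are hypothetical names of mine, and the download step assumes the file sits at the root of the Hugging Face repo and that `huggingface_hub` is installed; adjust it if the repo layout differs.

```python
import platform
import sys

REPO_ID = "tonera/vitoom-nunchaku"
WHEEL = "nunchaku-1.3.0.dev20260622+cu13.0torch2.11-cp311-cp311-linux_aarch64.whl"


def parse_wheel_tags(filename):
    """Split a wheel filename into its name, version, and compatibility tags.

    Assumes the common five-field layout without an optional build tag:
    name-version-python-abi-platform.whl
    """
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def matches_this_machine(tags):
    """Return True if the wheel's Python and platform tags match this interpreter."""
    expected_python = f"cp{sys.version_info.major}{sys.version_info.minor}"
    expected_platform = f"{sys.platform}_{platform.machine()}"
    return tags["python"] == expected_python and tags["platform"] == expected_platform


def download_wheel():
    """Download the wheel and return its local path.

    Assumption: the wheel is stored at the root of the repo.
    Requires `pip install huggingface_hub`.
    """
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=WHEEL)


if __name__ == "__main__":
    tags = parse_wheel_tags(WHEEL)
    print(tags)
    print("compatible with this interpreter:", matches_this_machine(tags))
```

If the check passes, the path returned by `download_wheel()` can be passed to `pip install`. The local version suffix in the filename (`cu13.0torch2.11`) suggests the build targets CUDA 13.0 and PyTorch 2.11, which the script does not verify.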

Quantized text encoder:

Hugging Face repo: tonera/Qwen3-text-Nunchaku

Vitoom Nunchaku for DGX Spark

This version of vitoom-nunchaku is optimized specifically for DGX Spark. It is designed to accelerate image inference and reduce VRAM usage.

In addition to Flux.2 Klein 9B, it also supports the following image generation models:

tonera/Qwen-Image-2512-Lightning-Nunchaku
tonera/Chroma1-HD-SVDQ
tonera/Qwen-Image-Edit-2511-Lightning-Nunchaku
tonera/FLUX.2-klein-9b-kv-Nunchaku
tonera/FLUX.2-klein-4B-Nunchaku
Z-Image-Turbo
Flux series
SDXL series

You can also use my previously open-sourced Vitoom project to build your own local DGX Spark AI workstation: Vitoom: Browser-first multimodal AIGC + AI Agent for DGX Spark / RTX Spark. In addition to optimizing image inference performance, it supports custom local models for text, video, and audio.
