Flux.2 Klein 9B on DGX Spark: 2.5x Faster Inference and 59% Lower VRAM with Vitoom Nunchaku

The motivation was simple: unquantized large image generation models can be painfully slow on DGX Spark.

vitoom-nunchaku is optimized specifically for DGX Spark, with the goal of accelerating image inference and reducing VRAM usage.

Flux.2 Klein 9B Inference Benchmark

Environment: DGX Spark
Model: Flux.2 Klein 9B
Steps: 8

| Configuration | Load time | Inference speed | Inference time (8 steps) | Peak VRAM | Transformer VRAM | Text encoder VRAM |
| --- | --- | --- | --- | --- | --- | --- |
| fp16, no pretouch | 249.748s | 1.25s/it | 10s | 37.14GB | 16.91GB | 15.26GB |
| fp16, pretouch | 15.999s | 1.25s/it | 10s | 37.14GB | 16.91GB | 15.26GB |
| Nunchaku quantized transformer, pretouch | 15.999s | 1.82it/s | 4s | 25.61GB | 5.40GB | 15.26GB |
| Nunchaku quantized transformer + text encoder, pretouch | 15.999s | 1.83it/s | 4s | 15.21GB | 5.40GB | 4.86GB |
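
The table reports speed in two different units (s/it for fp16, it/s for the quantized runs), and peak VRAM is larger than the transformer and text encoder combined. The short Python sketch below uses only the numbers from the table to put speed on one scale and to show how much of peak VRAM lies outside those two components. The name `other_gb` is just an illustrative label for that remainder; the post does not break it down further.

```python
# Rows copied from the benchmark table:
# (configuration, speed value, speed unit, peak GB, transformer GB, text encoder GB)
ROWS = [
    ("fp16, pretouch", 1.25, "s/it", 37.14, 16.91, 15.26),
    ("quantized transformer", 1.82, "it/s", 25.61, 5.40, 15.26),
    ("quantized transformer + text encoder", 1.83, "it/s", 15.21, 5.40, 4.86),
]


def to_it_per_s(value, unit):
    """Convert a speed reading to iterations per second."""
    if unit == "it/s":
        return value
    if unit == "s/it":
        return 1.0 / value
    raise ValueError(f"unknown unit: {unit}")


def summarize(rows):
    """Return (name, it/s, GB of peak VRAM not attributed to transformer or text encoder)."""
    out = []
    for name, speed, unit, peak, transformer, text_encoder in rows:
        it_s = to_it_per_s(speed, unit)
        other_gb = peak - transformer - text_encoder
        out.append((name, round(it_s, 2), round(other_gb, 2)))
    return out


for name, it_s, other_gb in summarize(ROWS):
    print(f"{name}: {it_s:.2f} it/s, {other_gb:.2f} GB outside transformer/text encoder")
```

On a common scale, fp16 runs at 0.80it/s, so the per-step rate of the quantized transformer is roughly 2.3x higher. The remainder of peak VRAM stays at about 5GB in every configuration.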

Summary

Enabling pretouch significantly improves model loading time on DGX Spark. For the fp16 model, load time drops from 249.748s to 15.999s, about a 15.6x speedup. Pretouch does not change inference speed or VRAM usage.

Using the Nunchaku quantized transformer improves inference performance substantially. End-to-end inference time for 8 steps drops from 10s to 4s (both rounded to whole seconds), a 2.5x total inference speedup; the per-step rate goes from 1.25s/it (0.80it/s) to 1.82it/s, about 2.3x. Peak VRAM decreases from 37.14GB to 25.61GB, a reduction of 11.53GB, or about 31%.

Adding the Nunchaku quantized text encoder further reduces memory usage. Peak VRAM drops to 15.21GB, which is 21.93GB less than fp16, or about a 59% reduction. Inference speed remains roughly the same as transformer-only quantization, at around 1.83it/s.
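
The ratios quoted in this summary can be rechecked from the table values with a few lines of Python. This is only an arithmetic sketch; the helper names `speedup` and `reduction` are illustrative and not part of any library.

```python
def speedup(before, after):
    """How many times faster 'after' is than 'before' (both in seconds)."""
    return before / after


def reduction(before, after):
    """Return (absolute saving, saving as a percentage of 'before')."""
    saved = before - after
    return saved, 100.0 * saved / before


# Values taken from the benchmark table.
load_speedup = speedup(249.748, 15.999)        # fp16 load time, no pretouch vs pretouch
infer_speedup = speedup(10.0, 4.0)             # 8-step inference time, fp16 vs quantized
saved_tr, pct_tr = reduction(37.14, 25.61)     # peak VRAM, quantized transformer only
saved_all, pct_all = reduction(37.14, 15.21)   # peak VRAM, transformer + text encoder

print(f"load speedup: {load_speedup:.1f}x")
print(f"inference speedup: {infer_speedup:.1f}x")
print(f"transformer only: -{saved_tr:.2f} GB ({pct_tr:.0f}%)")
print(f"transformer + text encoder: -{saved_all:.2f} GB ({pct_all:.0f}%)")
```

The results match the figures above: about 15.6x for loading, 2.5x for 8-step inference, and VRAM savings of 11.53GB (about 31%) and 21.93GB (about 59%).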

Nunchaku wheel:

  • Hugging Face repo: tonera/vitoom-nunchaku
  • File: nunchaku-1.3.0.dev20260622+cu13.0torch2.11-cp311-cp311-linux_aarch64.whl
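
The wheel filename carries its compatibility tags: `cp311` (CPython 3.11) and `linux_aarch64`, and the local version suffix `+cu13.0torch2.11` appears to name the CUDA and PyTorch versions it was built against. The sketch below splits the filename using the standard wheel naming layout and compares the tags with the running interpreter. It is a simplified check for illustration (pip's real tag matching is more involved), and the function names are hypothetical.

```python
import platform
import sys

WHEEL = "nunchaku-1.3.0.dev20260622+cu13.0torch2.11-cp311-cp311-linux_aarch64.whl"


def parse_wheel_tags(filename):
    """Split a wheel filename into name, version, Python, ABI, and platform tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename")
    parts = filename[: -len(".whl")].split("-")
    # Layout: name-version[-build]-python-abi-platform
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename layout")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def interpreter_matches(tags, version_info=None, system=None, machine=None):
    """Simplified check: CPython major/minor and OS/architecture must match the tags."""
    major, minor = (version_info or sys.version_info)[:2]
    system = system or platform.system()
    machine = machine or platform.machine()
    expected_python = f"cp{major}{minor}"
    expected_platform = f"{system.lower()}_{machine}"
    return tags["python"] == expected_python and tags["platform"] == expected_platform


tags = parse_wheel_tags(WHEEL)
print(tags)
print("matches this interpreter:", interpreter_matches(tags))
```

On a machine where this prints `False`, the Python version or architecture differs from what the wheel was built for, so installing it with pip is likely to be rejected.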

Quantized text encoder:

  • Hugging Face repo: tonera/Qwen3-text-Nunchaku

Vitoom Nunchaku for DGX Spark

This version of vitoom-nunchaku is optimized specifically for DGX Spark. It is designed to accelerate image inference and reduce VRAM usage.

In addition to Flux.2 Klein 9B, it also supports the following image generation models:

  • tonera/Qwen-Image-2512-Lightning-Nunchaku
  • tonera/Chroma1-HD-SVDQ
  • tonera/Qwen-Image-Edit-2511-Lightning-Nunchaku
  • tonera/FLUX.2-klein-9b-kv-Nunchaku
  • tonera/FLUX.2-klein-4B-Nunchaku
  • Z-Image-Turbo
  • Flux series
  • SDXL series

You can also use my previously open-sourced Vitoom project to build your own local DGX Spark AI workstation: Vitoom: Browser-first multimodal AIGC + AI Agent for DGX Spark / RTX Spark. Beyond optimizing image inference performance, it supports custom local models for text, video, and audio.
