Hi everyone,
I’d like to share a custom build of vLLM 0.17.1 I’ve been working on for DGX Spark (GB10 / SM121). It focuses on making large models actually runnable on our 128 GiB unified memory.
Main feature: STREAM LOADING
When new large models are released, DGX Spark users often have to wait for someone to publish a pre-quantized 4-bit version before they can try them, even when the model would fit in 128 GiB if it could be quantized to 4 bits on the fly.
The reason is that stock vLLM has to hold both the full BF16 weights and the converted 4-bit data in memory at the same time during loading. STREAM LOADING removes this constraint: it reads only the expert / layer chunks it needs from storage, quantizes them to 4 bits on the fly, and places the result on the GPU.
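To make the memory argument concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch under simplifying assumptions: 2 bytes per BF16 weight, 0.5 bytes per 4-bit weight, and no accounting for quantization scales, allocator overhead, or the KV cache. The function names are made up for illustration and are not part of vLLM or this build.

```python
from typing import Sequence

BF16_BYTES = 2.0   # bytes per BF16 weight
INT4_BYTES = 0.5   # bytes per 4-bit weight (scale metadata ignored, assumption)


def peak_bytes_eager(chunk_sizes: Sequence[int]) -> float:
    """Peak memory when every BF16 weight stays resident while the 4-bit copy is built."""
    total = sum(chunk_sizes)
    return total * BF16_BYTES + total * INT4_BYTES


def peak_bytes_streaming(chunk_sizes: Sequence[int]) -> float:
    """Peak memory when one BF16 chunk at a time is loaded, quantized, then dropped."""
    peak = 0.0
    quantized_so_far = 0.0
    for n in chunk_sizes:
        # Live data: 4-bit output so far + the current BF16 chunk + its 4-bit result.
        live = quantized_so_far + n * BF16_BYTES + n * INT4_BYTES
        peak = max(peak, live)
        quantized_so_far += n * INT4_BYTES  # the BF16 chunk is released here
    return peak


# Ten equal chunks of 100 weights each (toy numbers, not a real model).
sizes = [100] * 10
eager_peak = peak_bytes_eager(sizes)
stream_peak = peak_bytes_streaming(sizes)
print(eager_peak, stream_peak)
```

In this toy model the streaming peak approaches the size of the final 4-bit weights as the chunks get smaller, while the eager peak stays at roughly five times that.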
The following BF16 / FP8 models (i.e. NOT pre-quantized Int4 or NVFP4) have already been confirmed to run on DGX Spark:
- Qwen3.5-397B-A17B-FP8 (about 96.7 GiB weights/GPU at TP=2)
- Nemotron3-120B-A12B-BF16 (TP=1, TP=2)
- Qwen3.5-122B-A10B (TP=1, TP=2)
Models whose shards are not laid out in expert order (such as Nemotron) are also supported via random-access loading.
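For readers wondering what random access into shards can look like, here is a minimal stdlib-only Python sketch. It assumes the checkpoint uses the safetensors layout (an 8-byte little-endian header length, a JSON header with `data_offsets`, then raw tensor bytes). Whether this build's loader works this way is my assumption, and `write_shard`, `build_index`, and `read_tensor` are hypothetical helper names.

```python
import json
import struct
import tempfile
from pathlib import Path


def write_shard(path, tensors):
    """Write raw byte tensors in the safetensors layout (toy writer for the demo)."""
    header, blobs, offset = {}, [], 0
    for name, raw in tensors.items():
        header[name] = {"dtype": "U8", "shape": [len(raw)],
                        "data_offsets": [offset, offset + len(raw)]}
        blobs.append(raw)
        offset += len(raw)
    head = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(head)))  # u64 header size
        f.write(head)
        for blob in blobs:
            f.write(blob)


def build_index(paths):
    """Map tensor name -> (shard, absolute byte offset, length), reading headers only."""
    index = {}
    for path in paths:
        with open(path, "rb") as f:
            (n,) = struct.unpack("<Q", f.read(8))
            header = json.loads(f.read(n))
        base = 8 + n  # tensor data starts right after the header
        for name, meta in header.items():
            if name == "__metadata__":
                continue
            begin, end = meta["data_offsets"]
            index[name] = (path, base + begin, end - begin)
    return index


def read_tensor(index, name):
    """Seek straight to one tensor's bytes without reading the rest of the shard."""
    path, start, length = index[name]
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    a, b = Path(tmp) / "shard-a.bin", Path(tmp) / "shard-b.bin"
    # Experts are deliberately scattered across shards, out of order.
    write_shard(a, {"expert.1": b"\x01\x01", "expert.3": b"\x03\x03"})
    write_shard(b, {"expert.0": b"\x00\x00", "expert.2": b"\x02\x02"})
    index = build_index([a, b])
    ordered = [read_tensor(index, f"expert.{i}") for i in range(4)]
```

The point of the index is that the loader can walk experts in whatever order it wants to quantize them, regardless of how the shards were written.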
(Trade-off: startup time grows significantly.)
Supporting features
- NF4 quantization (a sub-mode of MXFP4): When pure MXFP4 (E2M1) loses too much output quality, NF4 uses a normal-distribution-based 16-level partition to recover precision. It is launched within the `--quantization mxfp4` framework and is selectable per-layer via environment variables such as `VLLM_NF4_LAYERS`.
- Automatic KV cache allocation: No more `--gpu-memory-utilization` tuning by hand. The default is now `auto`. The patch first allocates a minimal KV cache, then after `torch.compile` and FlashInfer JIT it releases that, recomputes the actually available memory (with the caching allocator’s fragmentation pool taken into account), and re-allocates the KV cache up to the limit.
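To illustrate the idea behind the NF4 bullet above, here is a simplified Python sketch of a normal-distribution-based 16-level codebook with per-block absmax scaling. It is not the code from this build and not the exact NF4 table from the QLoRA paper: the levels here are just the midpoints of 16 equal-probability bins of a standard normal, rescaled to [-1, 1], so unlike real NF4 there is no exact zero level. All function names are hypothetical.

```python
from statistics import NormalDist


def normal_levels(n=16):
    """n levels at the midpoints of equal-probability normal bins, scaled to [-1, 1]."""
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    m = max(abs(q) for q in qs)
    return [q / m for q in qs]


def quantize_block(block, levels):
    """Return (4-bit codes, absmax scale) for one block of weights."""
    absmax = max(abs(x) for x in block) or 1.0  # avoid dividing by zero
    codes = []
    for x in block:
        v = x / absmax
        # Nearest-level lookup; a real kernel would use a table or thresholds.
        codes.append(min(range(len(levels)), key=lambda i: abs(levels[i] - v)))
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Map codes back to approximate weights."""
    return [levels[c] * absmax for c in codes]


levels = normal_levels()
block = [0.5, -1.0, 0.25, 0.0, 0.03, -0.4]
codes, scale = quantize_block(block, levels)
restored = dequantize_block(codes, scale, levels)
```

Because the levels are denser near zero, where most weights of a roughly normal distribution sit, small values are represented with finer resolution than a uniform 16-level grid over the same range would give.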
Installation: just two pip install commands.
Repository: namake-taro/vllm-custom on GitHub
The README covers installation, environment variables, single-request and 10-concurrent decode throughput benchmarks for several models, and example launch commands.
This is a personal research project, provided as-is. I’d love to hear how it goes (or doesn’t) for your use cases.