vLLM custom for DGX Spark - STREAM LOADING and automatic KV cache

Hi everyone,

I’d like to share a custom build of vLLM 0.17.1 I’ve been working on for DGX Spark (GB10 / SM121). It focuses on making large models actually runnable on our 128 GiB unified memory.

Main feature: STREAM LOADING

When new large models are released, DGX Spark users often have to wait for someone to publish a 4-bit pre-quantized version before we can try them — even when the model would fit in 128 GiB if only it could be quantized to 4 bits on the fly.

The reason is that stock vLLM has to hold both the full BF16 weights and the converted 4-bit data in memory at the same time during loading. STREAM LOADING removes this constraint by reading only the necessary expert / layer chunks from storage, quantizing each chunk to 4 bits on the fly, and placing the result on the GPU.
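To make the memory argument concrete, here is a minimal Python sketch of the chunk-by-chunk idea. It is not the code from the fork: `iter_chunks`, `quantize_4bit` and `stream_load` are hypothetical names, the weights are random stand-ins, and the symmetric absmax quantizer with a block size of 32 is only a placeholder for MXFP4/NF4. The point it illustrates is that only one raw chunk is alive at a time, so the peak stays far below the full BF16 size.

```python
import random
from typing import Iterable, Iterator, List, Tuple

BLOCK = 32  # elements sharing one scale (block size assumed for this sketch)


def iter_chunks(n_chunks: int, chunk_elems: int, seed: int = 0) -> Iterator[List[float]]:
    """Stand-in for reading one expert/layer chunk at a time from storage."""
    rng = random.Random(seed)
    for _ in range(n_chunks):
        yield [rng.gauss(0.0, 0.02) for _ in range(chunk_elems)]


def quantize_4bit(chunk: List[float]) -> Tuple[bytes, List[float]]:
    """Symmetric absmax 4-bit quantization; two codes are packed per byte."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(chunk), BLOCK):
        block = chunk[start:start + BLOCK]
        scale = max(abs(v) for v in block) / 7.0 or 1.0  # avoid a zero scale
        scales.append(scale)
        # Round to an integer in [-7, 7], then shift into the 4-bit range [1, 15].
        codes.extend(max(-7, min(7, round(v / scale))) + 8 for v in block)
    if len(codes) % 2:
        codes.append(8)  # pad with the code that represents zero
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, scales


def stream_load(chunks: Iterable[List[float]], src_bytes_per_elem: int = 2):
    """Quantize chunk by chunk and track the peak number of live bytes."""
    resident = []        # quantized chunks already "placed on the GPU"
    resident_bytes = 0
    peak_bytes = 0
    for chunk in chunks:
        raw_bytes = len(chunk) * src_bytes_per_elem  # BF16 source size
        packed, scales = quantize_4bit(chunk)
        q_bytes = len(packed) + len(scales)  # assumes one byte per block scale
        # Live at this moment: earlier quantized chunks + this raw chunk + its 4-bit copy.
        peak_bytes = max(peak_bytes, resident_bytes + raw_bytes + q_bytes)
        resident.append((packed, scales))
        resident_bytes += q_bytes
        # The raw chunk is dropped when the loop moves to the next one.
    return resident, resident_bytes, peak_bytes


n_chunks, chunk_elems = 8, 4096
_, resident_bytes, peak_bytes = stream_load(iter_chunks(n_chunks, chunk_elems))
raw_total = n_chunks * chunk_elems * 2
print(f"raw BF16 total       : {raw_total} bytes")
print(f"quantized total      : {resident_bytes} bytes")
print(f"streaming peak       : {peak_bytes} bytes")
print(f"load-all-then-convert: {raw_total + resident_bytes} bytes")
```

With real models the chunk would be an expert or a layer rather than a few thousand values, but the accounting is the same: the peak is roughly the quantized model plus one raw chunk.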

The following BF16 / FP8 models (i.e. NOT pre-quantized Int4 or NVFP4) have already been confirmed to run on DGX Spark:

  • Qwen3.5-397B-A17B-FP8 (about 96.7 GiB weights/GPU at TP=2)
  • Nemotron3-120B-A12B-BF16 (TP=1, TP=2)
  • Qwen3.5-122B-A10B (TP=1, TP=2)

Models whose shards are not laid out in expert order (such as Nemotron) are also supported via random-access loading.

(Trade-off: startup time grows significantly.)

Supporting features

  • NF4 quantization (a sub-mode of MXFP4): When pure MXFP4 (E2M1) loses too much output quality, NF4 uses a normal-distribution-based 16-level partition to recover precision. It is launched within the --quantization mxfp4 framework and is selectable per-layer via environment variables such as VLLM_NF4_LAYERS.
  • Automatic KV cache allocation: No more --gpu-memory-utilization tuning by hand. The default is now auto. The patch first allocates a minimal KV cache, then after torch.compile and FlashInfer JIT it releases that, recomputes the actually available memory (with the caching allocator’s fragmentation pool taken into account), and re-allocates the KV cache up to the limit.
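As a rough illustration of the second phase of the automatic KV cache allocation described above, the arithmetic can be sketched as follows. This is not the patch itself: `KVShape`, `final_kv_blocks` and every number in the example are hypothetical, and the per-block size uses the usual "K and V, per layer, per token" accounting for a paged KV cache.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class KVShape:
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int = 2    # e.g. a BF16 KV cache
    block_tokens: int = 16  # tokens per KV block

    def block_bytes(self) -> int:
        # K and V tensors, for every layer, for every token in the block.
        return (2 * self.num_layers * self.num_kv_heads * self.head_dim
                * self.dtype_bytes * self.block_tokens)


def final_kv_blocks(free_after_warmup: int, fragmentation: int, headroom: int,
                    shape: KVShape, minimal_blocks: int) -> int:
    """Phase 2: size the KV cache from memory measured after warmup.

    free_after_warmup is assumed to be measured once compile/JIT warmup has
    finished and the minimal phase-1 cache has been released.
    """
    usable = free_after_warmup - fragmentation - headroom
    # Never go below the minimal cache that phase 1 already proved workable.
    return max(minimal_blocks, usable // shape.block_bytes())


# Illustrative numbers only (not measurements from the fork).
shape = KVShape(num_layers=48, num_kv_heads=8, head_dim=128)
blocks = final_kv_blocks(free_after_warmup=52 * GIB, fragmentation=2 * GIB,
                         headroom=4 * GIB, shape=shape, minimal_blocks=64)
print(f"{shape.block_bytes() / 2**20:.0f} MiB per block, "
      f"{blocks} blocks, {blocks * shape.block_tokens} tokens of KV cache")
```

Measuring after warmup matters because compilation and JIT kernels take memory that is not known before they run; sizing the cache first would mean guessing that overhead.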

Installation: just two pip install commands.
Repository: namake-taro/vllm-custom on GitHub

The README covers installation, environment variables, single-request and 10-concurrent decode throughput benchmarks for several models, and example launch commands.

This is a personal research project, provided as-is. I’d love to hear how it goes (or doesn’t) for your use cases.

Hello!

Thank you very much for your research.

Have you considered implementing your vLLM stream loading on top of GreenBoost? And if you did and decided against it, would you mind sharing your design decisions?

GreenBoost [0,1] is a CUDA memory orchestrator for Linux, implemented as a kernel module. It transparently extends the available memory using system NVMe storage (at a performance cost, of course). Its goal is to allow running models that exceed system memory without modifying the inference software.

In a way, stream loading could be implemented on top of it, especially for MoE models. The model weights would be moved from memory regions mapped onto NVMe storage to native memory as needed, and on-the-fly quantisation could be pipelined accordingly.

I welcome your thoughts and insights on the above. Thanks again.

[0] https://gitlab.com/IsolatedOctopi/nvidia_greenboost

[1] https://forums.developer.nvidia.com/t/nvidia-greenboost-kernel-modules-opensourced/363486?u=adg1

@adg1 Thanks for pointing me to GreenBoost — I went through the README and the relevant parts of the source.

To get the conclusion out of the way: I don’t think implementing STREAM LOADING on top of GreenBoost would work well on DGX Spark / GB10. Both projects address the question “how do you actually run large LLMs?”, but they target different hardware classes and go in essentially opposite directions. Let me lay out my understanding — please correct me if I’m misreading anything.

As I read it, GreenBoost is built around a discrete GPU with separate system RAM:

  • T1: NVIDIA RTX GPU GDDR7 (~336 GB/s)
  • T2: System RAM (~32-64 GB/s, over PCIe 4/5)
  • T3: NVMe swap (~1.8 GB/s)

It’s exactly because T1 and T2 are physically separated by the PCIe bus that exposing T2 as “extended VRAM” via pinned pages + DMA-BUF makes sense. The core mechanism, as I understand it, is that libgreenboost_cuda.so intercepts cudaMalloc etc., and the cuDeviceTotalMem_v2 hook reports “T1+T2 combined size” so frameworks like Ollama don’t fall back to CPU compute. Is that roughly right?

GB10 is a unified memory architecture: the GPU and the CPU share the same 128 GiB DRAM (~273 GB/s), and cudaMalloc() / malloc() come from the same physical pool. That breaks the foundation in three ways:

  1. Tiering is meaningless — GreenBoost’s T1 and T2 are physically the same memory on GB10.
  2. NVMe destroys decode performance — the only “added capacity” left is T3, but at 1.8 GB/s vs 273 GB/s (≈152x), spilling any weight to NVMe collapses bandwidth-bound decode. With MoE models in particular, a routed expert on NVMe would stall the GPU every step.
  3. The cuDeviceTotalMem_v2 hook would collide with vLLM’s automatic KV cache: double-counting the physical capacity would make vLLM allocate KV cache far beyond what’s actually available and OOM immediately.

GreenBoost places weights in physical memory as-is — Qwen3.5-397B-A17B-FP8 (~397 GB) won’t fit in 128 GiB × TP=2 + NVMe by tiering alone. STREAM LOADING converts the weights to MXFP4/NF4 during loading, shrinking them to roughly 1/2 (from FP8) or 1/4 (from BF16), which is what makes a 397B MoE runnable across two DGX Sparks at all. The trade-off is much longer startup, but once loaded everything stays in unified memory and inference can use the full ~273 GB/s.
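The fit argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes a flat 397e9 parameters, an even split across two nodes at TP=2, and that every weight is quantized the same way, none of which is guaranteed for the real model. The 4.25 bits-per-weight case assumes an MXFP4-style layout with one 8-bit shared scale per block of 32 weights.

```python
GIB = 1024 ** 3


def weights_gib(n_params: float, bits_per_param: float) -> float:
    """Weight footprint in GiB for a given average bit width."""
    return n_params * bits_per_param / 8 / GIB


N_PARAMS = 397e9       # assumed flat parameter count
NODES = 2              # TP=2 across two DGX Sparks, assumed to split evenly
NODE_MEMORY_GIB = 128  # unified memory per node

cases = [
    ("FP8 as published", 8.0),
    ("4-bit, no scale overhead", 4.0),
    ("4-bit + 8-bit scale per 32 weights", 4.25),
]
for label, bits in cases:
    per_node = weights_gib(N_PARAMS, bits) / NODES
    verdict = "fits" if per_node < NODE_MEMORY_GIB else "does not fit"
    print(f"{label:36s} {per_node:6.1f} GiB per node -> {verdict}")
```

Under these assumptions the FP8 weights need well over 128 GiB per node, while both 4-bit estimates land in the low-to-high 90s of GiB. The figure of about 96.7 GiB of weights per GPU at TP=2 quoted earlier in the thread sits between the two 4-bit estimates.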

So I see them less as competitors and more as different solutions for different hardware classes — GreenBoost for discrete GPUs with tight VRAM, STREAM LOADING for GB10 unified memory. Please let me know if I’m misunderstanding any part of GreenBoost.

I didn’t include the benchmark numbers in my original post above (they’re in the GitHub README), so adding them here for reference.

Measured with the following command:

llama-benchy --base-url http://localhost:8000/v1 --pp 2048 --tg 128 --concurrency 1 10 --runs 5 --latency-mode generation --enable-prefix-caching

  • Decode throughput in tok/s, single request (tg128, c1)

| Model | TP=1 | TP=2 |
| --- | ---: | ---: |
| gpt-oss-120b | 64.52 | 79.55 |
| Qwen3.5-35B-A3B | 64.37 | 78.45 |
| Qwen3.5-27B | 12.07 | 20.46 |
| Qwen3.5-122B-A10B | 28.17 | 41.90 |
| Nemotron3-120B-A12B-BF16 | 24.11 | 36.58 |
| Qwen3.5-397B-A17B-FP8 | - | 26.83 |

  • Decode throughput in tok/s, 10 concurrent requests (tg128, c10)

| Model | TP=1 | TP=2 |
| --- | ---: | ---: |
| gpt-oss-120b | 165.02 | 198.19 |
| Qwen3.5-35B-A3B | 208.31 | 161.37 |
| Qwen3.5-27B | 92.54 | 95.18 |
| Qwen3.5-122B-A10B | 74.90 | 75.11 |
| Nemotron3-120B-A12B-BF16 | 84.56 | 61.48 |
| Qwen3.5-397B-A17B-FP8 | - | 50.98 |

Thank you for your thoughtful analysis. Are you planning to submit a PR upstream?

I think your architectural characterisation is correct. Specifically, the unified-memory architecture collapses the tiering found on discrete GPUs, and NVMe I/O is the real bottleneck. However, this is the same bottleneck that STREAM LOADING internalises at load time, given that online quantisation can be implemented either way.

The reason I am asking is that vLLM is a moving target: PRs are merged every day into a codebase that is inherently complex. Forking it (and having to rebase your feature set frequently) can turn into a maintenance nightmare. Conversely, a virtualised memory pool, albeit slow, promises to implement the same feature set while minimising changes at the inference software layer.

Overall, apart from the NVMe performance bottleneck, which is inescapable in both designs, there are just tradeoffs to be made.

Thank you for this! I’m very curious about this: how is the speed for “chat”? I’m always on the hunt for a good (bigger) model that responds quickly in chat, especially since I’m using Agent Zero and it needs a chat model front end. But I don’t want to wait 30 seconds to 5 minutes for an answer. How do these models hold up? For example, I used Eugr’s Gemma 4 on his vLLM and it’s instantaneous in Open WebUI. It’s crazy fast. I’m not expecting big models to be that fast, but if you are getting fast speeds I’m curious. Also, your RAM usage! My other 2 models eat about 16-17 GB of RAM, and the rest can go to the chat model (well, I would like to stay at no more than 120 GB, lower if possible). What do you think, and what have you seen in your models and testing? Sadly I’m too dumb to read your measurement data :( I only have one Spark. Its only job is going to be a backend for A0. It won’t have a ton of concurrent connections, maybe 6, but in reality maybe 2 or 3.

Thanks!

Thank you for sharing, I will move this to GB10 Projects

Thanks for the kind reply.

First, let me clarify one point about where the NVMe cost actually lives. STREAM LOADING does not internalise an inference-time NVMe bottleneck. It only touches NVMe during the initial model load. Online quantisation runs precisely while the weights are being read from storage into unified memory, and once loading is finished the entire (quantised) model is resident in unified memory. As a result, the NVMe “bottleneck” simply does not appear during inference at all.

By contrast, if you tried to run GreenBoost on DGX Spark, a virtualised memory pool would have to move weights across the NVMe boundary on every decode step (or, at best, on every layer cache miss). That means each token can incur a substantial amount of storage traffic. For a 100B-class FP8 model, for example, roughly 100 GB of weights are touched per token in dense form. For a sparse MoE such as gpt-oss-120b that drops to a fraction of the total, but the activated experts still amount to several GB. The NVMe SSD in a DGX Spark measures around 3.9 GB/s on sequential reads (for instance, my DGX Spark-compatible Lenovo ThinkStation PGX has a Phison ESL04TBTLCZ-27J4-TYN, PCIe Gen4 x4). On that basis, I would expect a dense model to land somewhere around 0.03 to 0.04 tok/s (3.9 GB/s divided by 100 GB per token), and even a sparse MoE under ideal cache behaviour would likely sit at around 1 tok/s. STREAM LOADING, on the other hand, runs at close to 80 tok/s on the same hardware. That is not a gap that tuning can close: for MoE this is likely a two-orders-of-magnitude gap, and for dense models it would be three orders of magnitude.
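The estimate above is simple bandwidth arithmetic, sketched here in Python. The 3.9 GB/s, 100 GB per token and 80 tok/s figures come from this post; the 4 GB per token for the MoE case is an assumed stand-in for "several GB", and `nvme_bound_tok_s` is a hypothetical helper name.

```python
def nvme_bound_tok_s(read_gb_s: float, gb_per_token: float) -> float:
    """Upper bound on decode rate if each token's weights must come from storage."""
    return read_gb_s / gb_per_token


NVME_GB_S = 3.9     # sequential read speed quoted in the post
MEASURED = 80.0     # approximate tok/s quoted for the in-memory case

dense = nvme_bound_tok_s(NVME_GB_S, 100.0)  # ~100 GB touched per token
moe = nvme_bound_tok_s(NVME_GB_S, 4.0)      # assumption: ~4 GB of active experts

for label, rate in [("dense", dense), ("sparse MoE", moe)]:
    print(f"{label:10s} {rate:6.3f} tok/s, about {MEASURED / rate:.0f}x slower than in-memory")
```

These are ceilings that assume perfectly sequential reads and no caching, so real behaviour could be better for a MoE with good expert locality and worse for random access.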

I do not think this can fairly be described as “just a tradeoff.” The set of users, research or production, who can accept an inference path that is 100 to 1000× slower in exchange for code maintainability is, I think, extremely narrow.

On upstreaming: I would like to push as much as I can. Realistically, though, the parts that make this custom fork interesting (STREAM LOADING, the two-phase KV cache sizing, per-layer quantisation dispatch, the warmup-driven KV expansion) are all tightly coupled to the GB10 unified-memory model. They modify vLLM’s discrete-GPU assumptions in fairly invasive ways. I think it would be very difficult to ask the vLLM maintainers to take these on.

Thanks for the reply, Anglerfish.

Reading through your requirements — 16–17 GB models, a 120 GB total budget, 2–3 concurrent users, and chat-style responsiveness — my impression is that your workload sits a little outside what this fork was designed for. The main mechanisms in this fork all exist to push large models into the GB10 unified-memory budget. For models that already fit comfortably, those mechanisms are mostly overhead, and the benefit you would get from them is probably limited.

For the purposes you describe, I suspect Eugr’s docker, which you are already using, is the best choice.

I hope this helps point you in the right direction.

Thank you for your prompt reply.

This is not how GreenBoost works, and, as a result, the conclusions you draw about the expected performance do not follow from the premises.

When working with models whose post-quantisation memory footprint fits into physical memory, as in your gpt-oss-120b example, this is what would happen: with GreenBoost, model weights cross the NVMe boundary only once, namely at memory allocation time. Since they fit in memory, they would stay resident in the assigned memory region for as long as the process runs; if the assigned memory lies on the real LPDDR memory, as is possible with GreenBoost tiering, they would simply stay in that memory pool. As for the possibility that memory is paged out, this applies equally to ordinary memory management (and can be disabled as appropriate). The exact access times offered by PCIe are immaterial to the comparison, as they affect GreenBoost and STREAM LOADING equally. Ditto for cache contention. Hence, what changes in the GreenBoost (or equivalent) scenario is simply the mechanism that operationalises the loading of model weights into memory: with STREAM LOADING this is part of the forked code base, while with the alternative approach part of the loading logic is delegated to a third-party memory management layer. And this is the concrete trade-off that, in all fairness, I was referring to.

Where the performance may differ is in those scenarios where the post-quantisation memory footprint does not fit into physical memory. But this kind of scenario is closed to STREAM LOADING and open to GreenBoost, where potentially all disk space can appear as some kind of slow memory. Therefore there is nothing to compare here.

I wish you good luck with its maintenance. I know this is not going to be an easy job.

Thanks for the follow-up, adg1.

First, an apology: using gpt-oss-120b as the example last time was my mistake. gpt-oss-120b is a model whose footprint fits inside DGX Spark unified memory, so it’s not a fair case for comparing the two approaches. In that situation, you’re absolutely right — GreenBoost loads the weights once, they stay resident in LPDDR, and there’s no need to cross the NVMe boundary again at inference. For a model in that range, the difference between the two really does reduce to “where the loading logic lives,” exactly as you described.

The fork is focused on a different size class. Consider, for instance, running Qwen3.5-122B-A10B (BF16) on a single DGX Spark (TP=1), or Qwen3.5-397B-A17B-FP8 on TP=2.

Suppose, for the sake of illustration, that we have a model with 400 GB of BF16 weights. What STREAM LOADING does is read those weights chunk by chunk and place them into memory as 100 GB of quantized data — so in principle, 100 GB of memory is enough to load the model. And NVMe is never touched at inference time.

GreenBoost on its own cannot avoid NVMe access here, since it doesn’t include a quantization mechanism. That said, in principle one could combine GreenBoost with a separate quantization layer to reach the same end result — quantized weights placed into memory.

The catch is that existing quantization mechanisms have to hold both the raw weights and the quantized weights in memory at the same time during loading. So in this scenario, 400 GB + 100 GB = 500 GB has to be addressable on GreenBoost’s virtual memory simultaneously. Most of that 500 GB will inevitably end up on the NVMe side, which means the conversion step is bottlenecked by NVMe bandwidth — substantially slower than DRAM — and the startup time would be commensurate with that speed.
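The two peaks from the 400 GB illustration can be written out as a small calculation. The function names and the 1 GB chunk size are hypothetical; the model is idealised (a fixed 4:1 compression ratio, equal-sized chunks, no other allocations).

```python
def peak_load_all_gb(raw_gb: float, ratio: float) -> float:
    """Raw and quantized weights coexist during conversion."""
    return raw_gb + raw_gb * ratio


def peak_streaming_gb(raw_gb: float, ratio: float, n_chunks: int) -> float:
    """Worst case is the last chunk: all quantized data plus one raw chunk."""
    return raw_gb * ratio + raw_gb / n_chunks


RAW_GB = 400.0   # BF16 weights in the illustration
RATIO = 0.25     # BF16 -> 4-bit
N_CHUNKS = 400   # assumption: 1 GB chunks

print(f"load all, then quantize: {peak_load_all_gb(RAW_GB, RATIO):.0f} GB")
print(f"chunk-by-chunk         : {peak_streaming_gb(RAW_GB, RATIO, N_CHUNKS):.0f} GB")
```

The first figure is the 500 GB that would have to be addressable through a virtual memory layer; the second stays within a single node's physical memory, which is why the conversion never has to spill to NVMe.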

If the trade-off against maintainability that you mentioned was referring to this loading-time cost, then I do understand the framing.

That said, you’re right that there is a regime above this one — models whose footprint doesn’t fit in physical memory even after quantization — which is out of scope for STREAM LOADING. In that range, a virtualised memory layer like GreenBoost could indeed be a viable option.

Thank you for your kind words. I’ll keep working on it. And thank you for a genuinely productive discussion.