Rough setup notes from a single-node DGX Spark run with spark-vllm-docker / TF5 image:
- Target model:
zdy1995love/Mistral-Medium-3.5-128B-NVFP4 - Draft model:
mistralai/Mistral-Medium-3.5-128B-EAGLE - Hardware: single DGX Spark / GB10, no tensor parallelism, no cluster
- vLLM:
0.20.2rc1.dev6+g894a02500.d20260504 - Context tested here: 16k
- Loader used here:
--load-format auto
Approximate launch command:
./launch-cluster.sh -t vllm-node-tf5 --solo exec vllm serve \
zdy1995love/Mistral-Medium-3.5-128B-NVFP4 \
--served-model-name mistral-medium-3.5-128b-nvfp4-eagle \
--host 0.0.0.0 \
--port 8021 \
--max-model-len 16384 \
--max-num-batched-tokens 8192 \
--max-num-seqs 1 \
--gpu-memory-utilization 0.75 \
--kv-cache-dtype fp8_e4m3 \
--attention-config.backend FLASHINFER \
--tokenizer-mode mistral \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--reasoning-parser mistral \
--language-model-only \
--disable-hybrid-kv-cache-manager \
--load-format auto \
--enable-prefix-caching \
--speculative-config '{"model":"mistralai/Mistral-Medium-3.5-128B-EAGLE","num_speculative_tokens":3,"method":"eagle","max_model_len":4096}'
I actually launched through a local wrapper image (that adds a copy of llama-server’s webui) using codex to drive the whole process, but the backend vLLM args above should be the important part.
Important caveat: --load-format fastsafetensors loaded faster in smaller tests, but was unstable for this 128B model at 16k context on my Spark, using way more memory for some reason, which led to OOM crashes. --load-format auto was slow but stable. The target weights took about 559 seconds to load. vLLM reported target model memory at 72.91 GiB, GPU KV cache size 89,504 tokens, and max concurrency 5.46x for 16,384-token requests.
Generation Benchmark
Using the chat-completions benchmark that I saw around here a few weeks ago, temperature=0.0, two rounds:
| Test | Run 1 | Run 2 |
|---|---|---|
| Q&A | 256 tokens in 37.09s = 6.9 tok/s | 256 tokens in 37.12s = 6.9 tok/s |
| Code | 512 tokens in 57.68s = 8.9 tok/s | 512 tokens in 57.66s = 8.9 tok/s |
| JSON | 701 tokens in 79.50s = 8.8 tok/s | 701 tokens in 79.44s = 8.8 tok/s |
| Math | 9 tokens in 3.16s = 2.9 tok/s | 9 tokens in 3.16s = 2.8 tok/s |
| LongCode | 2048 tokens in 224.61s = 9.1 tok/s | 2048 tokens in 224.55s = 9.1 tok/s |
Prompt Processing
Method: exact chat-formatted prompt token counts came from the same live vLLM server via POST /tokenize. Each request used a different synthetic prompt with a unique prefix to avoid prefix-cache reuse. I streamed max_tokens=1 and computed prompt_tokens / TTFT, so this is a lower bound because TTFT includes HTTP/request overhead and one decode step.
Warmup:
- 3026 prompt tokens, TTFT 18.203s, 166.2 tok/s
Measured runs:
| Run | Prompt tokens | TTFT | PP lower bound |
|---|---|---|---|
| 1 | 3035 | 15.922s | 190.6 tok/s |
| 2 | 3031 | 15.924s | 190.3 tok/s |
| 3 | 3020 | 15.924s | 189.7 tok/s |
| 4 | 3045 | 15.949s | 190.9 tok/s |
| 5 | 3043 | 15.941s | 190.9 tok/s |
Average measured PP lower bound: 190.5 tok/s.
vLLM’s own 10-second log buckets during the same run showed about 302-304 tok/s prompt throughput with 0% prefix cache hit rate, but the table above is the more conservative endpoint-observed number.
I previously tested on llama.cpp’s llama-server, and saw similar prompt processing performance, but without Eagle, the token generation speed was literally about 2 tokens per second. So, 7 to 9 tokens per second is a huge jump up from that.
I haven’t really tried to do anything serious with this model, so I have no idea if it is any good or not, I just wanted to see what kind of performance was possible on a single DGX Spark node.
