Performance report: Mistral Medium 3.5 128B NVFP4 + EAGLE

Rough setup notes from a single-node DGX Spark run with spark-vllm-docker / TF5 image:

  • Target model: zdy1995love/Mistral-Medium-3.5-128B-NVFP4
  • Draft model: mistralai/Mistral-Medium-3.5-128B-EAGLE
  • Hardware: single DGX Spark / GB10, no tensor parallelism, no cluster
  • vLLM: 0.20.2rc1.dev6+g894a02500.d20260504
  • Context tested here: 16k
  • Loader used here: --load-format auto

Approximate launch command:

./launch-cluster.sh -t vllm-node-tf5 --solo exec vllm serve \
  zdy1995love/Mistral-Medium-3.5-128B-NVFP4 \
  --served-model-name mistral-medium-3.5-128b-nvfp4-eagle \
  --host 0.0.0.0 \
  --port 8021 \
  --max-model-len 16384 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.75 \
  --kv-cache-dtype fp8_e4m3 \
  --attention-config.backend FLASHINFER \
  --tokenizer-mode mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --language-model-only \
  --disable-hybrid-kv-cache-manager \
  --load-format auto \
  --enable-prefix-caching \
  --speculative-config '{"model":"mistralai/Mistral-Medium-3.5-128B-EAGLE","num_speculative_tokens":3,"method":"eagle","max_model_len":4096}'

I actually launched through a local wrapper image (that adds a copy of llama-server’s webui) using codex to drive the whole process, but the backend vLLM args above should be the important part.

Important caveat: --load-format fastsafetensors loaded faster in smaller tests, but was unstable for this 128B model at 16k context on my Spark, using way more memory for some reason, which led to OOM crashes. --load-format auto was slow but stable. The target weights took about 559 seconds to load. vLLM reported target model memory at 72.91 GiB, GPU KV cache size 89,504 tokens, and max concurrency 5.46x for 16,384-token requests.

Generation Benchmark

Using the chat-completions benchmark that I saw around here a few weeks ago, temperature=0.0, two rounds:

Test Run 1 Run 2
Q&A 256 tokens in 37.09s = 6.9 tok/s 256 tokens in 37.12s = 6.9 tok/s
Code 512 tokens in 57.68s = 8.9 tok/s 512 tokens in 57.66s = 8.9 tok/s
JSON 701 tokens in 79.50s = 8.8 tok/s 701 tokens in 79.44s = 8.8 tok/s
Math 9 tokens in 3.16s = 2.9 tok/s 9 tokens in 3.16s = 2.8 tok/s
LongCode 2048 tokens in 224.61s = 9.1 tok/s 2048 tokens in 224.55s = 9.1 tok/s

Prompt Processing

Method: exact chat-formatted prompt token counts came from the same live vLLM server via POST /tokenize. Each request used a different synthetic prompt with a unique prefix to avoid prefix-cache reuse. I streamed max_tokens=1 and computed prompt_tokens / TTFT, so this is a lower bound because TTFT includes HTTP/request overhead and one decode step.

Warmup:

  • 3026 prompt tokens, TTFT 18.203s, 166.2 tok/s

Measured runs:

Run Prompt tokens TTFT PP lower bound
1 3035 15.922s 190.6 tok/s
2 3031 15.924s 190.3 tok/s
3 3020 15.924s 189.7 tok/s
4 3045 15.949s 190.9 tok/s
5 3043 15.941s 190.9 tok/s

Average measured PP lower bound: 190.5 tok/s.

vLLM’s own 10-second log buckets during the same run showed about 302-304 tok/s prompt throughput with 0% prefix cache hit rate, but the table above is the more conservative endpoint-observed number.

I previously tested on llama.cpp’s llama-server, and saw similar prompt processing performance, but without Eagle, the token generation speed was literally about 2 tokens per second. So, 7 to 9 tokens per second is a huge jump up from that.

I haven’t really tried to do anything serious with this model, so I have no idea if it is any good or not, I just wanted to see what kind of performance was possible on a single DGX Spark node.

My thought process