Dgx spark benchmark performance

hxssgdev · August 27, 2025, 3:24pm

Are there any benchmark numbers, such as overall vLLM throughput, for gpt-oss-20b / 120b running on DGX Spark?

Will MXFP4 be supported on DGX Spark?

What is the compute capability of DGX Spark: is it 11.0 or 10.0? Will all existing vLLM optimisations for B200 work out of the box on GB10?

aniculescu · December 12, 2025, 10:25pm

Please check out our playbooks for how to run certain workloads like vLLM: Spark Playbooks
DGX Spark has compute capability 12.1 (sm_121).
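
For anyone who wants to check this on their own machine, here is a minimal sketch of how a compute capability pair maps to the "sm" name used in the reply above. The helper name sm_name is hypothetical, and the (12, 1) pair is hard-coded from the post rather than read from a device.

```python
def sm_name(major: int, minor: int) -> str:
    """Format a CUDA compute capability pair as an architecture name."""
    return f"sm_{major}{minor}"


# On a machine with PyTorch and a CUDA device, the pair can be read with
# torch.cuda.get_device_capability(), which returns (major, minor).
# Here the value is hard-coded from the post above instead.
GB10_CAPABILITY = (12, 1)

print(sm_name(*GB10_CAPABILITY))
```

Under that mapping, the 10.0 mentioned in the original question would be sm_100, so GB10 is a different target from either value the question suggested.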

eugr · December 16, 2025, 5:42pm

Here are my benchmarks for gpt-oss-120b for a single Spark. You will get much better performance from sglang:spark though, because mxfp4 is still not using fp4 features of blackwell on Spark in vllm, even when built from the main branch.

Also, while the NVIDIA Docker image finally works well, it still lags behind mainline vLLM, so if you want a newer version, like 0.12.0+, you need to build from source.

Or use one of the community builds here. Here is mine: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  35.74
Total input tokens:                      1364
Total generated tokens:                  2677
Request throughput (req/s):              0.28
Output token throughput (tok/s):         74.90
Peak output token throughput (tok/s):    114.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          113.06
---------------Time to First Token----------------
Mean TTFT (ms):                          676.76
Median TTFT (ms):                        739.00
P99 TTFT (ms):                           740.48
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          74.12
Median TPOT (ms):                        75.95
P99 TPOT (ms):                           108.08
---------------Inter-token Latency----------------
Mean ITL (ms):                           55.82
Median ITL (ms):                         49.50
P99 ITL (ms):                            116.22
==================================================

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  3.41
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.29
Output token throughput (tok/s):         34.94
Peak output token throughput (tok/s):    36.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          38.46
---------------Time to First Token----------------
Mean TTFT (ms):                          102.94
Median TTFT (ms):                        102.94
P99 TTFT (ms):                           102.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          27.99
Median TPOT (ms):                        27.99
P99 TPOT (ms):                           27.99
---------------Inter-token Latency----------------
Mean ITL (ms):                           27.99
Median ITL (ms):                         27.90
P99 ITL (ms):                            29.84
==================================================

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Benchmark duration (s):                  123.85
Total input tokens:                      22946
Total generated tokens:                  21691
Request throughput (req/s):              0.81
Output token throughput (tok/s):         175.14
Peak output token throughput (tok/s):    375.00
Peak concurrent requests:                100.00
Total Token throughput (tok/s):          360.41
---------------Time to First Token----------------
Mean TTFT (ms):                          4889.59
Median TTFT (ms):                        4886.37
P99 TTFT (ms):                           8781.95
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          306.41
Median TPOT (ms):                        259.15
P99 TPOT (ms):                           798.04
---------------Inter-token Latency----------------
Mean ITL (ms):                           218.97
Median ITL (ms):                         220.40
P99 ITL (ms):                            796.03
==================================================
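
As a sanity check on reports like the ones above, here is a rough sketch that parses the "Label: value" lines and recomputes output throughput from the token count and duration. The function name parse_bench_report is made up for illustration, and the formula (generated tokens divided by duration) is an assumption that happens to be consistent with the pasted numbers, not a statement about how the tool computes it internally.

```python
import re


def parse_bench_report(text: str) -> dict[str, float]:
    """Parse 'Label:   value' lines from a serving benchmark report."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        # Separator lines (==== / ----) have no 'label: number' shape and are skipped.
        match = re.match(r"^(.+?):\s+(-?\d+(?:\.\d+)?)\s*$", line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


# Excerpt of the first report above (10 requests).
REPORT = """\
Successful requests:                     10
Benchmark duration (s):                  35.74
Total input tokens:                      1364
Total generated tokens:                  2677
Output token throughput (tok/s):         74.90
"""

m = parse_bench_report(REPORT)
recomputed = m["Total generated tokens"] / m["Benchmark duration (s)"]
print(f"reported {m['Output token throughput (tok/s)']:.2f}, recomputed {recomputed:.2f}")
```

The same check works on the other reports in this thread, with small differences where the printed duration has been rounded.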

eugr · December 16, 2025, 5:46pm

SGLang for comparison:

Bench (10 prompts):

============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  33.34     
Total input tokens:                      1364      
Total generated tokens:                  2677      
Request throughput (req/s):              0.30      
Output token throughput (tok/s):         80.29     
Peak output token throughput (tok/s):    125.00    
Peak concurrent requests:                10.00     
Total Token throughput (tok/s):          121.21    
---------------Time to First Token----------------
Mean TTFT (ms):                          118.81    
Median TTFT (ms):                        119.06    
P99 TTFT (ms):                           129.64    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          62.59     
Median TPOT (ms):                        67.01     
P99 TPOT (ms):                           73.63     
---------------Inter-token Latency----------------
Mean ITL (ms):                           52.37     
Median ITL (ms):                         47.58     
P99 ITL (ms):                            84.35     
==================================================

Bench (1 prompt):

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  2.27      
Total input tokens:                      12        
Total generated tokens:                  119       
Request throughput (req/s):              0.44      
Output token throughput (tok/s):         52.37     
Peak output token throughput (tok/s):    53.00     
Peak concurrent requests:                1.00      
Total Token throughput (tok/s):          57.65     
---------------Time to First Token----------------
Mean TTFT (ms):                          49.87     
Median TTFT (ms):                        49.87     
P99 TTFT (ms):                           49.87     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.83     
Median TPOT (ms):                        18.83     
P99 TPOT (ms):                           18.83     
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.83     
Median ITL (ms):                         18.82     
P99 ITL (ms):                            21.51     
==================================================
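
To put the two sets of results side by side, here is a small sketch that computes the ratios between the vLLM and SGLang runs. All numbers are copied from the reports in this thread; the dictionary layout and the ratios helper are just one hypothetical way to organize them.

```python
# Output throughput (tok/s) and mean TTFT (ms), copied from the reports above
# (gpt-oss-120b on a single Spark).
VLLM = {
    "1 prompt": {"out_tps": 34.94, "ttft_ms": 102.94},
    "10 prompts": {"out_tps": 74.90, "ttft_ms": 676.76},
}
SGLANG = {
    "1 prompt": {"out_tps": 52.37, "ttft_ms": 49.87},
    "10 prompts": {"out_tps": 80.29, "ttft_ms": 118.81},
}


def ratios(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return how many times faster the candidate is than the baseline."""
    return {
        "throughput_gain": candidate["out_tps"] / baseline["out_tps"],
        "ttft_reduction": baseline["ttft_ms"] / candidate["ttft_ms"],
    }


for run in VLLM:
    r = ratios(VLLM[run], SGLANG[run])
    print(f"{run}: {r['throughput_gain']:.2f}x output tok/s, "
          f"{r['ttft_reduction']:.2f}x lower mean TTFT")
```

These are single runs, so the ratios are indicative only.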

christopher_owen · December 16, 2025, 7:04pm

Sorry if this is derailing the thread, but I attempted to replicate this benchmarking yesterday and I very consistently get 35 TPS for a single thread and 60 TPS with max-concurrency 2.

How do you do your testing?

This is my docker-compose.yml:

services:

  vllm:
    image: spark-vllm:eugr
    restart: unless-stopped
    ports:
      - "8000:8000"

    runtime: nvidia
    ipc: host
    ulimits:
      memlock: -1
      stack: 67108864

    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./.cache/vllm:/root/.cache/vllm
      - ./.cache/torch:/root/.cache/torch
      - ./.cache/triton:/root/.cache/triton
      - ./chat_templates/gpt-oss-120b.jinja:/workspace/chat_templates/gpt-oss-120b.jinja:ro
      - ./bench:/bench

    command: >
      vllm serve openai/gpt-oss-120b
        --host 0.0.0.0
        --port 8000
        --served-model-name gpt-oss-120b
        --override-generation-config '{"temperature":1.0,"top_p":1.0,"top_k":0}'
        --chat-template /workspace/chat_templates/gpt-oss-120b.jinja
        --enable-auto-tool-choice
        --tool-call-parser=openai
        --reasoning-parser=openai_gptoss
        --gpu-memory-utilization 0.70
        --max-model-len 131072
        --max-num-seqs 2
        --async-scheduling
        --max-num-batched-tokens 8192
        --enable-prefix-caching
        --load-format fastsafetensors

This is my script/bench.sh:

#!/usr/bin/env bash

set -euo pipefail

BASE_URL="${BASE_URL:-http://127.0.0.1:8000}"
MODEL="${MODEL:-gpt-oss-120b}"                  # served name on your server
TOKENIZER_MODEL="${TOKENIZER_MODEL:-openai/gpt-oss-120b}"  # HF id for tokenizer
DATASET_PATH="${DATASET_PATH:-/bench/ShareGPT_V3_unfiltered_cleaned_split.json}"
OUTPUT_LEN="${OUTPUT_LEN:-1024}"
NUM_PROMPTS_LIST="${NUM_PROMPTS_LIST:-"1 10"}"

# Prefer container id from compose; fallback to the old hardcoded name
CONTAINER_ID="${CONTAINER_ID:-$(docker compose ps -q vllm 2>/dev/null || true)}"
CONTAINER_ID="${CONTAINER_ID:-spark-eugr-vllm-1}"

echo "[warmup] tiny request"
curl -fsS "$BASE_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL}\",
    \"messages\": [{\"role\":\"user\",\"content\":\"Say hi.\"}],
    \"max_tokens\": 16,
    \"temperature\": 0
  }" >/dev/null

for NUM_PROMPTS in $NUM_PROMPTS_LIST; do
  echo
  echo "================ BENCH: model=${MODEL} num_prompts=${NUM_PROMPTS} output_len=${OUTPUT_LEN} ================"
  echo
  echo

  # stderr is discarded below (2>/dev/null): tqdm progress bars are often written to stderr
  docker exec -it "$CONTAINER_ID" bash -lc "
    vllm bench serve \
      --backend openai-chat \
      --base-url ${BASE_URL} \
      --endpoint /v1/chat/completions \
      --model ${TOKENIZER_MODEL} \
      --served-model-name ${MODEL} \
      --temperature 1.0 \
      --top-p 1.0 \
      --top-k 0 \
      --dataset-name sharegpt \
      --dataset-path ${DATASET_PATH} \
      --num-prompts ${NUM_PROMPTS} \
      --sharegpt-output-len ${OUTPUT_LEN} \
      --max-concurrency 2
  " 2>/dev/null
done

My output looks something like this:


spark-bcc2:~/projects/ai/spark-eugr$ bin/bench.sh
[warmup] tiny request
================ BENCH: model=gpt-oss-120b num_prompts=1 output_len=1024 ================

(...)

                            Output tokens per second
  35 +----------------------------------------------------------------------+
     |   *   ***  ***  *****  *****  ***  *****  *******  *******           |
     |* *                                                       *           |
  30 | *                                                        *           |
     |                                                           *          |
  25 |                                                           *          |
     |                                                           *          |
     |                                                           *          |
  20 |                                                            *         |
     |                                                            *         |
     |                                                            *         |
  15 |                                                            *         |
     |                                                             *        |
     |                                                             *        |
  10 |                                                             *        |
     |                                                             *        |
   5 |                                                              *       |
     |                                                              *       |
     |                                                               *      |
   0 +----------------------------------------------------------------------+
     0           5           10          15         20          25          30

                          Concurrent requests per second
    1 +---------------------------------------------------------------------+
      |                                                            *        |
      |                                                            *        |
      |                                                            *        |
  0.8 |                                                            *        |
      |                                                             *       |
      |                                                             *       |
      |                                                             *       |
  0.6 |                                                             *       |
      |                                                             *       |
      |                                                             *       |
  0.4 |                                                             *       |
      |                                                             *       |
      |                                                             *       |
      |                                                             *       |
  0.2 |                                                              *      |
      |                                                              *      |
      |                                                              *      |
      |                                                              *      |
    0 +---------------------------------------------------------------------+
      0           5          10          15          20         25          30

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Maximum request concurrency:             2         
Benchmark duration (s):                  26.28     
Total input tokens:                      12        
Total generated tokens:                  903       
Request throughput (req/s):              0.04      
Output token throughput (tok/s):         34.37     
Peak output token throughput (tok/s):    35.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          34.82     
---------------Time to First Token----------------
Mean TTFT (ms):                          35.46     
Median TTFT (ms):                        35.46     
P99 TTFT (ms):                           35.46     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.09     
Median TPOT (ms):                        29.09     
P99 TPOT (ms):                           29.09     
---------------Inter-token Latency----------------
Mean ITL (ms):                           29.29     
Median ITL (ms):                         29.07     
P99 ITL (ms):                            30.21     
==================================================


================ BENCH: model=gpt-oss-120b num_prompts=10 output_len=1024 ================

(...)

                            Output tokens per second
  70 +----------------------------------------------------------------------+
     |                                                                      |
     |                                                        *             |
  60 |*************************************************** *********         |
     |**            ** * ** *       ** * ** *      **** ***    ** *         |
  50 |               *    *         *    ** *       **   *     ** *         |
     |               *    *         *    *          *          *  *         |
     |                                   *          *          *  *         |
  40 |                                                         *   *        |
     |                                                             *        |
     |                                                             *        |
  30 |                                                             *        |
     |                                                             *        |
     |                                                             *        |
  20 |                                                             *        |
     |                                                              *       |
  10 |                                                              *       |
     |                                                              *       |
     |                                                              *       |
   0 +----------------------------------------------------------------------+
     0        20       40       60       80     100      120      140      160

                          Concurrent requests per second
    4 +---------------------------------------------------------------------+
      |              *                                                      |
  3.5 |              *                                                      |
      |              *                                                      |
      |              *                                                      |
    3 |              **    *         *    *         *    *     *            |
      |              **    *         *    *         *    *     *            |
  2.5 |              **    *         *    *         *    **    *            |
      |              **   **        **   **         *    **   **            |
    2 |************************************************************         |
      |                                                           *         |
      |                                                           *         |
  1.5 |                                                            *        |
      |                                                            *        |
    1 |                                                            *        |
      |                                                            *        |
      |                                                            *        |
  0.5 |                                                             *       |
      |                                                             *       |
    0 +---------------------------------------------------------------------+
      0        20       40      60       80      100      120     140      160

============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             2         
Benchmark duration (s):                  140.95    
Total input tokens:                      1933      
Total generated tokens:                  8116      
Request throughput (req/s):              0.07      
Output token throughput (tok/s):         57.58     
Peak output token throughput (tok/s):    62.00     
Peak concurrent requests:                4.00      
Total token throughput (tok/s):          71.29     
---------------Time to First Token----------------
Mean TTFT (ms):                          339.27    
Median TTFT (ms):                        286.81    
P99 TTFT (ms):                           683.29    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          34.03     
Median TPOT (ms):                        34.20     
P99 TPOT (ms):                           34.49     
---------------Inter-token Latency----------------
Mean ITL (ms):                           34.35     
Median ITL (ms):                         34.12     
P99 ITL (ms):                            35.83     
==================================================

Is my testing methodology flawed? Do you have a setup that makes yours quicker?

eugr · December 16, 2025, 7:43pm

Your results are nearly identical to mine for vLLM. The ones that are quicker are from SGLang.
Or, if you are looking at the 10-request run, I don’t limit max concurrency when running benches. Your bench is limited to 2, so when you run 10 requests, it still runs only two simultaneously.
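
A minimal sketch of what such a concurrency cap does, using an asyncio semaphore and fake requests. This is not the benchmark tool's actual implementation; run_capped and the sleep-based request are stand-ins for illustration.

```python
import asyncio


async def run_capped(num_requests: int, max_concurrency: int) -> int:
    """Run fake requests under a concurrency cap; return the peak number in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        async with semaphore:  # waits here while max_concurrency requests are active
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a real HTTP call
            in_flight -= 1

    await asyncio.gather(*(fake_request() for _ in range(num_requests)))
    return peak


peak = asyncio.run(run_capped(num_requests=10, max_concurrency=2))
print(f"peak requests in flight: {peak}")
```

With the cap at 2, the ten requests are processed as a stream of pairs, so aggregate throughput cannot exceed what two concurrent streams deliver.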

christopher_owen · December 16, 2025, 7:49pm

Mine had:

Output token throughput (tok/s): 34.37
Peak output token throughput (tok/s): 35.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 34.82
---------------Time to First Token----------------
Mean TTFT (ms): 35.46
Median TTFT (ms): 35.46
P99 TTFT (ms): 35.46

yours had:

Output token throughput (tok/s): 34.94
Peak output token throughput (tok/s): 36.00
Peak concurrent requests: 1.00
Total Token throughput (tok/s): 38.46
---------------Time to First Token----------------
Mean TTFT (ms): 102.94
Median TTFT (ms): 102.94
P99 TTFT (ms): 102.94

My TTFT is quite dramatically better, but you have meaningfully more TPS.

Do you suspect it’s just testing variance?

eugr · December 16, 2025, 8:07pm

Yes, I think that’s just testing variance. I would ignore total throughput numbers as it counts prefill and generation together.
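
To illustrate the point about total throughput, here is a small sketch using the two single-request vLLM runs quoted above. It assumes total throughput is (input + generated tokens) / duration, which reproduces the reported values to within about 0.05 tok/s given the rounded durations; the throughputs helper is hypothetical.

```python
def throughputs(input_tokens: int, output_tokens: int, duration_s: float) -> tuple[float, float]:
    """Return (output tok/s, total tok/s) for one benchmark run."""
    output_tps = output_tokens / duration_s
    total_tps = (input_tokens + output_tokens) / duration_s
    return output_tps, total_tps


# (input tokens, generated tokens, duration in s) from the two single-request runs.
runs = {
    "eugr": (12, 119, 3.41),
    "christopher_owen": (12, 903, 26.28),
}

for name, (inp, out, dur) in runs.items():
    out_tps, total_tps = throughputs(inp, out, dur)
    print(f"{name}: output {out_tps:.2f} tok/s, total {total_tps:.2f} tok/s")
```

Both runs generate at roughly 34 to 35 tok/s. The total figure is higher in the first run mostly because its 12 input tokens are a much larger share of a 119-token response than of a 903-token one.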

Also, TTFT depends on whether your requests hit the cache or not. For instance, if you have your container running pretty much all the time (as your docker-compose suggests), and ran benches on this dataset before, it would still be in the cache, so TTFT would be very quick, but it won’t affect token generation much.
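
A toy model of why a warm prefix cache shortens TTFT: if a prompt's leading blocks were seen before, only the remaining tokens need prefill. This is a deliberately simplified sketch, not vLLM's implementation; the ToyPrefixCache class and its block size of 16 are assumptions made for illustration.

```python
class ToyPrefixCache:
    """Toy model: remembers prompt prefixes in fixed-size blocks."""

    def __init__(self, block_size: int = 16) -> None:
        self.block_size = block_size
        self.seen: set[tuple[int, ...]] = set()

    def tokens_to_prefill(self, tokens: list[int]) -> int:
        """Return how many tokens still need prefill, then remember this prompt."""
        cached = 0
        # Walk full blocks from the start and stop at the first unseen prefix.
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if tuple(tokens[:end]) in self.seen:
                cached = end
            else:
                break
        # Record every full-block prefix of this prompt for later requests.
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            self.seen.add(tuple(tokens[:end]))
        return len(tokens) - cached


cache = ToyPrefixCache(block_size=16)
prompt = list(range(100))  # stand-in token ids for one dataset prompt

first = cache.tokens_to_prefill(prompt)   # cold cache: the whole prompt
second = cache.tokens_to_prefill(prompt)  # warm cache: only the tail past the last full block
print(first, second)
```

In this model a repeated prompt needs almost no prefill work, which shortens the time to the first token but leaves the per-token decode cost unchanged.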

christopher_owen · December 16, 2025, 9:06pm

I’m not so sure about the TTFT, because your benchmarking technique is taking random samples from the ShareGPT_V3_unfiltered_cleaned_split.json?

eugr · December 16, 2025, 9:29pm

Yes, but if you run one bench for 100, for instance, you’re very likely to get cache hits on subsequent runs, even for a single request. Your TTFT looks too good to be true though.

eugr · December 16, 2025, 9:42pm

I’ll try to launch with your parameters. I guess your TTFT is because of a larger batch (8192 vs. the default of 2048).

eugr · December 16, 2025, 9:56pm

OK, it looks like I was right about prefix caching. The thing is that, before today, the bench was actually picking requests sequentially from the dataset unless the dataset was randomly generated (“random”). I thought that was the case, so I tested it before responding to you; it didn’t do that, so I assumed I had been wrong.

But then I stumbled upon this commit from earlier today. Since I built my container an hour ago, it was already incorporated into my bench: [Bugfix] Fix prefix_repetition routing in bench throughput (#29663) · vllm-project/vllm@676db55 · GitHub

TL;DR: if you rebuild the container and run the most recent version of vllm bench serve, you will get TTFT numbers closer to mine.

christopher_owen · December 16, 2025, 11:04pm

If I restart the Docker container before running the bench, I get the same TTFT.

I am actively editing and rebuilding your Dockerfile right now, so I believe I have the latest version, if this is your default?

Maybe some of the volumes I mount in the docker-compose.yml are keeping vLLM in a ‘steady-state’ instead of the ‘cold-state’ for the bench?

P.S. Are you interested in PRs for touch-ups?

eugr · December 16, 2025, 11:41pm

My latest has nightly pytorch and flashinfer builds that I haven’t pushed to GitHub yet, but otherwise the main branch is up-to-date. I believe I pushed the last update yesterday.

I’m open to PRs, of course. I will need to make some changes in my setup, as the GitHub repo is a replica of my private GitLab server repo with one-way sync, but I can work around it.

christian.weyer · December 21, 2025, 11:52am

Would your repo also be a good choice for running vLLM on just a single DGX Spark?
Or put another way: are you aware of this solution and know the differences to your setup? Avarok/vllm-dgx-spark · Hugging Face

eugr · December 21, 2025, 5:31pm

Yes, sure, it works just as well on a single Spark.

As for the linked repository, it seems to be built upon an early version of my docker build, based on the inclusion of the CMakefiles.txt patch that has not been needed for a while now. It also removes Triton for some reason (the kernels were broken at some point, but Triton itself wasn’t) and is missing some other useful stuff.

christian.weyer · December 21, 2025, 5:43pm

Awesome, thanks. Will try your repo tomorrow :-)
