DGX Spark benchmark performance

Are there any benchmark numbers, such as overall vLLM throughput for gpt-oss-20b / gpt-oss-120b running on DGX Spark?

Will mxfp4 be supported on DGX Spark?

What is the compute capability of DGX Spark: 11.0 or 10.0? Will all existing vLLM optimisations for B200 work out of the box on GB10?

Please check out our playbooks for how to run certain workloads like vLLM: Spark Playbooks
DGX Spark has compute capability 12.1 (sm_121).
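If you want to confirm this on your own machine, here is a minimal sketch, assuming PyTorch with CUDA support is installed. The helper names `sm_name` and `local_sm_name` are made up for illustration. sm121 corresponds to compute capability 12.1, so neither 10.0 nor 11.0, which is why kernels built only for B200 (sm_100) are not guaranteed to work unchanged.

```python
def sm_name(major: int, minor: int) -> str:
    """Format a CUDA compute capability pair as an sm_XY architecture name."""
    return f"sm_{major}{minor}"


def local_sm_name():
    """Return the sm_XY name of GPU 0, or None if PyTorch/CUDA is unavailable."""
    try:
        import torch  # assumption: PyTorch is installed in this environment
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return sm_name(major, minor)


# Compute capability 12.1 and 10.0 formatted as architecture names.
print(sm_name(12, 1), sm_name(10, 0))
```

Calling `local_sm_name()` on the device itself reports whatever the driver exposes.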

Here are my benchmarks for gpt-oss-120b on a single Spark. You will get much better performance from sglang:spark, though, because in vLLM mxfp4 still does not use the FP4 features of Blackwell on Spark, even when built from the main branch.

Also, while the NVIDIA Docker image finally works well, it still lags behind mainline vLLM, so if you want a newer version, like 0.12.0+, you need to build from source.

Or use one of the community builds here. Here is mine: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  35.74
Total input tokens:                      1364
Total generated tokens:                  2677
Request throughput (req/s):              0.28
Output token throughput (tok/s):         74.90
Peak output token throughput (tok/s):    114.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          113.06
---------------Time to First Token----------------
Mean TTFT (ms):                          676.76
Median TTFT (ms):                        739.00
P99 TTFT (ms):                           740.48
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          74.12
Median TPOT (ms):                        75.95
P99 TPOT (ms):                           108.08
---------------Inter-token Latency----------------
Mean ITL (ms):                           55.82
Median ITL (ms):                         49.50
P99 ITL (ms):                            116.22
==================================================

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  3.41
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.29
Output token throughput (tok/s):         34.94
Peak output token throughput (tok/s):    36.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          38.46
---------------Time to First Token----------------
Mean TTFT (ms):                          102.94
Median TTFT (ms):                        102.94
P99 TTFT (ms):                           102.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          27.99
Median TPOT (ms):                        27.99
P99 TPOT (ms):                           27.99
---------------Inter-token Latency----------------
Mean ITL (ms):                           27.99
Median ITL (ms):                         27.90
P99 ITL (ms):                            29.84
==================================================

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Benchmark duration (s):                  123.85
Total input tokens:                      22946
Total generated tokens:                  21691
Request throughput (req/s):              0.81
Output token throughput (tok/s):         175.14
Peak output token throughput (tok/s):    375.00
Peak concurrent requests:                100.00
Total Token throughput (tok/s):          360.41
---------------Time to First Token----------------
Mean TTFT (ms):                          4889.59
Median TTFT (ms):                        4886.37
P99 TTFT (ms):                           8781.95
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          306.41
Median TPOT (ms):                        259.15
P99 TPOT (ms):                           798.04
---------------Inter-token Latency----------------
Mean ITL (ms):                           218.97
Median ITL (ms):                         220.40
P99 ITL (ms):                            796.03
==================================================

SGLang for comparison:

Bench (10 prompts):

============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  33.34     
Total input tokens:                      1364      
Total generated tokens:                  2677      
Request throughput (req/s):              0.30      
Output token throughput (tok/s):         80.29     
Peak output token throughput (tok/s):    125.00    
Peak concurrent requests:                10.00     
Total Token throughput (tok/s):          121.21    
---------------Time to First Token----------------
Mean TTFT (ms):                          118.81    
Median TTFT (ms):                        119.06    
P99 TTFT (ms):                           129.64    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          62.59     
Median TPOT (ms):                        67.01     
P99 TPOT (ms):                           73.63     
---------------Inter-token Latency----------------
Mean ITL (ms):                           52.37     
Median ITL (ms):                         47.58     
P99 ITL (ms):                            84.35     
==================================================

Bench (1 prompt):

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Benchmark duration (s):                  2.27      
Total input tokens:                      12        
Total generated tokens:                  119       
Request throughput (req/s):              0.44      
Output token throughput (tok/s):         52.37     
Peak output token throughput (tok/s):    53.00     
Peak concurrent requests:                1.00      
Total Token throughput (tok/s):          57.65     
---------------Time to First Token----------------
Mean TTFT (ms):                          49.87     
Median TTFT (ms):                        49.87     
P99 TTFT (ms):                           49.87     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.83     
Median TPOT (ms):                        18.83     
P99 TPOT (ms):                           18.83     
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.83     
Median ITL (ms):                         18.82     
P99 ITL (ms):                            21.51     
==================================================
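To put the single-request runs side by side, here is a small sketch that uses only the numbers printed above. The helper `decode_tok_s` is just an illustrative name. It converts time per output token (TPOT) into a decode rate.

```python
# Single-request numbers copied from the vLLM and SGLang outputs above.
vllm = {"output_tok_s": 34.94, "tpot_ms": 27.99, "ttft_ms": 102.94}
sglang = {"output_tok_s": 52.37, "tpot_ms": 18.83, "ttft_ms": 49.87}


def decode_tok_s(tpot_ms: float) -> float:
    """Steady-state decode rate implied by a mean time per output token."""
    return 1000.0 / tpot_ms


# Ratio of end-to-end output throughput (includes the time to first token).
speedup = sglang["output_tok_s"] / vllm["output_tok_s"]

print(f"vLLM decode rate:   {decode_tok_s(vllm['tpot_ms']):.1f} tok/s")
print(f"SGLang decode rate: {decode_tok_s(sglang['tpot_ms']):.1f} tok/s")
print(f"SGLang / vLLM output throughput: {speedup:.2f}x")
```

This is a single 119-token request per engine, so treat the ratio as indicative only.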

Sorry if this is derailing the thread, but I attempted to replicate this benchmarking yesterday and I very consistently get 35 TPS for a single thread and 60 TPS with max-concurrency 2.

How do you do your testing?

This is my docker-compose.yml:

services:

  vllm:
    image: spark-vllm:eugr
    restart: unless-stopped
    ports:
      - "8000:8000"

    runtime: nvidia
    ipc: host
    ulimits:
      memlock: -1
      stack: 67108864

    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./.cache/vllm:/root/.cache/vllm
      - ./.cache/torch:/root/.cache/torch
      - ./.cache/triton:/root/.cache/triton
      - ./chat_templates/gpt-oss-120b.jinja:/workspace/chat_templates/gpt-oss-120b.jinja:ro
      - ./bench:/bench

    command: >
      vllm serve openai/gpt-oss-120b
        --host 0.0.0.0
        --port 8000
        --served-model-name gpt-oss-120b
        --override-generation-config '{"temperature":1.0,"top_p":1.0,"top_k":0}'
        --chat-template /workspace/chat_templates/gpt-oss-120b.jinja
        --enable-auto-tool-choice
        --tool-call-parser=openai
        --reasoning-parser=openai_gptoss
        --gpu-memory-utilization 0.70
        --max-model-len 131072
        --max-num-seqs 2
        --async-scheduling
        --max-num-batched-tokens 8192
        --enable-prefix-caching
        --load-format fastsafetensors

This is my script/bench.sh:

#!/usr/bin/env bash

set -euo pipefail

BASE_URL="${BASE_URL:-http://127.0.0.1:8000}"
MODEL="${MODEL:-gpt-oss-120b}"                  # served name on your server
TOKENIZER_MODEL="${TOKENIZER_MODEL:-openai/gpt-oss-120b}"  # HF id for tokenizer
DATASET_PATH="${DATASET_PATH:-/bench/ShareGPT_V3_unfiltered_cleaned_split.json}"
OUTPUT_LEN="${OUTPUT_LEN:-1024}"
NUM_PROMPTS_LIST="${NUM_PROMPTS_LIST:-"1 10"}"

# Prefer container id from compose; fallback to the old hardcoded name
CONTAINER_ID="${CONTAINER_ID:-$(docker compose ps -q vllm 2>/dev/null || true)}"
CONTAINER_ID="${CONTAINER_ID:-spark-eugr-vllm-1}"

echo "[warmup] tiny request"
curl -fsS "$BASE_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL}\",
    \"messages\": [{\"role\":\"user\",\"content\":\"Say hi.\"}],
    \"max_tokens\": 16,
    \"temperature\": 0
  }" >/dev/null

for NUM_PROMPTS in $NUM_PROMPTS_LIST; do
  echo
  echo "================ BENCH: model=${MODEL} num_prompts=${NUM_PROMPTS} output_len=${OUTPUT_LEN} ================"
  echo
  echo

  # 2>/dev/null discards stderr, which is where tqdm progress bars usually go
  docker exec -it "$CONTAINER_ID" bash -lc "
    vllm bench serve \
      --backend openai-chat \
      --base-url ${BASE_URL} \
      --endpoint /v1/chat/completions \
      --model ${TOKENIZER_MODEL} \
      --served-model-name ${MODEL} \
      --temperature 1.0 \
      --top-p 1.0 \
      --top-k 0 \
      --dataset-name sharegpt \
      --dataset-path ${DATASET_PATH} \
      --num-prompts ${NUM_PROMPTS} \
      --sharegpt-output-len ${OUTPUT_LEN} \
      --max-concurrency 2
  " 2>/dev/null
done

My output looks something like this:


spark-bcc2:~/projects/ai/spark-eugr$ bin/bench.sh
[warmup] tiny request
================ BENCH: model=gpt-oss-120b num_prompts=1 output_len=1024 ================

(...)

                            Output tokens per second
  35 +----------------------------------------------------------------------+
     |   *   ***  ***  *****  *****  ***  *****  *******  *******           |
     |* *                                                       *           |
  30 | *                                                        *           |
     |                                                           *          |
  25 |                                                           *          |
     |                                                           *          |
     |                                                           *          |
  20 |                                                            *         |
     |                                                            *         |
     |                                                            *         |
  15 |                                                            *         |
     |                                                             *        |
     |                                                             *        |
  10 |                                                             *        |
     |                                                             *        |
   5 |                                                              *       |
     |                                                              *       |
     |                                                               *      |
   0 +----------------------------------------------------------------------+
     0           5           10          15         20          25          30

                          Concurrent requests per second
    1 +---------------------------------------------------------------------+
      |                                                            *        |
      |                                                            *        |
      |                                                            *        |
  0.8 |                                                            *        |
      |                                                             *       |
      |                                                             *       |
      |                                                             *       |
  0.6 |                                                             *       |
      |                                                             *       |
      |                                                             *       |
  0.4 |                                                             *       |
      |                                                             *       |
      |                                                             *       |
      |                                                             *       |
  0.2 |                                                              *      |
      |                                                              *      |
      |                                                              *      |
      |                                                              *      |
    0 +---------------------------------------------------------------------+
      0           5          10          15          20         25          30

============ Serving Benchmark Result ============
Successful requests:                     1         
Failed requests:                         0         
Maximum request concurrency:             2         
Benchmark duration (s):                  26.28     
Total input tokens:                      12        
Total generated tokens:                  903       
Request throughput (req/s):              0.04      
Output token throughput (tok/s):         34.37     
Peak output token throughput (tok/s):    35.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          34.82     
---------------Time to First Token----------------
Mean TTFT (ms):                          35.46     
Median TTFT (ms):                        35.46     
P99 TTFT (ms):                           35.46     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.09     
Median TPOT (ms):                        29.09     
P99 TPOT (ms):                           29.09     
---------------Inter-token Latency----------------
Mean ITL (ms):                           29.29     
Median ITL (ms):                         29.07     
P99 ITL (ms):                            30.21     
==================================================


================ BENCH: model=gpt-oss-120b num_prompts=10 output_len=1024 ================

(...)

                            Output tokens per second
  70 +----------------------------------------------------------------------+
     |                                                                      |
     |                                                        *             |
  60 |*************************************************** *********         |
     |**            ** * ** *       ** * ** *      **** ***    ** *         |
  50 |               *    *         *    ** *       **   *     ** *         |
     |               *    *         *    *          *          *  *         |
     |                                   *          *          *  *         |
  40 |                                                         *   *        |
     |                                                             *        |
     |                                                             *        |
  30 |                                                             *        |
     |                                                             *        |
     |                                                             *        |
  20 |                                                             *        |
     |                                                              *       |
  10 |                                                              *       |
     |                                                              *       |
     |                                                              *       |
   0 +----------------------------------------------------------------------+
     0        20       40       60       80     100      120      140      160

                          Concurrent requests per second
    4 +---------------------------------------------------------------------+
      |              *                                                      |
  3.5 |              *                                                      |
      |              *                                                      |
      |              *                                                      |
    3 |              **    *         *    *         *    *     *            |
      |              **    *         *    *         *    *     *            |
  2.5 |              **    *         *    *         *    **    *            |
      |              **   **        **   **         *    **   **            |
    2 |************************************************************         |
      |                                                           *         |
      |                                                           *         |
  1.5 |                                                            *        |
      |                                                            *        |
    1 |                                                            *        |
      |                                                            *        |
      |                                                            *        |
  0.5 |                                                             *       |
      |                                                             *       |
    0 +---------------------------------------------------------------------+
      0        20       40      60       80      100      120     140      160

============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             2         
Benchmark duration (s):                  140.95    
Total input tokens:                      1933      
Total generated tokens:                  8116      
Request throughput (req/s):              0.07      
Output token throughput (tok/s):         57.58     
Peak output token throughput (tok/s):    62.00     
Peak concurrent requests:                4.00      
Total token throughput (tok/s):          71.29     
---------------Time to First Token----------------
Mean TTFT (ms):                          339.27    
Median TTFT (ms):                        286.81    
P99 TTFT (ms):                           683.29    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          34.03     
Median TPOT (ms):                        34.20     
P99 TPOT (ms):                           34.49     
---------------Inter-token Latency----------------
Mean ITL (ms):                           34.35     
Median ITL (ms):                         34.12     
P99 ITL (ms):                            35.83     
==================================================

Is my testing methodology flawed? Do you have a setup that makes yours quicker?

Your results are nearly identical to mine for vLLM. The ones that are quicker are from SGLang.
Or, if you are looking at the 10-request run: I don’t limit max concurrency when running benches. Your bench is limited to 2, so when you run 10 requests, it still runs only two simultaneously.
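To illustrate what a concurrency cap does, here is a toy sketch with fake requests. It is not the actual `vllm bench serve` code, and `run_with_limit` is a made-up name. A semaphore lets at most `max_concurrency` requests be in flight, no matter how many prompts are queued.

```python
import asyncio


async def run_with_limit(num_requests: int, max_concurrency: int, work_s: float = 0.01) -> int:
    """Run fake requests, at most max_concurrency at a time.

    Returns the highest number of requests observed in flight.
    """
    sem = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def one_request() -> None:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(work_s)  # stand-in for a real HTTP call
            in_flight -= 1

    await asyncio.gather(*(one_request() for _ in range(num_requests)))
    return peak


peak_limited = asyncio.run(run_with_limit(10, 2))     # like --max-concurrency 2
peak_unlimited = asyncio.run(run_with_limit(10, 10))  # no effective cap
print(peak_limited, peak_unlimited)
```

With the cap at 2, the server only ever batches two sequences, so aggregate tok/s stays far below an uncapped 10-request run.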

Mine had:

Output token throughput (tok/s): 34.37
Peak output token throughput (tok/s): 35.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 34.82
---------------Time to First Token----------------
Mean TTFT (ms): 35.46
Median TTFT (ms): 35.46
P99 TTFT (ms): 35.46

Yours had:

Output token throughput (tok/s): 34.94
Peak output token throughput (tok/s): 36.00
Peak concurrent requests: 1.00
Total Token throughput (tok/s): 38.46
---------------Time to First Token----------------
Mean TTFT (ms): 102.94
Median TTFT (ms): 102.94
P99 TTFT (ms): 102.94

My TTFT is quite dramatically better, but you have meaningfully more TPS.

Do you suspect it’s just testing variance?

Yes, I think that’s just testing variance. I would ignore total throughput numbers, as they count prefill and generation together.
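A quick worked example with the token counts and durations from the two single-request runs in this thread (the function name `throughputs` is only for illustration). The same 12 prompt tokens inflate the total figure much more in the short 119-token run than in the 903-token run.

```python
def throughputs(input_tokens: int, output_tokens: int, duration_s: float):
    """Return (output tok/s, total tok/s) for one benchmark run."""
    output_tps = output_tokens / duration_s
    total_tps = (input_tokens + output_tokens) / duration_s
    return output_tps, total_tps


# (input tokens, generated tokens, benchmark duration in seconds), from the outputs above
runs = {
    "119-token run": (12, 119, 3.41),
    "903-token run": (12, 903, 26.28),
}

results = {name: throughputs(*args) for name, args in runs.items()}
for name, (out_tps, total_tps) in results.items():
    # Small differences from the printed reports come from the rounded duration.
    print(f"{name}: output {out_tps:.2f} tok/s, total {total_tps:.2f} tok/s")
```

The output-only numbers land within about 0.5 tok/s of each other, while the totals differ by several tok/s purely because of how much generation each run amortizes the prompt over.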

Also, TTFT depends on whether your requests hit the cache or not. For instance, if you have your container running pretty much all the time (as your docker-compose suggests), and ran benches on this dataset before, it would still be in the cache, so TTFT would be very quick, but it won’t affect token generation much.
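Here is a toy model of why a warm prefix cache shrinks TTFT. This is a simplified sketch, not vLLM's actual implementation: the block size of 16 and the name `prefill_tokens` are assumptions for illustration. Only tokens not covered by cached full blocks need to be prefilled again.

```python
BLOCK = 16  # assumed block size for this toy model


def prefill_tokens(prompt: list, cache: set) -> int:
    """Return how many prompt tokens still need prefill, then cache the prompt's full blocks."""
    # One key per full block, each key covering the whole prefix up to that block.
    keys = [tuple(prompt[:end]) for end in range(BLOCK, len(prompt) + 1, BLOCK)]
    hit_blocks = 0
    for key in keys:
        if key not in cache:
            break
        hit_blocks += 1
    cache.update(keys)
    return len(prompt) - hit_blocks * BLOCK


cache = set()
prompt = list(range(100))               # stand-in for 100 prompt token ids
first = prefill_tokens(prompt, cache)   # cold cache: every token is prefilled
second = prefill_tokens(prompt, cache)  # warm cache: only the partial last block
print(first, second)
```

Decode speed is unaffected in this picture, which matches the observation that TTFT changes while tokens per second does not.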

I’m not so sure about the TTFT: isn’t the benchmark taking random samples from ShareGPT_V3_unfiltered_cleaned_split.json?

Yes, but if you run one bench for 100, for instance, you’re very likely to get cache hits on subsequent runs, even for a single request. Your TTFT looks too good to be true though.

I’ll try to launch with your parameters. I guess your TTFT is due to the larger max-num-batched-tokens (8192 vs. the default of 2048).

OK, it looks like I was right about prefix caching. The thing is that, before today, the bench was actually picking requests sequentially from the dataset unless the dataset was randomly generated (“random”). I thought that was the case, so I tested it before responding to you, and it didn’t do that, so I assumed I had been wrong.

But then I stumbled upon this commit from earlier today. Since I built my container an hour ago, it was already incorporated into my bench: [Bugfix] Fix prefix_repetition routing in bench throughput (#29663) · vllm-project/vllm@676db55 · GitHub

TL;DR: if you rebuild the container and run the most recent version of vllm bench serve, you will get TTFT numbers closer to mine.

If I restart the Docker container before running the bench, I get the same TTFT.

I am actively editing and rebuilding your Dockerfile right now, so I believe I have the latest version, if that is your default?

Maybe some of the volumes I mount in the docker-compose.yml are keeping vLLM in a ‘steady-state’ instead of the ‘cold-state’ for the bench?

P.S. Are you interested in PRs for touch-ups?

My latest has nightly pytorch and flashinfer builds that I haven’t pushed to GitHub yet, but otherwise the main branch is up-to-date. I believe I pushed the last update yesterday.

I’m open to PRs, of course. I will need to make some changes to my setup, as the GitHub repo is a replica of my private GitLab server repo with one-way sync, but I can work around it.

Would your repo also be a good choice for running vLLM on just a single DGX Spark?
Or, put another way: are you aware of this solution, and do you know how it differs from your setup? Avarok/vllm-dgx-spark · Hugging Face

Yes, sure, it works just as well on a single Spark.

As for the linked repository, it seems to be built upon an early version of my docker build, based on the inclusion of the CMakefiles.txt patch that has not been needed for a while now. It also removes Triton for some reason (the kernels were broken at some point, but Triton itself wasn’t) and is missing some other useful stuff.

Awesome, thanks. Will try your repo tomorrow :-)