Has anyone successfully run Qwen2.5-27B on a DGX Spark and achieved decent inference speed?
I’m currently getting only about 4 tokens per second with both llama.cpp (BF16) and vLLM, and I’m looking for ways to improve performance.
My current setup:
- Hardware: DGX Spark
- Model: Qwen2.5-27B
- Tried: llama.cpp (BF16, unquantized) and vLLM
- Current speed: ~4 tokens/second
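For context, my back-of-envelope check: single-stream decode is memory-bandwidth-bound, since every generated token has to stream all weight bytes through memory. A rough sketch (assuming the commonly quoted ~273 GB/s unified-memory bandwidth for DGX Spark; treat that figure as approximate):

```python
# Rough decode-speed ceiling = memory bandwidth / weight bytes per token.
params = 27e9                 # 27B parameters
bytes_per_param = 2           # BF16 weights
weight_bytes = params * bytes_per_param   # ~54 GB read per generated token
bandwidth = 273e9             # bytes/s, assumed DGX Spark spec figure

tokens_per_sec = bandwidth / weight_bytes
print(f"{tokens_per_sec:.1f} tok/s upper bound")  # → 5.1 tok/s upper bound
```

If that assumption holds, ~4 tok/s for a BF16 27B model is already close to the theoretical ceiling, which is why I suspect only quantization or batching can move the needle.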
Questions:
- Is this expected performance for this model on this hardware?
- What optimizations could I try to increase token generation speed?
- Are there specific configuration settings (batch size, tensor parallelism, quantization methods) that work well for this combination?
Any insights or recommendations would be greatly appreciated!
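On the quantization question, one thing I'm considering trying: vLLM can quantize a BF16 checkpoint to FP8 on the fly with `--quantization fp8`, which roughly halves the weight bytes read per decoded token. A minimal sketch of the serve command (model name taken from my recipe below; flag values are just my current settings, not recommendations):

```shell
# Sketch: serve with on-the-fly FP8 weight quantization (vLLM --quantization fp8).
vllm serve Qwen/Qwen3.5-27B \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.81
```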
Below is my current setup (recipe using spark-vllm-docker):
description: vLLM serving Qwen3.5-27B on DGX Spark 128GB
model: Qwen/Qwen3.5-27B
container: vllm-node
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.81
  max_model_len: 262144
env: {}
command: |
  vllm serve Qwen/Qwen3.5-27B \
    --enable-auto-tool-choice \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --language-model-only \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --host {host} \
    --port {port} \
    --tensor-parallel-size {tensor_parallel} \
    --load-format fastsafetensors \
    --enable-prefix-caching \
    --attention-backend flashinfer \
    --max-num-seqs 32 \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens 32768 \
    --trust-remote-code
recipe_version: '1'
name: Qwen3.5-27B
cluster_only: false
solo_only: true
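In case it helps anyone reproduce my numbers: this is roughly how I measure tokens/second, using vLLM's OpenAI-compatible `/v1/completions` endpoint and the `usage.completion_tokens` field from the response (a sketch that assumes the recipe above is serving on localhost:8000; the prompt and token counts are arbitrary):

```python
import time

import requests  # pip install requests


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput for a generation of n_tokens that took elapsed_s seconds."""
    return n_tokens / elapsed_s


def measure(prompt: str, max_tokens: int = 256) -> float:
    """Time one non-streaming completion request and return tok/s."""
    start = time.time()
    r = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "Qwen/Qwen3.5-27B",
            "prompt": prompt,
            "max_tokens": max_tokens,
        },
        timeout=600,
    )
    r.raise_for_status()
    # vLLM's OpenAI-compatible server reports generated-token counts in `usage`.
    n_tokens = r.json()["usage"]["completion_tokens"]
    return tokens_per_second(n_tokens, time.time() - start)


# Example (with the recipe's server running):
#   print(f"{measure('Explain KV-cache paging in one paragraph.'):.1f} tok/s")
```

Note this folds prompt-processing time into the total, so for long prompts it understates pure decode speed; a streaming client that times only the generated chunks would separate the two.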