Run Qwen3.5-27B with spark-vllm-docker

Has anyone successfully run Qwen3.5-27B on a DGX Spark and achieved decent inference speed?

I’m currently getting only about 4 tokens per second with both llama.cpp (BF16) and vLLM, and I’m looking for ways to improve performance.

My current setup:

  • Hardware: DGX Spark

  • Model: Qwen3.5-27B

  • Tried: llama.cpp (BF16) and vLLM

  • Current speed: ~4 tokens/second

Questions:

  1. Is this expected performance for this model on this hardware?

  2. What optimizations could I try to increase the token generation speed?

  3. Are there specific configuration settings (batch size, tensor parallelism, quantization methods) that work well for this combination?

Any insights or recommendations would be greatly appreciated!

Below is my current setup (a recipe for spark-vllm-docker):
```yaml
description: vLLM serving Qwen3.5-27B on DGX Spark 128GB
model: Qwen/Qwen3.5-27B
container: vllm-node
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.81
  max_model_len: 262144
env: {}
command: |
  vllm serve Qwen/Qwen3.5-27B \
    --enable-auto-tool-choice \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --language-model-only \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --host {host} \
    --port {port} \
    --tensor-parallel-size {tensor_parallel} \
    --load-format fastsafetensors \
    --enable-prefix-caching \
    --attention-backend flashinfer \
    --max-num-seqs 32 \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens 32768 \
    --trust-remote-code
recipe_version: '1'
name: Qwen3.5-27B
cluster_only: false
solo_only: true
```
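One thing worth sanity-checking in this recipe is how much memory a 262144-token context actually needs for KV cache, since that competes with the 54 GB of BF16 weights inside the `gpu_memory_utilization: 0.81` budget. The sketch below is a rough estimate only; the layer count, KV-head count, and head dimension are placeholder assumptions, not the real Qwen3.5-27B architecture, so substitute the values from the model's `config.json` before trusting the numbers.

```python
# Rough KV-cache sizing for the long-context vLLM recipe above.
# NOTE: num_layers, num_kv_heads, and head_dim are illustrative
# placeholders -- read the real values from the model's config.json.
num_layers = 64          # assumed
num_kv_heads = 8         # assumed (GQA)
head_dim = 128           # assumed
kv_dtype_bytes = 1       # fp8, per --kv-cache-dtype fp8
max_model_len = 262144   # from the recipe

# Per token: one K and one V vector for every layer and KV head.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes
cache_gib = bytes_per_token * max_model_len / 2**30

print(f"{bytes_per_token} bytes/token, ~{cache_gib:.1f} GiB "
      f"for a single full-length sequence")
```

Under these assumptions a single max-length sequence needs on the order of 32 GiB of KV cache even at fp8. Lowering `max_model_len` frees memory for more concurrent sequences, though it will not change single-stream decode speed, which is bandwidth-bound.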

This is a dense model that taxes the Spark’s Achilles heel: memory bandwidth limitations.

That’s about par for the course for a dense model that size. You can try llama.cpp to see if it runs a bit faster. A MoE model like Qwen 3.5 35b-a3b will give you much better performance.
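The bandwidth argument can be made concrete with a back-of-the-envelope roofline: during single-stream decode, every generated token has to stream all active weights from memory once, so tokens/s is capped at roughly bandwidth ÷ active-weight bytes. The ~273 GB/s figure below is the commonly quoted LPDDR5X bandwidth for the DGX Spark; treat it, and the MoE active-parameter count, as approximations.

```python
# Back-of-the-envelope decode ceiling: bandwidth-bound generation
# streams all active weights from memory once per token.
bandwidth_gb_s = 273.0  # approximate DGX Spark LPDDR5X spec

def decode_ceiling(active_params_billion, bytes_per_param):
    """Upper bound on single-stream tokens/s for a bandwidth-bound model."""
    weight_gb = active_params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb

dense = decode_ceiling(27, 2)  # 27B dense in BF16 -> 54 GB of weights
moe = decode_ceiling(3, 2)     # MoE with ~3B active params in BF16

print(f"dense 27B BF16 ceiling: ~{dense:.1f} tok/s")
print(f"MoE ~3B-active ceiling: ~{moe:.1f} tok/s")
```

The measured ~4 tok/s sits just under the ~5 tok/s dense-BF16 ceiling, which is why serving flags barely move the needle here: only shrinking the bytes streamed per token helps, either by quantizing the weights (e.g. a 4-bit format cuts bytes per parameter to ~0.5) or by switching to a low-active-parameter MoE model.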