Has anyone successfully run Qwen2.5-27B on a DGX Spark and achieved decent inference speed?
I’m currently getting only about 4 tokens per second with both llama.cpp (BF16) and vLLM, and I’m looking for ways to improve performance.
My current setup:
- Hardware: DGX Spark
- Model: Qwen2.5-27B
- Tried: llama.cpp (BF16, unquantized) and vLLM
- Current speed: ~4 tokens/second
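For context, my back-of-envelope check: single-stream decode is memory-bandwidth-bound, since every generated token has to stream all weight bytes through memory. A rough sketch (assuming the commonly quoted ~273 GB/s unified-memory bandwidth for DGX Spark; treat that figure as approximate):

```python
# Rough decode-speed ceiling = memory bandwidth / weight bytes per token.
params = 27e9                 # 27B parameters
bytes_per_param = 2           # BF16 weights
weight_bytes = params * bytes_per_param   # ~54 GB read per generated token
bandwidth = 273e9             # bytes/s, assumed DGX Spark spec figure

tokens_per_sec = bandwidth / weight_bytes
print(f"{tokens_per_sec:.1f} tok/s upper bound")  # → 5.1 tok/s upper bound
```

If that assumption holds, ~4 tok/s for a BF16 27B model is already close to the theoretical ceiling, which is why I suspect only quantization or batching can move the needle.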
Questions:
- Is this expected performance for this model on this hardware?
- What optimizations could I try to increase token generation speed?
- Are there specific configuration settings (batch size, tensor parallelism, quantization methods) that work well for this combination?
Any insights or recommendations would be greatly appreciated!
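On the quantization question, one thing I'm considering trying: vLLM can quantize a BF16 checkpoint to FP8 on the fly with `--quantization fp8`, which roughly halves the weight bytes read per decoded token. A minimal sketch of the serve command (model name taken from my recipe below; flag values are just my current settings, not recommendations):

```shell
# Sketch: serve with on-the-fly FP8 weight quantization (vLLM --quantization fp8).
vllm serve Qwen/Qwen3.5-27B \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.81
```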
Below is my current setup (recipe using spark-vllm-docker):
description: vLLM serving Qwen3.5-27B on DGX Spark 128GB
model: Qwen/Qwen3.5-27B
container: vllm-node
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.81
  max_model_len: 262144
env: {}
command: |
  vllm serve Qwen/Qwen3.5-27B \
    --enable-auto-tool-choice \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --language-model-only \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --host {host} \
    --port {port} \
    --tensor-parallel-size {tensor_parallel} \
    --load-format fastsafetensors \
    --enable-prefix-caching \
    --attention-backend flashinfer \
    --max-num-seqs 32 \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens 32768 \
    --trust-remote-code
recipe_version: '1'
name: Qwen3.5-27B
cluster_only: false
solo_only: true
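In case it helps anyone reproduce my numbers: this is roughly how I measure tokens/second, using vLLM's OpenAI-compatible `/v1/completions` endpoint and the `usage.completion_tokens` field from the response (a sketch that assumes the recipe above is serving on localhost:8000; the prompt and token counts are arbitrary):

```python
import time

import requests  # pip install requests


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput for a generation of n_tokens that took elapsed_s seconds."""
    return n_tokens / elapsed_s


def measure(prompt: str, max_tokens: int = 256) -> float:
    """Time one non-streaming completion request and return tok/s."""
    start = time.time()
    r = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "Qwen/Qwen3.5-27B",
            "prompt": prompt,
            "max_tokens": max_tokens,
        },
        timeout=600,
    )
    r.raise_for_status()
    # vLLM's OpenAI-compatible server reports generated-token counts in `usage`.
    n_tokens = r.json()["usage"]["completion_tokens"]
    return tokens_per_second(n_tokens, time.time() - start)


# Example (with the recipe's server running):
#   print(f"{measure('Explain KV-cache paging in one paragraph.'):.1f} tok/s")
```

Note this folds prompt-processing time into the total, so for long prompts it understates pure decode speed; a streaming client that times only the generated chunks would separate the two.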