Investigating Performance Issue/Bottleneck

Hello, all! I recently received my DGX Spark (Founder’s Edition) and am in desperate need of advice.

I’m getting much worse performance than I expected from a custom inference benchmark script that uses transformers in Python. I’m consistently getting between 14 and 20 tokens/s running a Llama 3B model and around 10 tokens/s for Llama 8B. Both are running with bf16 precision using FlashAttention-2 inside the NGC PyTorch Docker container. I’ve never seen power draw exceed ~25 W (as reported by nvidia-smi). The DGX Dashboard reports ~95% GPU utilization during these benchmark tests.
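
For reference, here is a stripped-down sketch of what the script does (not my exact code, but the same shape: bf16, FlashAttention-2, batch size 1, greedy decode; the model ID is just an example):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example; the 3B run is identical apart from the model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)

inputs = tok("Write a short story about a GPU.", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")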

My first question is: is this normal? Second, is there some more standard way I can compare the performance of my machine to others with the same hardware? If so, what’s a more appropriate benchmark? My concern is some kind of driver/firmware issue or worse, a hardware defect, so I want to rule these things out.

My situation is complicated by the fact that I’ve blown my data budget for the month downloading models, updates, etc., so I won’t be able to download anything significant for the next 10 days. I can maybe manage ~3 GB worth of downloads max.

I purchased this machine to do high-level research (experimenting with training/fine-tuning workflows, agentic architectures, etc.). I am not a hardware guy at all, so I’m looking for advice from people with more experience there.

Will gladly post any logs, outputs, or anything else that would help. Thanks in advance!

1 Like

Please install and run Field Diagnostic, which is designed to validate your Spark’s hardware health. DM me the logs and we’ll confirm whether your hardware is healthy.

1 Like
  1. Don’t use Transformers to run models; it’s slow and unoptimized. Use vLLM, llama.cpp, or SGLang (see the quick vLLM sketch after this list).
  2. Don’t use BF16 models; there is no practical benefit. Use either FP8 (for smaller models) or AWQ 4-bit / GGUF 4-bit for larger ones. MoE models are a sweet spot for the Spark.
  3. For vLLM, use our community Docker build: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks
  4. For benchmarks, use GitHub - eugr/llama-benchy: llama-benchy - llama-bench style benchmarking tool for all backends
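
For a quick apples-to-apples check against your transformers script, you can also drive vLLM from Python directly on a model you already have cached locally. A minimal sketch (model name is just an example; bf16 is kept only to match your current runs):

from vllm import LLM, SamplingParams

# Same model and precision as the transformers run, purely for comparison;
# normally you would pick an FP8 / AWQ / NVFP4 checkpoint instead.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["Explain what memory bandwidth means for LLM decoding."], params)
print(outputs[0].outputs[0].text)

Even in bf16 you should see a clear jump over transformers; with a quantized checkpoint the gap gets much bigger.
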
1 Like

Also, the Llama models you’re testing are quite old.
Currently, for a single Spark the best models are:

  • GPT-OSS (both 20B and 120B)
  • GLM 4.5 Air
  • GLM 4.6V
  • GLM 4.7 Flash (it has some issues, but generally works with a mod)
  • Qwen3-30B and its variants (Qwen3-32B is a dense model and will be slow)

Some benchmarks (outdated, I will update them soon with newer models) to give you an idea of what performance to expect:

Model name                                  | Cluster (t/s) | Single (t/s) | Comment
Qwen/Qwen3-VL-32B-Instruct-FP8              | 12.00         | 7.00         |
cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit      | 21.00         | 12.00        |
GPT-OSS-120B                                | 55.00         | 36.00        | SGLang gives 75/53
RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4  | 21.00         | N/A          |
QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ   | 26.00         | N/A          |
Qwen/Qwen3-VL-30B-A3B-Instruct-FP8          | 65.00         | 52.00        |
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ     | 97.00         | 82.00        |
RedHatAI/Qwen3-30B-A3B-NVFP4                | 75.00         | 64.00        |
QuantTrio/MiniMax-M2-AWQ                    | 41.00         | N/A          |
QuantTrio/GLM-4.6-AWQ                       | 17.00         | N/A          |
zai-org/GLM-4.6V-FP8                        | 24.00         | N/A          |
3 Likes

Thanks for the advice! I’ll give those a shot and see about getting my hands on the models you recommended.

These numbers are no longer correct ;).

Thanks for the opportunity to plug my effort!

3 Likes

Hello,

Just FYI, you are trying to run dense models on a single-board computer with LPDDR5X memory (273 GB/s of bandwidth). The decode phase of transformer-based autoregressive LLMs is memory-bound. I highly recommend reading this paper:
https://www.arxiv.org/pdf/2601.05047
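
To put rough numbers on the memory-bound point: in batch-1 decode, every generated token has to stream essentially all of the weights from memory, so bandwidth divided by weight size gives a hard ceiling (a back-of-the-envelope estimate, ignoring KV-cache traffic and kernel overhead):

# Rough decode ceiling: tokens/s <= memory bandwidth / bytes of weights read per token
bandwidth_gb_s = 273             # DGX Spark LPDDR5X bandwidth

for params_b in (3, 8):          # the Llama 3B and 8B models mentioned above
    weights_gb = params_b * 2    # bf16 = 2 bytes per parameter
    print(f"{params_b}B @ bf16: ceiling ~ {bandwidth_gb_s / weights_gb:.0f} tokens/s")

# Prints roughly: 3B ~ 46 tokens/s, 8B ~ 17 tokens/s

So ~10 tokens/s on an 8B bf16 model through transformers is already in the expected range for this hardware, and the 3B run sitting well below its ceiling is consistent with framework overhead rather than a hardware fault. Quantization (or an MoE model with fewer active parameters) is what actually raises the ceiling.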

You can use techniques such as quantization and speculative decoding (e.g., EAGLE3), but these may impact accuracy; speculative decoding theoretically does not affect accuracy, yet in practice it can influence the generated results. I agree with @eugr that MoE-based hybrid models (Transformer + Mamba) are a good choice for DGX Spark and NVIDIA Jetson Thor. I personally recommend NVFP4 models such as nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4, the model served in the command below.

For interactive chat applications with low-latency requirements (batch sizes of 1–16), I recommend using SGLang (GitHub - sgl-project/sglang: SGLang is a high-performance serving framework for large language models and multimodal models). If you need to handle moderate workloads with high throughput, vLLM (GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs) is a better choice.

Here is the command to run it on DGX Spark using vLLM:

sudo docker run -it --rm \
  --pull always \
  --runtime=nvidia \
  --network host \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  nvcr.io/nvidia/vllm:26.01-py3 \
  bash -c "wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py && \
  vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --kv-cache-dtype fp8"
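
Once the server is up, vLLM exposes an OpenAI-compatible API (port 8000 by default), so a quick sanity check from Python looks roughly like this (assuming the default port and the model name served above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
    messages=[{"role": "user", "content": "Say hello from the DGX Spark."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)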

Hope it helps!!

Great tip, however I’m seeing this (after the docker image downloads), using the same command line as above:

~/Development/_custom_docker_images/vllm_updated$ ./run_NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4.sh
26.01-py3: Pulling from nvidia/vllm
Digest: sha256:e497b1248ad3d916673a3003524c667640590a0c6d49f7f1c573102673d02792
Status: Image is up to date for nvcr.io/nvidia/vllm:26.01-py3
docker: Error response from daemon: unknown or invalid runtime name: nvidia

Run ‘docker run --help’ for more information

~/Development/_custom_docker_images/vllm_updated

@LuckyChap please follow the installation instructions for the NVIDIA Container Toolkit and set the nvidia runtime as the default in the Docker daemon configuration file.
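
For reference, the end state of /etc/docker/daemon.json is what this small sketch writes (run it with sudo and restart Docker afterwards; sudo nvidia-ctk runtime configure --runtime=docker from the toolkit performs the runtime registration for you):

import json
from pathlib import Path

# Register the nvidia runtime and make it the default for "docker run".
path = Path("/etc/docker/daemon.json")
cfg = json.loads(path.read_text()) if path.exists() else {}
cfg.setdefault("runtimes", {})["nvidia"] = {
    "path": "nvidia-container-runtime",
    "runtimeArgs": [],
}
cfg["default-runtime"] = "nvidia"
path.write_text(json.dumps(cfg, indent=2) + "\n")

# Then: sudo systemctl restart docker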

Hope it helps!!!

For anyone else who lands on this thread later:

NVIDIA has now updated a number of their official quants under the umbrella Hugging Face collection “Inference Optimized Checkpoints (with Model Optimizer)”.

The READMEs in several have been updated within the last week, some within the last 2 days. This particular quant now has Spark-specific instructions for using the official NGC vLLM container, and the officially recommended start command is similar but not identical to the one posted by shahizat above.

See “Use it with vLLM” and “Use it with TensorRT-LLM”

I’m pulling the new TensorRT-LLM container now, and would encourage some exploration and benchmarking of these and a couple of other NVFP4 models (e.g., Qwen3-Next) under both vLLM and TensorRT-LLM.

1 Like