Investigating Performance Issue/Bottleneck

Hello, all! I recently received my DGX Spark (Founder’s Edition) and am in desperate need of advice.

I’m getting much worse performance than I expected from a custom inference benchmark script that uses transformers in Python. I’m consistently getting between 14 and 20 tokens/s running a Llama 3B model and around 10 tokens/s for Llama 8B. Both are running with bf16 precision using FlashAttention-2 inside the NGC PyTorch Docker container. I’ve never seen power draw exceed ~25 W (as reported by nvidia-smi). The DGX Dashboard reports ~95% GPU utilization during these benchmark tests.
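
For reference, here is a stripped-down sketch of what the script does (not my exact code, but the same shape: bf16, FlashAttention-2, batch size 1, greedy decode; the model ID is just an example):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example; the 3B run is identical apart from the model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)

inputs = tok("Write a short story about a GPU.", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")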

My first question is: is this normal? Second, is there some more standard way I can compare the performance of my machine to others with the same hardware? If so, what’s a more appropriate benchmark? My concern is some kind of driver/firmware issue or worse, a hardware defect, so I want to rule these things out.

My situation is complicated by the fact that I’ve blown my data budget for the month downloading models, updates, etc., so I won’t be able to download anything significant for the next 10 days. I can maybe manage ~3 GB worth of downloads max.

I purchased this machine to do high-level research (experimenting with training/fine-tuning workflows, agentic architectures, etc.). I am not a hardware guy at all, so I’m looking for advice from people with more experience there.

Will gladly post any logs, outputs, or anything else that would help. Thanks in advance!

1 Like

Please install and run Field Diagnostic, which is designed to validate your Spark’s hardware health. DM me the logs and we’ll confirm whether your hardware is healthy.

1 Like
  1. Don’t use Transformers to run models; it’s slow and unoptimized. Use vLLM, llama.cpp, or SGLang (see the quick vLLM sketch after this list).
  2. Don’t use BF16 models; there is no practical benefit. Use either FP8 (for smaller models) or AWQ 4-bit / GGUF 4-bit for larger ones. MoE models are a sweet spot for the Spark.
  3. For vLLM, use our community Docker build: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks
  4. For benchmarks, use GitHub - eugr/llama-benchy: llama-benchy - llama-bench style benchmarking tool for all backends
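
For a quick apples-to-apples check against your transformers script, you can also drive vLLM from Python directly on a model you already have cached locally. A minimal sketch (model name is just an example; bf16 is kept only to match your current runs):

from vllm import LLM, SamplingParams

# Same model and precision as the transformers run, purely for comparison;
# normally you would pick an FP8 / AWQ / NVFP4 checkpoint instead.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["Explain what memory bandwidth means for LLM decoding."], params)
print(outputs[0].outputs[0].text)

Even in bf16 you should see a clear jump over transformers; with a quantized checkpoint the gap gets much bigger.
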
1 Like

Also, the Llama models you’re testing are quite old.
Currently, for a single Spark the best models are:

  • GPT-OSS (both 20B and 120B)
  • GLM 4.5 Air
  • GLM 4.6V
  • GLM 4.7 Flash (it has some issues, but generally works with a mod)
  • Qwen3-30B and its variants (Qwen3-32B is a dense model and will be slow)

Some benchmarks (outdated, I will update them soon with newer models) to give you an idea of what performance to expect:

Model name                                  | Cluster (t/s) | Single (t/s) | Comment
Qwen/Qwen3-VL-32B-Instruct-FP8              | 12.00         | 7.00         |
cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit      | 21.00         | 12.00        |
GPT-OSS-120B                                | 55.00         | 36.00        | SGLang gives 75/53
RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4  | 21.00         | N/A          |
QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ   | 26.00         | N/A          |
Qwen/Qwen3-VL-30B-A3B-Instruct-FP8          | 65.00         | 52.00        |
QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ     | 97.00         | 82.00        |
RedHatAI/Qwen3-30B-A3B-NVFP4                | 75.00         | 64.00        |
QuantTrio/MiniMax-M2-AWQ                    | 41.00         | N/A          |
QuantTrio/GLM-4.6-AWQ                       | 17.00         | N/A          |
zai-org/GLM-4.6V-FP8                        | 24.00         | N/A          |
3 Likes

Thanks for the advice! I’ll give those a shot and see about getting my hands on the models you recommended.

These numbers are no longer correct ;).

Thanks for the opportunity to plug my effort!

3 Likes

Hello,

Just FYI, you are trying to run dense models on a single-board computer with LPDDR5X memory (273 GB/s of bandwidth). The decode phase of transformer-based autoregressive LLMs is memory-bound. I highly recommend reading this paper:
https://www.arxiv.org/pdf/2601.05047
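
To put rough numbers on the memory-bound point: in batch-1 decode, every generated token has to stream essentially all of the weights from memory, so bandwidth divided by weight size gives a hard ceiling (a back-of-the-envelope estimate, ignoring KV-cache traffic and kernel overhead):

# Rough decode ceiling: tokens/s <= memory bandwidth / bytes of weights read per token
bandwidth_gb_s = 273             # DGX Spark LPDDR5X bandwidth

for params_b in (3, 8):          # the Llama 3B and 8B models mentioned above
    weights_gb = params_b * 2    # bf16 = 2 bytes per parameter
    print(f"{params_b}B @ bf16: ceiling ~ {bandwidth_gb_s / weights_gb:.0f} tokens/s")

# Prints roughly: 3B ~ 46 tokens/s, 8B ~ 17 tokens/s

So ~10 tokens/s on an 8B bf16 model through transformers is already in the expected range for this hardware, and the 3B run sitting well below its ceiling is consistent with framework overhead rather than a hardware fault. Quantization (or an MoE model with fewer active parameters) is what actually raises the ceiling.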

You can use techniques such as quantization and speculative decoding (e.g., EAGLE3), but these may impact accuracy; speculative decoding theoretically does not affect accuracy, yet in practice it can influence the generated results. I agree with @eugr that MoE-based hybrid models (Transformer + Mamba) are a good choice for DGX Spark and NVIDIA Jetson Thor. I personally recommend NVFP4 models such as nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4, the model served in the command below.

For interactive chat applications with low-latency requirements (batch sizes of 1–16), I recommend using SGLang (GitHub - sgl-project/sglang: SGLang is a high-performance serving framework for large language models and multimodal models). If you need to handle moderate workloads with high throughput, vLLM (GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs) is a better choice.

Here is the command to run it on DGX Spark using vLLM:

sudo docker run -it --rm \
  --pull always \
  --runtime=nvidia \
  --network host \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  nvcr.io/nvidia/vllm:26.01-py3 \
  bash -c "wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py && \
  vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --kv-cache-dtype fp8"
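
Once the server is up, vLLM exposes an OpenAI-compatible API (port 8000 by default), so a quick sanity check from Python looks roughly like this (assuming the default port and the model name served above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
    messages=[{"role": "user", "content": "Say hello from the DGX Spark."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)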

Hope it helps!!

Great tip, however I’m seeing this (after the docker image downloads), using the same command line as above:

~/Development/_custom_docker_images/vllm_updated$ ./run_NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4.sh
26.01-py3: Pulling from nvidia/vllm
Digest: sha256:e497b1248ad3d916673a3003524c667640590a0c6d49f7f1c573102673d02792
Status: Image is up to date for nvcr.io/nvidia/vllm:26.01-py3
docker: Error response from daemon: unknown or invalid runtime name: nvidia

Run ‘docker run --help’ for more information

~/Development/_custom_docker_images/vllm_updated

@LuckyChap please follow the installation instructions for the NVIDIA Container Toolkit and set the nvidia runtime as the default in the Docker daemon configuration file.
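
For reference, the end state of /etc/docker/daemon.json is what this small sketch writes (run it with sudo and restart Docker afterwards; sudo nvidia-ctk runtime configure --runtime=docker from the toolkit performs the runtime registration for you):

import json
from pathlib import Path

# Register the nvidia runtime and make it the default for "docker run".
path = Path("/etc/docker/daemon.json")
cfg = json.loads(path.read_text()) if path.exists() else {}
cfg.setdefault("runtimes", {})["nvidia"] = {
    "path": "nvidia-container-runtime",
    "runtimeArgs": [],
}
cfg["default-runtime"] = "nvidia"
path.write_text(json.dumps(cfg, indent=2) + "\n")

# Then: sudo systemctl restart docker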

Hope it helps!!!

For anyone else who lands on this thread later:

NVIDIA has now updated a number of their official quants under the umbrella Hugging Face collection “Inference Optimized Checkpoints (with Model Optimizer)”.

The READMEs in several have been updated within the last week, some within the last 2 days. This particular quant now has Spark-specific instructions for using the official NGC vLLM container, and the officially recommended start command is similar but not identical to the one posted by shahizat above.

See “Use it with vLLM” and “Use it with TensorRT-LLM”

I’m pulling the new TensorRT-LLM container now, and would encourage some exploration and benchmarking of these and a couple of other NVFP4 models (e.g., Qwen3-Next) under both vLLM and TensorRT-LLM.

1 Like