Hello, all! I recently received my DGX Spark (Founder’s Edition) and am in desperate need of advice.
I’m getting much worse performance than I expected from a custom inference benchmark script that uses Hugging Face Transformers in Python. I’m consistently getting between 14 and 20 tokens/s with a Llama 3B model and around 10 tokens/s with Llama 8B. Both runs use bf16 precision with FlashAttention-2 inside the NGC PyTorch Docker container. I’ve never seen power draw exceed ~25 W (as reported by nvidia-smi), and the DGX Dashboard reports ~95% GPU utilization during these benchmarks.
My first question: is this normal? Second, is there a more standard way to compare my machine’s performance against others with the same hardware? If so, what benchmark would be more appropriate? My concern is some kind of driver/firmware issue or, worse, a hardware defect, so I want to rule those out.
My situation is complicated by the fact that I’ve blown my data budget for the month downloading models, updates, etc., so I won’t be able to download anything significant for the next 10 days. I can maybe manage ~3 GB worth of downloads max.
I purchased this machine to do high-level research (experimenting with training/fine-tuning workflows, agentic architectures, etc.). I am not a hardware guy at all, so I’m looking for advice from people with more experience there.
Will gladly post any logs, outputs, or anything else that would help. Thanks in advance!
Please install and run Field Diagnostic, which is designed to validate your Spark’s hardware health. DM me the logs and we’ll confirm whether your hardware is healthy.
Don’t use Transformers to run models; it’s slow and unoptimized for inference. Use vLLM, llama.cpp, or SGLang.
Don’t use BF16 models; there is no practical benefit. Use either FP8 (for smaller models) or AWQ 4-bit / GGUF 4-bit for larger ones. MoE models are a sweet spot for Spark.
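If you want a number that other Spark owners can reproduce directly, llama.cpp’s llama-bench is the usual tool for that kind of comparison. A minimal sketch follows; the GGUF path is just a placeholder for whatever 4-bit model you grab (a 4-bit 3B GGUF is roughly 2 GB, so it fits your remaining data budget):

# defaults measure prompt processing (pp512) and token generation (tg128)
llama-bench -m /path/to/model-Q4_K_M.gguf

The tg (token generation) tokens/s figure is the one to compare against the 10–20 tokens/s you’re seeing from Transformers.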
Just FYI, you are trying to run dense models on a single-board computer with LPDDR5X memory (273 GB/s of bandwidth). The decode phase of transformer-based autoregressive LLMs is memory-bound. I highly recommend reading this paper: https://www.arxiv.org/pdf/2601.05047
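To put rough numbers on that (back-of-envelope only, ignoring KV-cache traffic): an 8B model in bf16 is about 8e9 params × 2 bytes ≈ 16 GB of weights, and decode has to stream essentially all of them for every generated token, so the hard ceiling is roughly 273 GB/s ÷ 16 GB per token ≈ 17 tokens/s. Your ~10 tokens/s is therefore in the expected range once framework overhead is counted, and the low power draw fits the same picture: the GPU spends most of its time waiting on memory rather than computing. It is also why the quantized formats suggested above help so much; a 4-bit build of the same model is around 4–5 GB, which lifts the ceiling into the 55–70 tokens/s range.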
You can use techniques such as quantization and speculative decoding (e.g., EAGLE3), but these may impact accuracy: speculative decoding is theoretically lossless, yet in practice it can still influence the generated outputs. I agree with @eugr that MoE-based hybrid models (Transformer + Mamba) are a good choice for DGX Spark and NVIDIA Jetson Thor. I personally recommend NVFP4 models like:
Great tip, however I’m seeing this after the Docker image downloads, using the same command line as above:
~/Development/_custom_docker_images/vllm_updated$ ./run_NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4.sh
26.01-py3: Pulling from nvidia/vllm
Digest: sha256:e497b1248ad3d916673a3003524c667640590a0c6d49f7f1c573102673d02792
Status: Image is up to date for nvcr.io/nvidia/vllm:26.01-py3
docker: Error response from daemon: unknown or invalid runtime name: nvidia
Run ‘docker run --help’ for more information
~/Development/_custom_docker_images/vllm_updated
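That error usually means the Docker daemon doesn’t have the nvidia runtime registered, i.e. the NVIDIA Container Toolkit is either missing or not wired into Docker; that diagnosis is an assumption based only on the message above. If the toolkit is already installed, re-registering the runtime and restarting Docker is normally enough:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker info | grep -i runtimes   # “nvidia” should now appear in the list

If nvidia-ctk isn’t found on the box, the nvidia-container-toolkit package needs installing first.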
The READMEs in several of these model repos have been updated within the last week, some within the last two days. This particular quant now has Spark-specific instructions to use the official NGC vLLM container, and the officially recommended start command is similar, but not identical, to the one shahizat posted above.
See the “Use it with vLLM” and “Use it with TensorRT-LLM” sections.
I’m pulling the new TensorRT-LLM container now, and would encourage some exploration and benchmarking of these and a couple of other NVFP4 models (e.g., Qwen3-Next) under both vLLM and TensorRT-LLM.