I run models with vLLM on Docker at high utilization. I suggest you switch from ollama to llama.cpp; it's much more performant.
Sample with vLLM: Running nvidia/Nemotron-Nano-VL-12B-V2-NVFP4-QAD on your spark
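For reference, a minimal sketch of that kind of vLLM launch on Docker (the image tag and flags here are assumptions; the linked sample has the exact Spark-ready invocation):
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
vllm/vllm-openai:latest \
--model nvidia/Nemotron-Nano-VL-12B-V2-NVFP4-QAD
That serves an OpenAI-compatible API on port 8000.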
To build llama.cpp from source, first install the development tools:
sudo apt install clang cmake libcurl4-openssl-dev
Check out llama.cpp:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
Build:
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j 20
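Optionally, sanity-check the result by printing the build info (assuming the build succeeded, the binary lands in build/bin):
build/bin/llama-server --version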
Serve gpt-oss-120b:
build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF \
-fa on -ngl 999 \
--jinja \
--ctx-size 0 \
-b 2048 -ub 2048 \
--no-mmap \
--temp 1.0 \
--top-p 1.0 \
--top-k 0 \
--reasoning-format auto \
--chat-template-kwargs "{\"reasoning_effort\": \"medium\"}"
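Once it's running, llama-server exposes an OpenAI-compatible API (port 8080 by default), so a quick smoke test looks like:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Say hi in one sentence."}]}'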
Expected performance numbers with 1 or 2 sparks: