Models not using Spark GPU?

I followed https://build.nvidia.com/spark/open-webui/sync to set up Ollama and Open WebUI. I start it from NVIDIA Sync and have been testing models from Ollama, but none of the models uses the GPU, and they are extremely slow at generating anything. I can tell because nvidia-smi and DGX Dashboard show 0% GPU usage for every model I try, while btop reports 100% CPU usage.

I verified Docker is running correctly with the NVIDIA Container Runtime. The only time I've seen the GPU used so far is during ComfyUI testing, but that setup doesn't rely on Docker.
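One quick sanity check (the CUDA base image tag here is just an example, any recent one works) is to confirm the container runtime actually exposes the GPU to containers at all:

```shell
# Should print the same GPU table you see with nvidia-smi on the host.
# If this fails, the NVIDIA Container Runtime isn't wired up for GPU passthrough.
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

If that works but Ollama still runs on CPU, the Ollama container itself may have been started without `--gpus all` (or the equivalent compose setting).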

So what am I doing wrong with the Ollama + Open WebUI Docker container?
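One more thing worth checking: Ollama itself reports where a loaded model is running. Assuming the container is named `ollama` (adjust to whatever name your compose setup uses):

```shell
# The PROCESSOR column of `ollama ps` shows whether the loaded model
# is running as e.g. "100% GPU" or "100% CPU".
docker exec -it ollama ollama ps
```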

Thanks!

I can run models using vLLM in Docker with high GPU utilization. I'd suggest switching from Ollama to llama.cpp; it's much more performant.

Sample with vLLM: Running nvidia/Nemotron-Nano-VL-12B-V2-NVFP4-QAD on your spark
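For reference, the vLLM path boils down to serving the model through vLLM's OpenAI-compatible server. This is only a rough sketch; the exact image and flags in the linked post take precedence:

```shell
# Sketch only: serve the model with vLLM's OpenAI-compatible server.
# Defer to the linked post for the Spark-specific container and flags.
vllm serve nvidia/Nemotron-Nano-VL-12B-V2-NVFP4-QAD --port 8000
```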

To build llama.cpp from source, first install the development tools:

sudo apt install clang cmake libcurl4-openssl-dev

Check out llama.cpp:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Build:

cmake -B build -DGGML_CUDA=ON -DGGML_CURL=ON
cmake --build build --config Release -j 20

Serve gpt-oss-120b:

build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF \
-fa on -ngl 999 \
--jinja \
--ctx-size 0 \
-b 2048 -ub 2048 \
--no-mmap \
--temp 1.0 \
--top-p 1.0 \
--top-k 0 \
--reasoning-format auto \
--chat-template-kwargs "{\"reasoning_effort\": \"medium\"}"
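Once llama-server is up (it listens on port 8080 by default and exposes an OpenAI-compatible API), you can smoke-test it with curl before wiring anything else to it:

```shell
# Minimal chat-completion request against llama-server's OpenAI-compatible endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one word."}]}'
```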

Expected performance numbers with 1 or 2 Sparks:

Hmm, thanks. I need to read up more on llama.cpp. Can llama.cpp work with Open WebUI?

Yes, you can use it, and llama.cpp also ships an equivalent web UI of its own. Ollama and LM Studio both use llama.cpp under the hood, but Ollama is clearly underperforming on the Spark right now, so just ignore it for now.
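If you want to keep Open WebUI in front of it: llama-server speaks the OpenAI API, so you can point Open WebUI at it as an OpenAI-compatible backend. A sketch, assuming llama-server is running on the host at port 8080 (the volume and container names are just examples):

```shell
# Run Open WebUI and point it at llama-server's OpenAI-compatible endpoint.
# host.docker.internal lets the container reach a server on the host.
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
  -e OPENAI_API_KEY=none \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main
```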

Those numbers are for vLLM. I can compile a similar table for llama.cpp, but as a single data point, I get 58 t/s from gpt-oss-120b. It doesn't scale as well on the cluster, though: you actually lose performance, but you're able to run larger models.

Yes, you're going to lose performance because you won't be using NCCL with two nodes and llama.cpp.

Interesting. Thanks all. I had no idea Ollama was that bad; I thought these models were simply too big to fit into the Spark's VRAM. How big a model (120B, 20B, etc.) can the Spark's VRAM handle?

It really depends on how dense the model is, its architecture, and its quantization. GPT-OSS 120B, which is a MoE architecture, runs smoothly on the Spark with 4-bit quantization. Llama 3 70B, a dense model, will be much slower.
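As a rough rule of thumb, weight memory ≈ parameter count × bits per weight / 8, plus KV cache and activation overhead; the Spark's 128 GB of unified memory then frames what fits:

```shell
# Back-of-envelope weight sizes in GB (params in billions x bits / 8),
# ignoring KV cache and runtime overhead:
echo "$((120 * 4 / 8)) GB"   # gpt-oss-120b at 4-bit: 60 GB of weights
echo "$((70 * 16 / 8)) GB"   # Llama 3 70B at fp16: 140 GB -- exceeds 128 GB
echo "$((70 * 4 / 8)) GB"    # Llama 3 70B at 4-bit: 35 GB -- fits, but dense, so slower
```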


It would be nice if we had a reference table somewhere (centrally) for guidance. Based on a reference LLM (e.g., gpt-oss-120b/Q4_K_XL), a direct comparison of the performance of llama.cpp, vllm, and sglang for a) one spark, b) two sparks (cluster), and c) x sparks (super cluster) would be quite useful (nvidia?). Based on this, we could also include the model-specific optimized parameters in the discussion.


How NVIDIA DGX Spark’s Performance Enables Intensive AI Tasks | NVIDIA Technical Blog.

