Models not using Spark GPU?

I followed https://build.nvidia.com/spark/open-webui/sync to set up Ollama and Open WebUI. I start it from NVIDIA Sync and have been testing models from Ollama, but none of the models uses the GPU, and they are extremely slow at generating anything. I can tell because nvidia-smi and DGX Dashboard show 0% GPU usage for every model I try, while btop reports 100% CPU usage.

I verified Docker is running correctly with the NVIDIA Container Runtime. The only time I've seen the GPU used so far is during ComfyUI testing, but that setup doesn't rely on Docker.
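One quick sanity check (the CUDA base image tag here is just an example, any recent one works) is to confirm the container runtime actually exposes the GPU to containers at all:

```shell
# Should print the same GPU table you see with nvidia-smi on the host.
# If this fails, the NVIDIA Container Runtime isn't wired up for GPU passthrough.
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

If that works but Ollama still runs on CPU, the Ollama container itself may have been started without `--gpus all` (or the equivalent compose setting).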

So what am I doing wrong with the Ollama + Open WebUI Docker container?
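One more thing worth checking: Ollama itself reports where a loaded model is running. Assuming the container is named `ollama` (adjust to whatever name your compose setup uses):

```shell
# The PROCESSOR column of `ollama ps` shows whether the loaded model
# is running as e.g. "100% GPU" or "100% CPU".
docker exec -it ollama ollama ps
```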

Thanks!

I can run models using vLLM in Docker with high GPU utilization. I'd suggest switching from Ollama to llama.cpp; it's much more performant.

Sample with vLLM: Running nvidia/Nemotron-Nano-VL-12B-V2-NVFP4-QAD on your spark
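For reference, the vLLM path boils down to serving the model through vLLM's OpenAI-compatible server. This is only a rough sketch; the exact image and flags in the linked post take precedence:

```shell
# Sketch only: serve the model with vLLM's OpenAI-compatible server.
# Defer to the linked post for the Spark-specific container and flags.
vllm serve nvidia/Nemotron-Nano-VL-12B-V2-NVFP4-QAD --port 8000
```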

To build llama.cpp from source, first install the development tools:

sudo apt install clang cmake libcurl4-openssl-dev

Check out llama.cpp:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Build:

cmake -B build -DGGML_CUDA=ON -DGGML_CURL=ON
cmake --build build --config Release -j 20

Serve gpt-oss-120b:

build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF \
-fa on -ngl 999 \
--jinja \
--ctx-size 0 \
-b 2048 -ub 2048 \
--no-mmap \
--temp 1.0 \
--top-p 1.0 \
--top-k 0 \
--reasoning-format auto \
--chat-template-kwargs "{\"reasoning_effort\": \"medium\"}"
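Once llama-server is up (it listens on port 8080 by default and exposes an OpenAI-compatible API), you can smoke-test it with curl before wiring anything else to it:

```shell
# Minimal chat-completion request against llama-server's OpenAI-compatible endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one word."}]}'
```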

Expected performance numbers with 1 or 2 Sparks:

Hmm, thanks. I need to read up more on llama.cpp. Can llama.cpp work with Open WebUI?

Yes, you can use it, and llama.cpp also ships an equivalent web UI of its own. Ollama and LM Studio both use llama.cpp under the hood, but Ollama is clearly underperforming on the Spark right now, so just ignore it for now.
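If you want to keep Open WebUI in front of it: llama-server speaks the OpenAI API, so you can point Open WebUI at it as an OpenAI-compatible backend. A sketch, assuming llama-server is running on the host at port 8080 (the volume and container names are just examples):

```shell
# Run Open WebUI and point it at llama-server's OpenAI-compatible endpoint.
# host.docker.internal lets the container reach a server on the host.
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
  -e OPENAI_API_KEY=none \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main
```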

Those numbers are for vLLM. I can compile a similar table for llama.cpp, but as a single data point, I get 58 t/s from gpt-oss-120b. It doesn't scale as well on the cluster, though: you actually lose performance, but you're able to run larger models.

Yes, you're going to lose performance because you won't be using NCCL with two nodes and llama.cpp.

Interesting. Thanks all. I had no idea Ollama was that bad; I thought these models were simply too big to fit into the Spark's VRAM. How big a model (120B, 20B, etc.) can the Spark's VRAM handle?

It really depends on how dense the model is, its architecture, and its quantization. GPT-OSS 120B, which is a MoE architecture, runs smoothly on the Spark with 4-bit quantization. Llama 3 70B, a dense model, will be much slower.
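As a rough rule of thumb, weight memory ≈ parameter count × bits per weight / 8, plus KV cache and activation overhead; the Spark's 128 GB of unified memory then frames what fits:

```shell
# Back-of-envelope weight sizes in GB (params in billions x bits / 8),
# ignoring KV cache and runtime overhead:
echo "$((120 * 4 / 8)) GB"   # gpt-oss-120b at 4-bit: 60 GB of weights
echo "$((70 * 16 / 8)) GB"   # Llama 3 70B at fp16: 140 GB -- exceeds 128 GB
echo "$((70 * 4 / 8)) GB"    # Llama 3 70B at 4-bit: 35 GB -- fits, but dense, so slower
```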


It would be nice if we had a reference table somewhere (centrally) for guidance. Based on a reference LLM (e.g., gpt-oss-120b/Q4_K_XL), a direct comparison of the performance of llama.cpp, vllm, and sglang for a) one spark, b) two sparks (cluster), and c) x sparks (super cluster) would be quite useful (nvidia?). Based on this, we could also include the model-specific optimized parameters in the discussion.


How NVIDIA DGX Spark’s Performance Enables Intensive AI Tasks | NVIDIA Technical Blog.

