No answer from me, but I experienced similarly poor performance using Ollama with gpt-oss 20b and 120b on the Spark. Switching to LM Studio was much faster.
I also tried SGLang but couldn't beat LM Studio's performance. I'd be interested to know if there is an obvious explanation.
There is very little reason to use Ollama these days.
Llama.cpp is faster, has a decent built-in web UI, is under very active development, and just introduced first-party on-demand model switching: New in llama.cpp: Model Management
llama-swap still offers more granular control and can manage vLLM and other inference engines, but it's great to have this functionality built in.
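For reference, from the client side the on-demand switching looks the same whether it's llama-swap or llama-server's built-in model management sitting behind the OpenAI-compatible API: you list the models and simply request the one you want, and the loading happens for you. A rough sketch, assuming a server on localhost:8080; the port and model id are placeholders for whatever your own config exposes:

```python
# Minimal sketch of client-side model switching against an OpenAI-compatible
# endpoint (llama-server with model management, or llama-swap in front of it).
# URL, port, and model id below are placeholders for your own setup.
import requests

BASE_URL = "http://localhost:8080"

# Both expose the standard OpenAI-style model listing.
models = requests.get(f"{BASE_URL}/v1/models").json()
print("Available models:", [m["id"] for m in models.get("data", [])])

# Requesting a model that isn't currently loaded is what triggers the swap:
# the server/proxy loads it on demand before answering.
resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "gpt-oss-20b",  # placeholder model id from your config
        "messages": [{"role": "user", "content": "Say hello in one word."}],
    },
    timeout=600,  # the first request after a swap can be slow while the model loads
)
print(resp.json()["choices"][0]["message"]["content"])
```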
Just to share my experience: Ollama's Docker image has serious performance issues when running on the DGX Spark. After reinstalling it locally instead of using the Docker image, performance went back to normal.
So if your Ollama is running from the Docker image, I think this is the cause. Also, even now the web UI somehow has weird issues when using images; I'm still looking for a solution :)
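If anyone wants to compare their own Docker vs local install, something like the snippet below gives a rough tokens/sec number from Ollama's own timing fields. It's just a sketch: the model name and prompt are examples, and it assumes the default Ollama port 11434.

```python
# Rough tokens/sec check against Ollama's API, to compare installs
# (e.g. Docker image vs local install). Model name and prompt are examples.
import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama port

resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={
        "model": "gpt-oss:20b",   # whatever model you have pulled
        "prompt": "Write one sentence about the DGX Spark.",
        "stream": False,          # return a single JSON object with timing fields
    },
    timeout=600,
).json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds
tok_per_sec = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{resp['eval_count']} tokens in {resp['eval_duration'] / 1e9:.1f}s "
      f"-> {tok_per_sec:.1f} tok/s")
```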
I’m very curious about this. I’m trying to do remote LLM testing against OpenAI (or Ollama). I’m currently testing /embeddings, and the llama.cpp responses differ from OpenAI’s (the JSON formatting is different), so it’s not really “OpenAI compatible” yet.
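Roughly what my comparison looks like, in case it helps: send the same input to the local llama.cpp server and to OpenAI via /v1/embeddings and diff the top-level JSON layout. The URLs, port, and model names here are just placeholders for my setup, not anything canonical.

```python
# Sketch: compare the /v1/embeddings response layout between a local
# llama.cpp server and OpenAI. URLs, port, and model names are placeholders.
import os
from typing import Optional

import requests

def embedding_shape(base_url: str, model: str, api_key: Optional[str] = None) -> dict:
    """POST an OpenAI-style embeddings request and summarize the response layout."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    resp = requests.post(
        f"{base_url}/v1/embeddings",
        headers=headers,
        json={"model": model, "input": "hello world"},
        timeout=60,
    ).json()
    return {
        "top_level_keys": sorted(resp.keys()),
        "first_data_keys": sorted(resp["data"][0].keys()) if "data" in resp else None,
        "dims": len(resp["data"][0]["embedding"]) if "data" in resp else None,
    }

# Local llama.cpp server, started with embeddings enabled (port is an example).
print("llama.cpp:", embedding_shape("http://localhost:8080", "local-embedding-model"))

# Real OpenAI endpoint, for the reference layout.
if os.environ.get("OPENAI_API_KEY"):
    print("openai   :", embedding_shape("https://api.openai.com", "text-embedding-3-small",
                                        api_key=os.environ["OPENAI_API_KEY"]))
```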
I would try Ollama if it can perform at least close to the llama.cpp server. What’s the best way to install Ollama locally (not via Docker)?