Very poor performance with Ollama on DGX Spark – looking for help

Hi everyone,

I installed Ollama on my DGX Spark to run the 20B gpt-oss model, and the performance is honestly terrible.

From my Mac, I run a small Python script that reads about ten lines from an Excel file and sends each line to Ollama using:

http://<DGX_IP>:11434/api/generate
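
Roughly, the script looks like the sketch below (the model tag, spreadsheet name, and column are placeholders, not my exact code); I'm also printing Ollama's own eval_count / eval_duration stats so the raw generation speed shows up separately from the wall-clock time:

```python
import time
import requests
from openpyxl import load_workbook

OLLAMA_URL = "http://<DGX_IP>:11434/api/generate"  # placeholder, same endpoint as above
MODEL = "gpt-oss:20b"                              # placeholder model tag

wb = load_workbook("input.xlsx")                   # placeholder file name
rows = [row[0].value for row in wb.active.iter_rows(min_row=1, max_row=10)
        if row[0].value is not None]

for text in rows:
    t0 = time.time()
    r = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "prompt": str(text),
        "stream": False,   # one JSON object per request instead of a token stream
    })
    r.raise_for_status()
    stats = r.json()
    # eval_count / eval_duration (nanoseconds) come from Ollama's response stats
    tok_per_s = stats["eval_count"] / (stats["eval_duration"] / 1e9)
    print(f"{time.time() - t0:.1f}s wall, {tok_per_s:.1f} tok/s generation")
```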

Everything runs through a Docker container, with only one Ollama instance active.

What I’m seeing:

  • For each generation, GPU usage jumps to around 89%.

  • Despite this high usage, latency is very bad.

  • Processing just 10 lines → 10 requests takes far longer than expected.

  • The 20B model performs nowhere near what I’d expect on hardware like a DGX Spark.

My question:

Am I missing something in the Ollama or container configuration?
Has anyone else experienced similar behavior on DGX systems or other GPU platforms?

Thanks in advance for any insights or feedback.

No answer from me, but I experienced similarly poor performance using Ollama with gpt-oss-20b and 120b on the Spark. Switching to LM Studio was much faster.

I also tried SGLang but couldn’t beat the performance of LM Studio.

Interested to know if there is an obvious explanation.

@ibrunton_smith @deeduckme The Ollama image is slow, please try this: DGX Spark is extremely slow on a short LLM test - #5 by cosinus

I posted another thread: performance - much better with llama.cpp

There is very little reason to use Ollama these days.

Llama.cpp is faster, has a decent built-in webui, is under very active development, and just introduced first-party on-demand model switching: New in llama.cpp: Model Management

llama-swap still allows more granular control and can also drive vLLM and other inference engines, but it’s great to have this functionality built in.
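
For anyone wanting to try it, the client side barely changes: llama-server speaks the OpenAI-style API, so something like the sketch below is all it takes. Host, port, and model name are placeholders, and as far as I understand the new model management (or llama-swap in front), the model field is what picks which model gets loaded:

```python
import requests

# Placeholders: adjust host/port to wherever llama-server runs (default port 8080)
LLAMA_URL = "http://<DGX_IP>:8080/v1/chat/completions"

resp = requests.post(LLAMA_URL, json={
    # With model management (or llama-swap), this name selects the model to load;
    # a single-model llama-server just serves whatever was loaded with -m.
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Hello from the Spark"}],
})
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```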


Ollama is a no-go nowadays for me. Completely pointless.


Just to share my experience: Ollama’s Docker image has serious performance issues when running on DGX Spark. After re-installing it locally instead of using the Docker image, the performance became normal.

So if your Ollama is running from the Docker image, I think that’s the cause. Also, somehow even now the web UI has weird issues when using images. I’m still looking for solutions :)
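
For anyone debugging the same thing, a quick way to see whether an instance actually put the model in VRAM is to query Ollama's /api/ps endpoint. The host is a placeholder and the field names below are how I read Ollama's API docs, so treat them as an assumption:

```python
import requests

# Placeholder host; same port the generate requests go to
resp = requests.get("http://<DGX_IP>:11434/api/ps")
resp.raise_for_status()

for m in resp.json().get("models", []):
    total = m.get("size", 0)
    vram = m.get("size_vram", 0)
    # If size_vram is 0 (or much smaller than size), the model fell back to CPU/RAM
    print(f"{m['name']}: {vram / 1e9:.1f} GB of {total / 1e9:.1f} GB in VRAM")
```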

I’m very curious about this. I’m trying to do OpenAI (or Ollama) remote LLM testing. Currently testing /embeddings, and the llama.cpp responses are formatted differently from an OpenAI response (JSON structure), so it’s not really “OpenAI compatible” yet (sketch of what I’m calling below).
I would try Ollama if it can perform at least close to the llama.cpp server. What’s the best way to install Ollama locally (not Docker)?
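
On the embeddings point, this is the kind of call I’m making. llama-server also exposes an OpenAI-style /v1/embeddings route (assuming the server was started with embeddings enabled), so it’s possible I’m just hitting the native endpoint instead; base URL and model name here are placeholders:

```python
import requests

BASE = "http://<DGX_IP>:8080"   # placeholder llama-server address

resp = requests.post(f"{BASE}/v1/embeddings", json={
    "model": "nomic-embed-text",    # placeholder embedding model name
    "input": "hello world",
})
resp.raise_for_status()
data = resp.json()
# OpenAI-shaped responses nest the vector under data[0]["embedding"];
# llama.cpp's native embedding endpoint uses a different JSON layout.
vector = data["data"][0]["embedding"]
print(len(vector))
```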

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.