No answer from me, but I experienced similarly poor performance using Ollama with gpt-oss 20b and 120b on the Spark. Switching to LM Studio was much faster.
I also tried SGLang but couldn't beat LM Studio's performance. I'd be interested to know if there is an obvious explanation.
There is very little reason to use Ollama these days.
Llama.cpp is faster, has a decent built-in web UI, is under very active development, and just introduced first-party on-demand model switching: New in llama.cpp: Model Management
llama-swap still offers more granular control and can manage vLLM and other inference engines, but it's great to have this functionality built in.
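For reference, from the client side the on-demand switching looks the same whether it's llama-swap or llama-server's built-in model management sitting behind the OpenAI-compatible API: you list the models and simply request the one you want, and the loading happens for you. A rough sketch, assuming a server on localhost:8080; the port and model id are placeholders for whatever your own config exposes:

```python
# Minimal sketch of client-side model switching against an OpenAI-compatible
# endpoint (llama-server with model management, or llama-swap in front of it).
# URL, port, and model id below are placeholders for your own setup.
import requests

BASE_URL = "http://localhost:8080"

# Both expose the standard OpenAI-style model listing.
models = requests.get(f"{BASE_URL}/v1/models").json()
print("Available models:", [m["id"] for m in models.get("data", [])])

# Requesting a model that isn't currently loaded is what triggers the swap:
# the server/proxy loads it on demand before answering.
resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "gpt-oss-20b",  # placeholder model id from your config
        "messages": [{"role": "user", "content": "Say hello in one word."}],
    },
    timeout=600,  # the first request after a swap can be slow while the model loads
)
print(resp.json()["choices"][0]["message"]["content"])
```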
Just to share my experience: Ollama's Docker image has serious performance issues when running on the DGX Spark. After reinstalling it locally instead of using the Docker image, performance went back to normal.
So if your Ollama is running from the Docker image, I think this is the cause. Also, even now the web UI somehow has weird issues when using images; I'm still looking for a solution :)
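If anyone wants to compare their own Docker vs local install, something like the snippet below gives a rough tokens/sec number from Ollama's own timing fields. It's just a sketch: the model name and prompt are examples, and it assumes the default Ollama port 11434.

```python
# Rough tokens/sec check against Ollama's API, to compare installs
# (e.g. Docker image vs local install). Model name and prompt are examples.
import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama port

resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={
        "model": "gpt-oss:20b",   # whatever model you have pulled
        "prompt": "Write one sentence about the DGX Spark.",
        "stream": False,          # return a single JSON object with timing fields
    },
    timeout=600,
).json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds
tok_per_sec = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{resp['eval_count']} tokens in {resp['eval_duration'] / 1e9:.1f}s "
      f"-> {tok_per_sec:.1f} tok/s")
```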
I’m very curious about this. I’m trying to do remote LLM testing against OpenAI (or Ollama). I’m currently testing /embeddings, and the llama.cpp responses differ from OpenAI’s (the JSON formatting is different), so it’s not really “OpenAI compatible” yet.
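Roughly what my comparison looks like, in case it helps: send the same input to the local llama.cpp server and to OpenAI via /v1/embeddings and diff the top-level JSON layout. The URLs, port, and model names here are just placeholders for my setup, not anything canonical.

```python
# Sketch: compare the /v1/embeddings response layout between a local
# llama.cpp server and OpenAI. URLs, port, and model names are placeholders.
import os
from typing import Optional

import requests

def embedding_shape(base_url: str, model: str, api_key: Optional[str] = None) -> dict:
    """POST an OpenAI-style embeddings request and summarize the response layout."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    resp = requests.post(
        f"{base_url}/v1/embeddings",
        headers=headers,
        json={"model": model, "input": "hello world"},
        timeout=60,
    ).json()
    return {
        "top_level_keys": sorted(resp.keys()),
        "first_data_keys": sorted(resp["data"][0].keys()) if "data" in resp else None,
        "dims": len(resp["data"][0]["embedding"]) if "data" in resp else None,
    }

# Local llama.cpp server, started with embeddings enabled (port is an example).
print("llama.cpp:", embedding_shape("http://localhost:8080", "local-embedding-model"))

# Real OpenAI endpoint, for the reference layout.
if os.environ.get("OPENAI_API_KEY"):
    print("openai   :", embedding_shape("https://api.openai.com", "text-embedding-3-small",
                                        api_key=os.environ["OPENAI_API_KEY"]))
```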
I would try Ollama if it can perform at least close to the llama.cpp server. What’s the best way to install Ollama locally (not via Docker)?