I followed https://build.nvidia.com/spark/open-webui/sync to set up Ollama and Open WebUI. I start it from NVIDIA Sync and have been testing models from Ollama, but none of the models use the GPU, and generation is extremely slow. I can tell because nvidia-smi and the DGX Dashboard show 0% GPU usage for every model I try, while btop reports 100% CPU usage.
I verified Docker is running correctly with the NVIDIA Container Runtime. The only time I've seen the GPU used so far is during ComfyUI testing, but that setup doesn't rely on Docker.
So what am I doing wrong with Ollama in the Open WebUI Docker container?
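In case it helps with diagnosis, here's roughly how I've been sanity-checking GPU passthrough; a minimal sketch, assuming the container is named `ollama` (the Sync-managed setup may name it differently):

```bash
# Confirm the NVIDIA Container Runtime exposes the GPU inside a container
docker run --rm --gpus all ubuntu nvidia-smi

# Standard Ollama Docker invocation with GPU access
# (the Sync-managed container may use different names/flags)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Check whether Ollama actually detected the GPU at startup
docker logs ollama 2>&1 | grep -i gpu
```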
Yes, you can use it, and llama.cpp also ships an equivalent web UI. Ollama and LM Studio both use llama.cpp under the hood, but Ollama is clearly underperforming on the Spark right now, so I'd just ignore it for now.
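If you want to try that route, a minimal sketch of serving a GGUF model with llama.cpp's built-in web UI (the model path is a placeholder; `-ngl 99` offloads all layers to the GPU):

```bash
# llama-server bundles a browser UI at the root URL
llama-server -m /path/to/gpt-oss-120b-Q4_K_XL.gguf -ngl 99 --host 0.0.0.0 --port 8080
# then open http://localhost:8080 in a browser
```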
Those numbers are for vLLM. I could compile a similar table for llama.cpp, but just as a single data point, I get 58 t/s from gpt-oss-120b. It doesn't scale as well on the cluster, though - you actually lose per-token performance, but you're able to run larger models.
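For anyone who wants to reproduce that kind of number, a sketch using llama.cpp's bundled benchmark tool (model path again a placeholder):

```bash
# -p: prompt-processing test length, -n: token-generation test length,
# -ngl 99: offload all layers to the GPU
llama-bench -m /path/to/gpt-oss-120b-Q4_K_XL.gguf -ngl 99 -p 512 -n 128
```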
Interesting. Thanks all. I had no idea Ollama was that sucky; I thought the models were simply too big to fit into the Spark's VRAM. How big a model (120B, 20B, etc.) can the Spark's memory handle?
It really depends on how dense the model is, its architecture, and its quantization. GPT-OSS 120B, which is a MoE architecture, runs smoothly on the Spark with 4-bit quantization; Llama 3 70B, which is dense, will be much slower.
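A rough rule of thumb: weight memory ≈ parameter count × bytes per weight. So gpt-oss-120b at 4-bit is about 120B × 0.5 bytes ≈ 60 GB of weights, which fits in the Spark's 128 GB of unified memory with room left over for KV cache. Llama 3 70B at 4-bit is only ≈ 35 GB, but being dense, all 70B parameters are read for every token, whereas the MoE only activates roughly 5B per token - which is why the bigger model still generates faster.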
It would be nice if we had a central reference table somewhere for guidance. Based on a reference LLM (e.g., gpt-oss-120b at Q4_K_XL), a direct comparison of llama.cpp, vLLM, and SGLang performance on a) one Spark, b) two Sparks (cluster), and c) x Sparks (super cluster) would be quite useful (NVIDIA?). From there, we could also fold the model-specific optimized parameters into the discussion.